Better, More Efficient Neural Inferencing Means Embracing Change
Artificial intelligence is running into some age-old engineering tradeoffs.
For example, increasing the accuracy of neural networks typically means increasing the size of the networks. That requires more compute resources and leads to increased power consumption. The size of neural networks doubles every 3.5 months. Left unchecked, data centers are forecast to consume 15 percent of the world's power output in the future, driven by the demand for AI.
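To put the doubling figure in perspective, here is a minimal Python sketch of what a 3.5-month doubling period would compound to over a few years. It assumes strictly exponential growth at the rate quoted above, which is a simplification; the function name is illustrative only.

```python
# Assumption: network size doubles exactly every 3.5 months (figure quoted
# in the text) and keeps doing so; real growth is unlikely to be this regular.
DOUBLING_PERIOD_MONTHS = 3.5


def growth_factor(months: float, doubling_period: float = DOUBLING_PERIOD_MONTHS) -> float:
    """Return the multiplicative growth after `months` of steady doubling."""
    return 2.0 ** (months / doubling_period)


if __name__ == "__main__":
    for years in (1, 2, 3):
        print(f"after {years} year(s): about {growth_factor(12 * years):,.0f}x larger")
```

Under that assumption, a single year works out to a little under an 11-fold increase, which is why the power question cannot be deferred.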
While training workloads consume a lot of compute, training is a finite problem, happening in sporadic batches as new data arrives and new models are trained. Inference, on the other hand, tends to be used continuously and has demanding latency requirements. For example, data centers that provide image analysis need to expand the number of instances to handle peak demand on those data-rich files, especially if there are customers with mission-critical applications.
Currently most of the compute done for neural network inference is based on the von Neumann architecture – usually in the form of CPUs and GPUs. The von Neumann architecture has served generalized compute needs for more than a half-century. But with the advent of AI workloads and the slowing of Moore’s Law, a new compute architecture is needed to efficiently perform AI inferencing.
Solving this dilemma is important, as inference acceleration in the data center using AI accelerators is estimated to be a market worth more than $25 billion by 2025.
Using GPUs for neural network compute was a major innovation and caused AI to blossom in the mid-2010s. But as the compute loads expanded, the only solution was to scale with more and more massively parallel graphics processors or other specialized von Neumann-like architectures seen in custom chips designed by large cloud service providers.
But with the explosion of data and the rapidly expanding interest in AI applications, questions have arisen about whether the tried-and-true approach will scale, because it is expensive when it comes to energy.
The bottleneck in scaling these technologies lies in memory. Moving the large volumes of coefficients (weights) and data involved between memory and the processor wastes a great deal of energy in the traditional von Neumann approach. In current architectures, 90 percent of the energy for AI workloads is consumed by data movement: transferring the weights and activations from external memory, through on-chip caches and finally to the computing element itself.
There are two reasons for this. One is that neural net inference workloads are shaped differently from legacy compute requirements. In a traditional workload, a processor fetches a single, relatively small piece of data and does a large portion of the desired computation using it. With neural nets, it is the opposite: every step requires parallel computation on a large amount of data, with each piece of data contributing little to the conclusion.
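One way to see this difference in shape is to count how much arithmetic each fetched weight supports. The Python sketch below does this for a single fully connected layer; the layer dimensions and function name are hypothetical, chosen only for illustration, and the counting ignores activations and any caching.

```python
def macs_per_weight(in_features: int, out_features: int, batch_size: int) -> float:
    """Multiply-accumulates performed per weight fetched, for one dense layer.

    A dense layer holds in_features * out_features weights and performs
    one multiply-accumulate per weight for every input in the batch.
    """
    weights = in_features * out_features
    macs = batch_size * in_features * out_features
    return macs / weights


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 layer, not taken from any specific network.
    for batch in (1, 8, 64):
        ratio = macs_per_weight(4096, 4096, batch)
        print(f"batch {batch}: {ratio:.0f} MAC(s) per weight fetched")
```

At a batch size of one, which is typical when latency matters, each of the roughly 16.8 million weights in this hypothetical layer is fetched to perform a single multiply-accumulate, so the cost of moving the weight can easily dominate the cost of using it.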
Also, while transistors have shrunk in area by many orders of magnitude, the wire lengths have shrunk only linearly, and the overall size of high-end processors remains roughly the same. That means the energy used inside a chip has transitioned from being dominated by the transistors doing the computation to the wires that get the data to them.
To scale AI and ML and lay the foundation for new, more powerful applications, we need to approach the problem in a fundamentally different way by focusing on data movement. Distance is energy, and every feature an engineer adds swells the silicon area. This means data must move farther, which increases latency, and the overall cost of the chip rises because of the larger real estate.
Improving Latency While Scaling Compute Efficiency
Put another way, innovators in the AI space want to improve the amount of computing they can achieve within a power envelope. By doing more computing, they can do better inference and therefore deliver more powerful applications.
In a von Neumann architecture, after the expensive memory read, the data must still flow through a high-speed memory bus with the associated energy cost of sending it that distance. The energy cost of high-speed SRAM and long, fast buses means that, even in the most advanced processes, traditional architectures might perform no better than 2.5 TOPs per watt, before factoring in the cost of DRAM access. The energy cost for retrieving coefficient data gets even higher if you need to go off chip to DRAM. Most companies do not include this energy in their advertised calculations, but it is a large energy cost, and DRAM access is essential to making use of their architecture.
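An efficiency figure in TOPs per watt can also be read as an energy budget per operation, which makes the cost of data movement easier to reason about. The following Python sketch does that unit conversion for the 2.5 TOPs per watt figure above; the function name is illustrative, and the result covers only what the quoted figure covers (it excludes DRAM access, as noted).

```python
def picojoules_per_op(tops_per_watt: float) -> float:
    """Convert an efficiency in tera-operations per second per watt
    into the average energy per operation, in picojoules.

    One watt is one joule per second, so TOPs/W is tera-operations per joule.
    """
    ops_per_joule = tops_per_watt * 1e12
    joules_per_op = 1.0 / ops_per_joule
    return joules_per_op * 1e12  # joules -> picojoules


if __name__ == "__main__":
    print(f"2.5 TOPs/W is about {picojoules_per_op(2.5):.2f} pJ per operation")
```

At 2.5 TOPs per watt, the whole budget is 0.4 picojoules per operation, and that has to pay for both the arithmetic and getting the operands to it.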
Focus on Memory
One proven breakthrough approach to tackle the problem of data movement is at-memory computation, a purely digital technique centered on minimizing data movement. By tying hundreds of thousands of processing elements directly to small, individual blocks of dedicated SRAM, the throughput of each individual block can be much lower while still providing extremely high memory bandwidth. Running standard SRAM cells next to the processing elements at a lower speed allows them to be run at a much lower voltage, while also cutting energy costs for memory accesses. This can reduce power consumption for data transfer by up to a factor of six.
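As a rough illustration of what that could mean for total energy, the Python sketch below combines the two figures quoted in this article: 90 percent of energy going to data movement, and an up-to-sixfold reduction in data-transfer cost. It assumes the two figures apply to the same quantity and that compute energy is unchanged, neither of which is stated in the text, so treat the result as a best-case back-of-envelope estimate.

```python
def relative_total_energy(movement_fraction: float, movement_reduction: float) -> float:
    """Total energy after the reduction, relative to the original (1.0).

    movement_fraction  -- share of original energy spent moving data (0..1)
    movement_reduction -- factor by which data-movement energy is reduced
    Assumption: energy spent on computation itself stays the same.
    """
    compute_part = 1.0 - movement_fraction
    movement_part = movement_fraction / movement_reduction
    return compute_part + movement_part


if __name__ == "__main__":
    remaining = relative_total_energy(movement_fraction=0.9, movement_reduction=6.0)
    print(f"remaining energy: {remaining:.0%} of original")
    print(f"overall improvement: about {1.0 / remaining:.1f}x")
```

Under these assumptions the total falls to about a quarter of the original, roughly a fourfold gain. The untouched 10 percent spent on computation caps the overall improvement, however much data movement is reduced.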
Once coefficients and activations are loaded into SRAM, the distance the data must be moved is extremely short because the SRAM block and the processing element are physically abutted.
By maximizing the amount of SRAM available, and keeping the coefficients on chip, there is no external DRAM load and store, meaning that the advertised energy of our solution is the total energy it consumes.
This jump in efficiency enables unprecedented compute density. A single PCIe card can run at 2 PetaOperations per second (POPs) in under 300 watts, a server can now support 20 POPs of AI compute, and a single 42U rack can support 200 POPs of AI acceleration.
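The arithmetic behind these density figures can be checked with a short Python sketch. The card, server and rack numbers come from the paragraph above; the cards-per-server and servers-per-rack counts are simply inferred by division and are not stated in the text, and the rack power figure counts accelerator cards only.

```python
# Figures quoted in the text.
CARD_POPS = 2.0         # peta-operations per second per PCIe card
CARD_WATTS_MAX = 300.0  # the card is said to run "in under" this power
SERVER_POPS = 20.0
RACK_POPS = 200.0


def tops_per_watt(pops: float, watts: float) -> float:
    """Efficiency in tera-operations per second per watt."""
    return pops * 1000.0 / watts


def implied_count(total_pops: float, unit_pops: float) -> int:
    """How many units are implied by a total (an inference, not a spec)."""
    return round(total_pops / unit_pops)


if __name__ == "__main__":
    cards_per_server = implied_count(SERVER_POPS, CARD_POPS)
    servers_per_rack = implied_count(RACK_POPS, SERVER_POPS)
    rack_card_watts = cards_per_server * servers_per_rack * CARD_WATTS_MAX
    print(f"card efficiency: at least {tops_per_watt(CARD_POPS, CARD_WATTS_MAX):.1f} TOPs/W")
    print(f"implied cards per server: {cards_per_server}")
    print(f"implied servers per rack: {servers_per_rack}")
    print(f"accelerator power per rack: under {rack_card_watts / 1000:.0f} kW")
```

Because the card runs in under 300 watts, the efficiency figure here is a lower bound, at roughly 6.7 TOPs per watt.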
The overriding challenge of the AI era is how to provide the vastly greater amounts of compute needed, which is much more than traditional computing applications have ever required. By moving the compute element adjacent to memory cells, at-memory architecture provides unrivaled compute density, accelerating AI inference for diverse neural network workloads such as vision, natural language processing and recommendation engines. And the efficiency of at-memory computation enables more processing power to be delivered in a PCI Express form factor and power envelope than is otherwise possible.
This kind of breakthrough will herald accelerated innovation in AI that was not conceivable just a few years ago.
About the Author
Darrick Wiebe is the co-founder and head of technical marketing with high-performance AI chip company Untether AI. Wiebe is a software zealot, and his software is in use in a wide range of markets including military and data centers. He previously founded two startups, LightMesh and XN Logic, developing and leveraging graph database application framework technologies.