
Software’s Role in System Design Resurrected at Hot Chips 

The software market will be close to three times larger than the hardware market in 2026, and that fact wasn't lost on Intel CEO Pat Gelsinger at the Hot Chips conference this month.

Software will drive hardware development, specifically chips, as complex systems drive the insatiable demand for computing power, Gelsinger said during his keynote at the conference.

"Software has become way more important. We have to treat those as the sacred interfaces that we have, then bring hardware under and that's where silicon has to fit into it," Gelsinger said.

The importance of software in silicon development saw a revival at the Hot Chips conference. Many chip designs presented were steeped in the concept of hardware-software co-design, which emerged in the 1990s to ensure "meeting system level objectives by exploiting the synergism of hardware and software through their concurrent design," according to a paper published by IEEE in 1997.

Software is moving the industry forward with new styles of compute such as AI, and chipmakers are now taking a software-first approach in hardware development to support the new applications.

The idea of software driving hardware development isn't new, but it has been resurrected for the era of workload accelerators, said Kevin Krewell, an analyst at Tirias Research.

"We've had FPGAs since the 1980s and those are software-defined hardware. The more modern interpretation is that the hardware is an amorphous collection of hardware blocks that are orchestrated by a compiler to perform some workloads efficiently, without a lot of extraneous control hardware," Krewell said.

Chip designers are taking up hardware-software co-optimization to break down the walls between software tasks and the hardware they run on, with the goal of gaining greater efficiency.

"It's popular again today because of the slowing of Moore's Law, improvements in transistor speed and efficiency, and improved software compiler technologies," Krewell said.

Pat Gelsinger via Hot Chips livestream

Intel is trying to keep up with software's insatiable demand for computing by engineering new types of chips that can scale compute going forward.

"People develop software and silicon has to come underneath it,” Gelsinger said.

He added that chip designers “also need to consider the composition of the critical software components that come with it, and that combination, that co-optimization of software [and] hardware becomes essentially the pathway to being able to bring such complex systems.”

Gelsinger said software is indirectly defining Intel's foundry strategy and the ability of its factories to turn out newer types of chips that cram multiple accelerators into one package.

For example, Intel packed 47 computing tiles -- also called chiplets -- including GPU tiles, into its chip code-named Ponte Vecchio, which is designed for high-performance computing applications. Intel has backed the UCIe (Universal Chiplet Interconnect Express) protocol for die-to-die communication between chiplets.

“We're going to have to do co-optimizations across the hardware and software domain. Also across multiple chiplets -- how they play together,” Gelsinger said.

A new class of EDA tools is needed to build chips for systems at scale, Gelsinger said.

Intel also shed some light on its "software defined, silicon enhanced" strategy, tying it closely to its long-term plan of becoming a chip manufacturer. The goal is to plug middleware into the cloud that is enhanced by silicon, with Intel proposing subscription features that unlock the middleware and the silicon that boosts its speed.
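The shape of that subscription model can be pictured as a simple entitlement check before enabling a silicon-accelerated path. The sketch below is hypothetical: the feature names, entitlement store, and functions are invented for illustration and do not describe Intel's actual licensing mechanism.

```python
# Hypothetical sketch of a subscription-gated, silicon-enhanced code path.
# The entitlement table, feature names, and API are invented for illustration.
ENTITLEMENTS = {"customer-123": {"crypto_offload", "analytics_accel"}}

def run_workload(customer, feature, accelerated, fallback):
    """Use the silicon-enhanced path only if the subscription unlocks it."""
    if feature in ENTITLEMENTS.get(customer, set()):
        return accelerated()      # silicon-enhanced fast path
    return fallback()             # plain software path

result = run_workload(
    "customer-123",
    "crypto_offload",
    accelerated=lambda: "encrypted on accelerator",
    fallback=lambda: "encrypted in software",
)
print(result)
```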

Software can make data-center infrastructure flexible and intelligent via a new generation of smartNICs and DPUs, which are compute-intensive chips with networking and storage components.

Networking hardware architecture is at an inflection point, with software-defined networking and storage functions shaping hardware design, said Jaideep Dastidar, a senior fellow at AMD, who presented at the Hot Chips conference.

AMD talked about the 400G Adaptive smartNIC, which includes software-defined cores and fixed-function logic such as ASICs to process and transfer data.

Software elements are helping these chips take on a diverse set of workloads, including on-chip computing offloaded from CPUs. The software also gives these chips the flexibility to adapt to new standards and applications.

"We decided we're going to take the traditional hardware-software co-design paradigm and extend it to hardware software programmable-logic co-design," Dastidar said.

The chip has added ASIC-to-programmable-logic adapters, where one can layer in customizations such as custom header extensions, or add or remove accelerator functions. The programmable logic adapters -- which could be FPGAs defining ASIC functions -- can also do full custom data-plane offload.

The 400G Adaptive smartNIC also has programmable logic agents that interact with the embedded processing subsystem. Software can program the logic adapter interfaces to create coherent I/O agents that talk to the embedded processor subsystem and can be tweaked to run the network control plane. Software also allows data-plane applications to execute entirely in the ASIC, entirely in the programmable logic, or split across both.
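As a rough illustration of that flexibility, the sketch below models data-plane placement as a software-controlled table mapping each function to an execution target. The function names and targets are assumptions made for illustration, not AMD's actual API.

```python
# Hypothetical sketch of software-selected data-plane placement on a
# smartNIC: each function can run on the ASIC, in programmable logic,
# or split across both. Names and targets are invented for illustration.
from enum import Enum, auto

class Target(Enum):
    ASIC = auto()                # fixed-function hard logic
    PROGRAMMABLE_LOGIC = auto()  # FPGA fabric
    BOTH = auto()                # split across the two

# Software decides where each data-plane function executes.
placement = {
    "packet_parse":  Target.ASIC,                # standard and stable: hard logic
    "custom_header": Target.PROGRAMMABLE_LOGIC,  # custom header extension
    "crypto":        Target.BOTH,                # bulk in ASIC, exceptions in fabric
}

def dispatch(function: str) -> str:
    """Report where a data-plane function will run, per the software table."""
    return f"{function} -> {placement[function].name}"

for fn in placement:
    print(dispatch(fn))
```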

AI chip company Groq has designed a chip in which software takes over control: management is handed to the compiler, which directs hardware functions, code execution, data movement and other tasks.

The Tensor Streaming Processor architecture includes integrated software control units at strategic points to dispatch instructions to the hardware.

Groq uprooted conventional chip design, reexamined the hardware-software interface, and built a chip in which AI-like software controls handle chip operations.

"We explicitly turn over control to the software, specifically the compiler so that it can reason about the correctness and schedule instructions on the hardware,” Abts said.

Groq used AI techniques -- in which decisions are made based on patterns identified in data, from probabilities and associations -- to make determinations about hardware functionality. That differs from conventional computing, in which decisions are made through rigid logic, which can lead to waste.

"It wasn't about abstracting away the details of the hardware. It's about explicitly controlling the underlying hardware, and the compiler has a regular view of what the hardware is doing at any given cycle," Abts said.

Systems are getting more complex, with tens of thousands of CPUs, GPUs, smartNICs and FPGAs being plugged into heterogeneous computing environments. Each of these chips profiles differently in response time, latency and variation, which can slow down large-scale applications.

"Anything that requires a coordinated effort across the entire machine will ultimately be limited by the worst-case latency across the network. What we did is try to avoid some of this waste, fraud and abuse that crops up at the system level," Abts said.

Abts gave the example of a traditional RDMA request, in which issuing a read to a destination triggers a memory transaction there, and the reply then flows back across the network, where it can later be used.

“A much more simplified version of this is where the compiler knows the address that's being read. And the data is simply pushed across the network at the time it's needed so that it can be consumed at the source. This allows for a much more efficient network transaction with less messages on the network and less overhead," Abts said.
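The contrast can be sketched directly: the traditional pull-style read puts two messages on the wire, while the compiler-scheduled push needs one, because the producer already knows the address and when the data will be consumed. The message classes and memory layout below are illustrative assumptions, not a real RDMA API.

```python
# Hedged sketch contrasting the two network patterns Abts describes.
# The Message class, addresses, and timings are invented for illustration.
from dataclasses import dataclass

@dataclass
class Message:
    kind: str      # "read_request", "read_reply", or "push"
    addr: int
    payload: bytes = b""

def pull_read(addr, memory):
    """Traditional RDMA read: two messages cross the network."""
    request = Message("read_request", addr)               # source -> destination
    reply = Message("read_reply", addr, memory[addr])     # destination -> source
    return [request, reply]

def compiler_push(addr, memory):
    """The compiler knows the address and when the data is needed,
    so the producer simply pushes it: one message, no request leg."""
    return [Message("push", addr, memory[addr])]          # destination -> source

memory = {0x1000: b"tensor-tile"}
print(len(pull_read(0x1000, memory)), "messages on the wire (pull)")
print(len(compiler_push(0x1000, memory)), "message on the wire (push)")
```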

The concept of spatial awareness -- cutting down the distance data has to travel -- appeared in many presentations. Proximity between chips and memory or storage was a common thread in AI chip designs.

Groq made fine-grained changes to its basic chip design by decoupling the primary computing units found in CPUs, such as integer and vector execution cores, and bundling them into separate groups. The proximity speeds up integer and vector processing, which are used for basic computing and AI tasks.

The reordering has much to do with how data travels between processors in AI, Abts said.
