Advanced Computing in the Age of AI | Monday, June 24, 2024

Nvidia Announces New AI Supercomputer ‘Eos’ 

At GTC22 today, Nvidia unveiled its new H100 GPU, the first of its new ‘Hopper’ architecture, along with a slew of accompanying configurations, systems, technology and software. To show off these advances, the company also unveiled a new, massive supercomputer set to debut somewhere in the United States in a few months: Eos, named for the Greek goddess of the dawn.

Eos is based on the fourth-generation DGX system, the DGX H100, which was also launched today and which is powered by eight NVLink-connected H100 GPUs (more on all of that here). An external NVLink switch can then connect these DGX H100s into Pods, which offer up to an exaflop of AI performance and can themselves be linked in 32-node increments to form systems like Eos.

In total, Eos (pictured in a rendering in the header) will contain 18 of these 32-DGX H100 Pods, for a total of 576 DGX H100 systems; 4,608 H100 GPUs; 500 Quantum-2 InfiniBand switches; and 360 NVLink switches. “Eos will offer an incredible 18 exaflops of AI performance, and we expect it to be the world’s fastest AI supercomputer when it’s deployed,” said Paresh Kharya, senior director of product management and marketing at Nvidia, in a prebriefing for press and analysts.
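The system totals quoted above follow from simple multiplication, which the short Python sketch below reproduces. The function and constant names are illustrative only and do not come from Nvidia; the per-GPU AI figure is merely what the quoted 18 exaflops implies when divided evenly across the GPUs, not an official specification.

```python
# Figures taken from the article: 18 Pods, 32 DGX H100 systems per Pod,
# eight H100 GPUs per DGX H100. All names here are hypothetical.
PODS = 18
DGX_PER_POD = 32
GPUS_PER_DGX = 8


def eos_counts(pods=PODS, dgx_per_pod=DGX_PER_POD, gpus_per_dgx=GPUS_PER_DGX):
    """Return (total DGX systems, total GPUs) for the given configuration."""
    dgx_systems = pods * dgx_per_pod
    gpus = dgx_systems * gpus_per_dgx
    return dgx_systems, gpus


def implied_ai_petaflops_per_gpu(total_exaflops, gpus):
    """Divide a quoted system-wide AI figure evenly across all GPUs."""
    return total_exaflops * 1000 / gpus  # 1 exaflop = 1,000 petaflops


if __name__ == "__main__":
    dgx_systems, gpus = eos_counts()
    print(f"DGX H100 systems: {dgx_systems}")
    print(f"H100 GPUs: {gpus}")
    print(f"Implied AI petaflops per GPU: {implied_ai_petaflops_per_gpu(18, gpus):.2f}")
```

The same per-GPU figure multiplied by the 256 GPUs in one 32-system Pod comes out at about one exaflop, which is consistent with the per-Pod claim.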

The DGX H100 Pod. Image courtesy of Nvidia.

As each H100 delivers 30 teraflops of peak FP64 (IEEE) compute power, the traditional HPC peak across Eos’s 4,608 GPUs works out to 138.2 FP64 petaflops, while Nvidia’s FP64 tensor core processing format doubles that HPC peak performance to roughly 275 petaflops. Its 18 exaflops of AI performance may make Eos the most performant AI supercomputer, but it remains to be seen whether it will best other “AI supercomputers” on the Linpack metric. (To learn more about a few of the other systems that are planning to be the world’s fastest AI supercomputer, read coverage like this or this.)
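The FP64 peak figures are a straightforward product of per-GPU throughput and GPU count, as this minimal Python sketch shows. The names are hypothetical, and the tensor core rate is assumed to be exactly twice the 30-teraflop IEEE rate, as the paragraph describes.

```python
# Figures from the article: 30 teraflops of peak FP64 per H100, 4,608 GPUs.
# The doubling for FP64 tensor cores is taken from the text, not a datasheet.
H100_FP64_TFLOPS = 30.0
EOS_GPUS = 4608


def peak_petaflops(per_gpu_tflops, gpus):
    """Aggregate theoretical peak in petaflops (1 petaflop = 1,000 teraflops)."""
    return per_gpu_tflops * gpus / 1000.0


if __name__ == "__main__":
    vector_peak = peak_petaflops(H100_FP64_TFLOPS, EOS_GPUS)
    tensor_peak = peak_petaflops(2 * H100_FP64_TFLOPS, EOS_GPUS)
    print(f"FP64 peak: {vector_peak:.2f} petaflops")
    print(f"FP64 tensor core peak: {tensor_peak:.2f} petaflops")
```

An exact doubling lands slightly above the quoted 275 petaflops, so that figure presumably reflects rounding. These are theoretical peaks; a measured Linpack result would be lower.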

Nvidia’s internal AI development and software engineering teams will use Eos for the company’s products, including autonomous vehicles and conversational AI software. Eos will also power Nvidia-led research projects in areas like climate science and digital biology. “When we’ve got workloads that can really benefit from the H100, and recommenders and language models, now, obviously, that workload will be first on Eos,” Charlie Boyle, vice president and general manager of DGX Systems at Nvidia, told HPCwire.

But Nvidia also, of course, intends Eos to pave the road for clients to build similarly large systems. Boyle said that while “[Nvidia wants] the best tools for our research and development teams to use internally,” the more important part for Nvidia’s customers is that “we have the exact copy of what they’re running. … And the advantage of building one thing and the advantage of building our own supercomputers out of that one thing is, pretty much no matter what size a customer has of their system, we've got an equivalent system or bigger in-house.”

The sheer scale of Eos, he continued, will make it a boon for scaled-up testing. “Not only is Eos a great tool for our own internal users, but it also helps us make sure that everything that we're putting out to our customers is rock-solid. … And when they have questions, it's super easy to replicate. You know, our support team has their own set of DGXs that they use, but if they need something bigger, they just today call up the Selene team — and tomorrow, you know, it'll be the Eos team — and say, you know, ‘hey, can you try this?’”

Eos is set to succeed Selene, which was named for the Greek goddess of the moon and sister of Eos (is it too early to place a bet on a “Helios” system?), and which debuted in seventh place on the Top500 in June 2020. Nvidia said it took just a few weeks to build Selene, which comprises 280 of its DGX A100 systems and is housed at Nvidia’s Santa Clara headquarters. The preceding system, Circe, was based on 36 DGX-2H systems.

While Eos is set to debut within the year — ostensibly within the next few months — Ian Buck, vice president of hyperscale and HPC at Nvidia, said in a live GTC session that no H100-based systems will appear on the June Top500 list, as H100 hardware will not begin shipping until Q3.