Tesla Expands Its GPU-Powered AI Supercomputer – Is Dojo Next?
Using Selene’s Top500 submission as a proxy, we estimate that Tesla’s 7,360-GPU cluster would be capable of about 100 double-precision Linpack petaflops, though we expect that Tesla is running mainly single- and lower-precision workloads (FP32, FP16, bfloat16, etc.).
An even larger AI supercomputer – from Meta/Facebook – was detailed earlier this year. The AI Research SuperCluster (RSC) will employ 16,000 A100 GPUs, delivering more than 200 double-precision petaflops, once completed this summer.
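The proxy estimates above amount to scaling a known Linpack result linearly by GPU count. Here is a minimal sketch in Python. It assumes Selene's Top500 entry was roughly 63.46 Linpack petaflops on 4,480 A100 GPUs (neither figure is stated in this article; both are assumptions), and it assumes performance scales linearly with GPU count, which real systems only approximate.

```python
# Assumed proxy figures for Selene; these are NOT stated in the article.
SELENE_RMAX_PFLOPS = 63.46   # assumed Linpack result, in petaflops
SELENE_GPUS = 4480           # assumed A100 count behind that result


def estimate_linpack_pflops(gpu_count,
                            proxy_pflops=SELENE_RMAX_PFLOPS,
                            proxy_gpus=SELENE_GPUS):
    """Scale the proxy system's Linpack result linearly by GPU count."""
    if gpu_count < 0 or proxy_gpus <= 0:
        raise ValueError("GPU counts must be positive")
    return proxy_pflops * gpu_count / proxy_gpus


# GPU counts taken from the article.
tesla_estimate = estimate_linpack_pflops(7360)    # Tesla's expanded cluster
rsc_estimate = estimate_linpack_pflops(16000)     # Meta's RSC, once completed

print(f"Tesla estimate: {tesla_estimate:.1f} PF")
print(f"RSC estimate:   {rsc_estimate:.1f} PF")
```

Under these assumptions the results land near the "about 100" and "more than 200" petaflops figures cited above. A real Top500 run would depend on interconnect, memory configuration and tuning.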
The Tesla GPU system reveal came last June from Andrej Karpathy, the senior director of AI at Tesla, at the Conference on Computer Vision and Pattern Recognition (CVPR 2021). “I wanted to briefly give a plug to this insane supercomputer that we are building and using now,” Karpathy said. At the time, the system spanned 720 nodes, each powered by eight Nvidia A100 GPUs (the 80GB model), for a total of 5,760 A100s. At eight GPUs per node, the infusion of another 1,600 GPUs adds 200 nodes to the installation, for a total of 920 nodes.
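The node and GPU arithmetic can be checked with a short Python sketch. All figures come from the article; the helper name `nodes_for` is hypothetical, and the sketch assumes every node is fully populated with eight GPUs.

```python
GPUS_PER_NODE = 8  # eight A100s per node, per the article


def nodes_for(gpus, gpus_per_node=GPUS_PER_NODE):
    """Return the number of fully populated nodes needed for `gpus` GPUs."""
    if gpus % gpus_per_node != 0:
        raise ValueError("GPU count does not fill whole nodes")
    return gpus // gpus_per_node


original_nodes = 720
original_gpus = original_nodes * GPUS_PER_NODE   # GPUs at the June 2021 reveal

added_gpus = 1600
added_nodes = nodes_for(added_gpus)              # nodes added by the upgrade

total_gpus = original_gpus + added_gpus
total_nodes = original_nodes + added_nodes

print(original_gpus, added_nodes, total_gpus, total_nodes)
```

The totals work out to 7,360 GPUs across 920 nodes, matching the cluster size used for the Linpack estimate.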
News of the upgrade came via a tweet from Tim Zaman, an engineering manager at Tesla – part of a promotion for the upcoming MLSysConf. Tesla is sponsoring the conference, which runs from August 29, 2022, through September 1, 2022. The company is also holding its second AI Day event on September 30, 2022.
Tesla’s GPU clusters are prologue to the company’s upcoming, homegrown Dojo supercomputer, which has been in development since August 2020, when Tesla CEO Elon Musk tweeted: “Tesla is developing a [neural network] training computer called Dojo to process truly vast amounts of video data. It’s a beast! … A truly useful exaflop at de facto FP32.”
The design of Dojo was revealed at Tesla’s inaugural AI Day event last August, when details of the system and its constituent D1 chip surfaced. Tesla may be ready to spill some additional Dojo tea next week at Hot Chips. The (all-virtual) event kicks off Sunday, August 21, and runs through Tuesday, August 23, 2022. Tesla has three slots on the program, all on Tuesday. In the morning, Tesla hardware engineer Emil Talpes is scheduled to give a presentation titled “Dojo: The Microarchitecture of Tesla’s Exa-Scale Computer,” followed by Bill Chang, Tesla’s principal system engineer for Dojo, with his talk, “Dojo – Super-Compute System Scaling for ML Training.”
Later in the same day, Ganesh Venkataramanan, senior director of autopilot hardware at Tesla, will deliver a keynote talk, “Beyond Compute – Enabling AI through System Integration.” That is the second of two keynotes being featured at Hot Chips 2022; the other one (“Semiconductors Run the World”) will be given by Intel CEO Pat Gelsinger on Monday, August 22.
Several technologies are competing to power the fastest AI supercomputers in the world. In addition to market leader Nvidia’s GPUs, GPUs from AMD now power the world’s fastest publicly ranked supercomputer, Frontier. And Intel is working to release its Ponte Vecchio GPU, the primary engine for the future Aurora supercomputer. Custom chips are taking off as well: Google is on its fourth-generation TPUs; Microsoft has invested in FPGAs for running AI workloads; and Amazon has launched its Trainium and Inferentia chips for AI.