Advanced Computing in the Age of AI | Saturday, May 18, 2024

Nvidia Dominates MLPerf Inference, Qualcomm also Shines, Where’s Everybody Else? 


MLCommons today released its latest MLPerf inferencing results, with another strong showing by Nvidia accelerators inside a diverse array of systems. Roughly four years old, MLPerf still struggles to attract wider participation from accelerator suppliers. Nevertheless, overall performance and participation were up, with 19 organizations submitting twice as many results and six times as many power measurements in the main Inference 2.0 (datacenter and edge) exercise. Bottom line: AI systems are steadily improving.

David Kanter, executive director of MLCommons, the parent organization for MLPerf, said: “This was an outstanding effort by the ML community, with so many new participants and the tremendous increase in the number and diversity of submissions. I’m especially excited to see greater adoption of power and energy measurements, highlighting the industry’s focus on efficient AI.”

The MLPerf benchmark comes around four times a year, with inferencing results reported in Q1 and Q3 and training results reported in Q2 and Q4. Of the two, model training is more compute-intensive and tends to fall into the HPC bailiwick; inferencing is less so, but still taxing. The latest inference round had three distinct benchmarks: Inference v2.0 (datacenter and edge); Mobile v2.0 (mobile phones); and Tiny v0.7 (IoT).

MLPerf divides the exercises into Divisions and Categories to make cross-system comparisons fairer and easier, as shown in the slide below.

To a large extent, MLPerf remains a showcase for systems based on Nvidia’s portfolio of accelerators. The question that has dogged MLPerf is: “where’s everyone else?” Given the proliferation of AI chip start-ups with offerings, and rumblings from CPU makers — notably Intel and IBM — that their newer CPUs have enhanced inference capabilities, one would expect participation from a growing number of accelerator and system providers.

Karl Freund, founder and principal at Cambrian AI Research, said: “My take is that most companies don’t see enough ROI from the massive effort required to run these benchmarks. I count some 5,000 results in the spreadsheet. That’s a lot of work, and only Nvidia and Qualcomm are making the investments. I also fear that the next iteration is more likely to see two vendors dwindle to one than increase to three. I hope I’m wrong, as I think there is tremendous value in both comparative benchmarks and — even more importantly — in the improvements in software.”

Qualcomm is an MLPerf rarity at the moment in terms of its willingness to vie with Nvidia. It turned in solid performances, particularly on edge applications, where it dominated. Its Qualcomm Cloud AI 100 accelerator is intended to be not only performant but also power-efficient, and that strength was on display in the latest round.

During Nvidia’s press/analyst pre-briefing, David Salvator (senior product manager, AI Inference and Cloud) acknowledged Qualcomm’s strong power showing. “There are a couple of places on the CNN-type networks where frankly, Qualcomm has delivered a pretty good solution as it relates to efficiency. With that said, we outperform them on both workloads and, in the case of SSD-Large, by a factor of about three or four. … So really substantial performance difference, if you sort of put that in the context of how many servers would it take to get to equivalent performance, that really sort of cuts into their per-watt advantage.”
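Salvator’s argument — that needing several servers to match one higher-performing server erodes a per-watt advantage — can be made concrete with a back-of-the-envelope sketch. The throughput and power figures below are hypothetical placeholders, not actual MLPerf numbers; the point is only that fixed per-server host power compounds as server count grows.

```python
import math

def system_power_for_throughput(target_tput, card_tput, card_watts, host_watts):
    # Number of single-card servers needed to reach the target throughput,
    # and the total wall power including a fixed per-server host overhead.
    servers = math.ceil(target_tput / card_tput)
    return servers, servers * (card_watts + host_watts)

# Hypothetical numbers for illustration only:
# Card A: 3,500 inf/s at 400 W. Card B: 1,000 inf/s at 75 W.
# Both host platforms assumed to draw a fixed 300 W.
servers_a, power_a = system_power_for_throughput(3500, 3500, 400, 300)  # 1 server, 700 W
servers_b, power_b = system_power_for_throughput(3500, 1000, 75, 300)   # 4 servers, 1500 W
```

In this hypothetical, card B is more efficient per card (13.3 vs. 8.75 inf/s/W), yet matching card A’s throughput takes four servers and more than twice the total wall power — which is the shape of the “cuts into their per-watt advantage” claim.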

There were a few other accelerators in the mix across the three benchmark suites. NEUCHIPS’ FPGA-based RecAccel was used on DLRM (Deep Learning Recommendation Model). FuriosaAI entered results from a Supermicro system using the FuriosaAI Warboy chip in the Closed Edge category. Krai worked with Qualcomm and had Qualcomm Cloud AI 100-based entries in the datacenter and edge categories.

Intel, which participated in the Closed Inference Division (apples-to-apples) in the last round, didn’t do so this time; instead, it opted for the Open Division, which allows greater flexibility in models and software and isn’t considered an apples-to-apples comparison.

Intel has recently been touting its newer CPUs’ enhanced inference capabilities. IBM has made similar claims for Power10 but hasn’t participated in MLPerf. Graphcore, which did participate in the last MLPerf training exercise and says its technology is good for both training and inferencing, sat this exercise out. There are, of course, many other AI chip newcomers whose offerings are designed to speed up training and inferencing.

Maybe AMD will jump into the fray soon. It took direct aim at Nvidia earlier this spring by touting an improved version of the AMD/Xilinx FPGA-based inferencing card (VCK5000) as having a significantly better TCO for inferencing than most of Nvidia’s lineup. Such frontal assaults are unusual. AMD said while there wasn’t time to submit the improved VCK5000 to the most recent MLPerf inferencing exercise, it planned to do so in future exercises.

Preamble aside, Nvidia remains king of the AI accelerator hill, at least in terms of datacenter and broadly available offerings. Much of Nvidia’s briefing focused on its latest Jetson AGX Orin device and its performance at the edge. Software was again a key driver of performance gains, and Salvator highlighted Nvidia’s Triton inference-serving software, which was used both with Nvidia-based systems and with submissions based on AWS instances using Amazon’s Inferentia processor rather than Nvidia accelerators.

Nvidia’s big A100 GPU got a little less attention, beyond Salvator touting its multi-instance capability (MIG). “We take a single workload from the benchmark, and we run it in isolation in one MIG instance. That is our baseline number. Having done that, we then go ahead and load up the rest of the GPU with the rest of the benchmarks, plus an additional copy of one test to fill out that seventh MIG instance. We’re basically lighting up the whole part. What’s great about this, is we’re doing all this and delivering essentially the same performance. [There’s only] about a two percent performance hit, which is almost within the range of [expected] variation on these benchmarks,” he said.

There were a couple of changes to the most recent MLPerf Inference components and procedure. One was to shorten the time required to run the tests. Kanter explained: “We made a change in the rules that allows every benchmark, more or less, to be run in under 10 minutes. And that required a bunch of statistical analysis and work to get it right. But this has shortened the runtime for some of the benchmarks that are running on lower performance systems. You know, there are people who are submitting on Raspberry Pis. And this allows them to do it in a much timelier fashion.”
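The “statistical analysis” Kanter alludes to is about running fewer queries while still trusting tail-latency percentiles. As a hedged illustration (a textbook normal-approximation rule of thumb, not MLPerf LoadGen’s actual derivation), one can estimate how many queries are needed so that the empirical 99th percentile is statistically meaningful:

```python
import math

def min_queries_for_percentile(p=0.99, confidence_z=2.576, tolerance=0.0025):
    """Sample size so the empirical p-th percentile's probability level is
    within `tolerance` of p at the given confidence (z = 2.576 ~ 99%).
    A generic rule of thumb, not MLPerf's actual minimum-query-count math."""
    return math.ceil(confidence_z**2 * p * (1 - p) / tolerance**2)

n = min_queries_for_percentile()  # on the order of ten thousand queries
```

The trade-off Kanter describes is visible here: loosening the tolerance or confidence shrinks the required query count, which is what makes sub-10-minute runs feasible on slow devices like a Raspberry Pi.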

Another change was swapping in a new dataset – KiTS 2019 for BraTS 2019 – in the 3D medical imaging test. The new dataset is more taxing, as Salvator noted: “KiTS actually is a series of high-resolution images of kidney tumors. And if you look at scores in the 3D U-Net tests, from [Inference] version 1.1 to 2.0, you’re going to see that the numbers came down quite a bit. A lot of what’s driving that is the dataset now being used is much more taxing on all the systems being tested.” BraTS was a less complex brain tumor scan dataset.

Clearly MLPerf continues to evolve, which is a good thing. The various test suites are shown in the slides below.

Here’s MLPerf’s bulleted list of highlights for this year’s results by category excerpted from the announcement:

  • “The MLPerf Inference benchmarks primarily focus on datacenter and edge systems and submitters include Alibaba, ASUSTeK, Azure, Dell, Fujitsu, FuriosaAI, Gigabyte, H3C, Inspur, Intel, Krai, Lenovo, Nettrix, Neuchips, NVIDIA, Qualcomm Technologies, Inc., Supermicro, and ZhejiangLab. This round set new records with over 3,900 performance results and 2,200 power measurements, respectively 2X and 6X more than the prior round, demonstrating the momentum of the community.
  • “The MLPerf Mobile benchmark suite targets smartphones, tablets, notebooks, and other client systems with the latest submissions highlighting an average 2X performance gain over the previous round. MLPerf Mobile v2.0 includes a new image segmentation model, MOSAIC, that was developed by Google Research with feedback from MLCommons. The MLPerf mobile application and the corresponding source code, which incorporates the latest updates and submitting vendors’ backends, are expected to be available in the second quarter of 2022.
  • “The MLPerf Tiny benchmark suite is intended for the lowest power devices and smallest form factors, such as deeply embedded, intelligent sensing, and internet-of-things applications. The second round of MLPerf Tiny results showed tremendous growth in collaboration with submissions from Alibaba, Andes, hls4ml-FINN team, Plumerai, Renesas, Silicon Labs, STMicroelectronics, and Syntiant. Collectively, these organizations submitted 19 different systems with 3X more results than the first round and over half the results incorporating energy measurements, an impressive achievement for the first benchmarking round with energy measurement.”

The results are intended to provide system buyers with reasonably fair guidance for purchasing and to provide systems builders with insight about their products and how they compare to others. MLPerf has made perusing the results spreadsheet online relatively easy. That said, it’s still not easy.

With apologies for the “eye chart” figures below, they will give you a sense of what poring through the MLPerf spreadsheet entails. Shown here are top results in the Closed Inference (datacenter) Division. This was a simple sort done on ResNet results. The full spreadsheet has many column headers for all of the tests. There is also a column with links to more details for a given entry on GitHub.

Acknowledging the challenge, Kanter said, “We’ve got thousands of results in Inference (test suite) and tens of them in the Tiny category. One of the ways to look at this is [to] try to find results that are similar in some dimensions, and then look at the ways that they vary. So, you could look at same software, different hardware, you could look at same hardware, different software, you could look at changes over time for the benchmarks that are compatible.”
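Kanter’s “hold one dimension fixed” advice is easy to script once the results are exported. The sketch below uses made-up rows and illustrative column names (the actual spreadsheet headers differ) to show the same-hardware/different-software comparison he describes:

```python
# Hypothetical rows; column names and values are illustrative only,
# not the actual MLPerf spreadsheet schema.
results = [
    {"submitter": "VendorA", "accelerator": "ChipX", "software": "stack-1.0", "resnet": 31000},
    {"submitter": "VendorA", "accelerator": "ChipX", "software": "stack-1.1", "resnet": 34000},
    {"submitter": "VendorB", "accelerator": "ChipY", "software": "stack-1.0", "resnet": 28000},
]

def same_hardware_different_software(rows, accelerator):
    # Hold the hardware dimension fixed, then rank by the metric of interest
    # to see what the software stack alone contributes.
    subset = [r for r in rows if r["accelerator"] == accelerator]
    return sorted(subset, key=lambda r: r["resnet"], reverse=True)

best = same_hardware_different_software(results, "ChipX")[0]  # newest software stack wins
```

The same filter-then-sort pattern covers his other suggested cuts: fix the software stack and vary the hardware, or fix both and compare across benchmark rounds.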

MLCommons allows participants to submit statements of work (~300 words) intended to briefly describe their MLPerf submission and effort. While many are just pitches, several are useful in describing the system/components/software used. As they are relatively brief, they may provide a useful glimpse before plunging into the data tables. Those statements are appended to the article.

Link to datacenter results:

Link to mobile results:

Link to tiny results:

Header image: Nvidia Jetson AGX Orin

This article first appeared on sister publication HPCwire.