
The Mainstreaming of MLPerf? Nvidia Dominates Training v2.0 but Challengers Are Rising 

MLCommons’ latest MLPerf Training results (v2.0) issued today are broadly similar to v1.1 released last December. Nvidia still dominates, but less so (no grand sweep of wins). Relative newcomers to the exercise – AI chip/system makers Graphcore and Habana Labs/Intel – along with Google again posted strong showings. Google had four top scores, essentially splitting wins with Nvidia in the “available” category. The number of submitters grew to 21 from 14 in December. Overall, this round delivered roughly 1.8x better performance than the last.

This may be what early success looks like for MLPerf, the four-year-old AI technology benchmarking effort created and run by MLCommons. Repeat submitters in the latest round included Azure, Baidu, Dell, Fujitsu, GIGABYTE, Google, Graphcore, HPE, Inspur, Intel-Habana Labs, Lenovo, Nettrix, Nvidia, Samsung and Supermicro. First-time MLPerf Training submitters were ASUSTeK, CASIA, H3C, HazyResearch, Krai, and MosaicML. The most notable absentee, perhaps, was Cerebras Systems, which has consistently expressed disinterest in MLPerf participation.

“MLPerf represents a diverse set of real-world use cases,” said Shar Narasimhan, Nvidia director of product management for accelerated computing. “Speech recognition and language processing models are key, along with recommenders, reinforcement learning and computer vision. All of the [training] benchmarks remain the same as the previous round, with the exception of object detection lightweight, which has been improved.

“We had 21 companies submit into the various MLPerf benchmarks in this particular round and they had over 260 submissions. It’s a real increase in participation and speaks well for the benchmark as a whole. The 21 submitters used four different accelerators from Google to Graphcore to Habana and Nvidia. We’re excited to see that 90 percent of the MLPerf submissions in this particular round use the Nvidia AI platform,” said Narasimhan.

Google got into the act, reporting, “Today’s release of MLPerf 2.0 results highlights the public availability of the most powerful and efficient ML infrastructure anywhere. Google’s TPU v4 ML supercomputers set performance records on five benchmarks, with an average speedup of 1.42x over the next fastest non-Google submission, and 1.5x vs our MLPerf 1.0 submission. Even more compelling — four of these record runs were conducted on the publicly available Google Cloud ML hub that we announced at Google I/O.” One Google win was in the research, not the available, category.

Nvidia (and Google) had plenty to crow about – top performance in four categories – but the MLPerf training exercise is starting to have a very different feel – perhaps less dramatic but also more useful.

Because system configurations differ widely and because there are a variety of tests, there is no single MLPerf winner as there is in the Top500 list. Even when Nvidia was sweeping every category – driven by its formidable technical strength and a lack of competitors – it was always necessary to dig into individual system submissions to make fair comparisons among them; Nvidia GPUs were really the only accelerator game in town. Now, the (slowly) growing number of ML accelerators participating in MLPerf, their improving performance versus Nvidia, and MLPerf’s effort to regularly refresh its test suite are transforming MLPerf into a broader lens through which to assess different AI accelerators and systems, including cloud instances.

The analyst community also seems to think MLPerf is carving out a durable, useful role.

  • Hyperion Research analyst Alex Norton said, “Overall, I think MLPerf is growing into a more widely used and accepted tool in the AI hardware/solution space, mainly due to the large number of categories and mini-apps that are available to the providers. It allows for different technologies to highlight their specific capabilities, rather than a broad benchmark that does not necessarily reflect capabilities on specific, key applications. While not every company is putting out MLPerf results, my sense is that users are starting to want to see the MLPerf scores on the applications that are more important to them, and it may be gaining traction as a key factor in purchasing and grading new technologies.”
  • IDC’s Peter Rutten said, “MLPerf is definitely going mainstream, and it will become an important benchmark. As a measure of performance, which is critical in HPC and AI, MLPerf has become the go-to benchmark. Some vendors will still say that the benchmark doesn’t do them justice, and that is sometimes true. As IDC predicted, the newcomers are starting to nibble at the incumbents with comparable or sometimes better MLPerf results. The market for AI processors and coprocessors is getting more and more crowded, and you could argue that these MLPerf results are indicating the end of an era in which one or two large companies dominated this space.”

Judging performance using the MLPerf benchmarks is still tricky. For example, the top chart above from Nvidia shows the absolute time to train a model on a given system (or cloud instance). Google TPUv4 turned in the top performances on BERT and ResNet-50. Looking just at the ResNet-50 results directly from the MLPerf spreadsheet, the two Google instances (tpu-v4-8192 and tpu-v4-6912) had 4096 and 3456 TPU chips, respectively, and delivered times of 0.191 and 0.230 minutes. An Nvidia DGX system with 4216 NVIDIA A100-SXM-80GB GPUs delivered a time of 0.319 minutes. A Graphcore system (Bow-Pod256) with 256 Bow IPUs took 2.672 minutes.

Obviously, differing system sizes and accelerator counts are big factors in time-to-train, and often so is the software stack. For this latest training round, Nvidia took a stab at normalizing performance on a per-chip basis (slide below; *method explained at end of article) and says it had the fastest per-chip performance on six of the tests. The point is that it takes a lot of care to compare systems using MLPerf results. Fortunately, MLPerf has made the results spreadsheet readily available and relatively easy to slice and dice.
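To make the arithmetic concrete, here is a minimal Python sketch of one naive way to put the ResNet-50 numbers above on a roughly common footing: multiplying time-to-train by accelerator count to get a crude chip-minutes figure. This is purely illustrative and is not Nvidia’s normalization method (which, as noted, is explained at the end of the article); the chip counts and times are simply the ones cited above.

```python
# Illustrative only: a naive "per-chip" scaling of the ResNet-50 times cited above.
# This is NOT Nvidia's normalization method; it just multiplies time-to-train by
# accelerator count to give a rough chip-minutes figure for comparison.

results = [
    # (submission, accelerator, chips, time_to_train_minutes)
    ("Google tpu-v4-8192",   "TPU v4",        4096, 0.191),
    ("Google tpu-v4-6912",   "TPU v4",        3456, 0.230),
    ("Nvidia DGX system",    "A100-SXM-80GB", 4216, 0.319),
    ("Graphcore Bow-Pod256", "Bow IPU",        256, 2.672),
]

for name, accel, chips, minutes in results:
    chip_minutes = chips * minutes  # crude proxy; ignores scaling efficiency
    print(f"{name:22s} {accel:14s} {chips:5d} chips  "
          f"{minutes:6.3f} min  ~{chip_minutes:8.1f} chip-minutes")
```

The obvious caveat is that scaling is not linear, so a simple chip-minutes proxy flatters smaller systems; that is precisely why careful, per-configuration comparisons (and more sophisticated normalization) are needed.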

By way of background, MLPerf was launched in 2018 with training as its first exercise. It has been busily adding benchmark suites covering different areas (e.g., inference), with most run twice yearly (see chart below). The lineup now also includes four inference suites and an HPC training exercise. Currently, MLPerf is developing an AI-relevant storage benchmark it hopes to launch in 2022. How all of these different benchmarks fare long-term is unclear, but their very emergence is more evidence of the AI technology wave rolling through IT, including HPC.

For this round of training, MLPerf added a new object detection benchmark that trains the new RetinaNet reference model on the larger and more diverse OpenImages dataset. MLPerf says the new test more accurately reflects state-of-the-art ML training for applications such as collision avoidance for vehicles and robotics as well as retail analytics. “I’m excited to release our new object detection benchmark, which was built based on extensive feedback from a customer advisory board and is an excellent tool for purchasing decisions, designing new accelerators and improving software,” said David Kanter, executive director of MLCommons, parent organization of MLPerf.

Each set of benchmarks typically has two divisions: “The Closed division is intended to compare hardware platforms or software frameworks ‘apples-to-apples’ and requires using the same model and optimizer as the reference implementation. The Open division is intended to foster faster models and optimizers and allows any ML approach that can reach the target quality,” states MLCommons.

Overall, said Kanter, “If we compare the best results of this round to sort of the high watermark previously, there’s about 1.8x better performance. So that’s a pretty significant move forward.”

Read the rest of the article at HPCwire.
