Advanced Computing in the Age of AI | Friday, June 9, 2023

Moving Machine Learning from Test to Production: High Performance Data Management 

If advanced scale computing were a high school, machine learning would be the coolest kid at the cool kid lunch table, and that kid’s name would be Nvidia. Already a Wall Street sweetheart coming into 2017, Nvidia stock has gone into orbit this month with earnings far in excess of expectations and the revelation that SoftBank has amassed a staggering $4 billion ownership of Nvidia shares (with rumors of more SoftBank money on the way). This followed the launch two weeks ago of the company’s massive new “Volta” GPU processor technology for powering machine learning applications.

Good on you, Nvidia.

But there’s more to machine learning than rocket-fueled GPU (or other accelerated processor) technology. It’s called systems balance. The HPC industry has dealt with the problem of imbalanced systems for decades: processing power outstrips the surrounding data management, memory and fabric technologies that comprise the application environment. Sure, the GPU is the star of the machine learning show, but focusing on the processor at the expense of everything else leads to systems that are lopsided – or “FLOPsided,” in the phrase of industry watcher/punster Steven Conway, of Hyperion Research.

Data-hungry algorithms need high performance data management systems that can feed the processor beast. Unless data management, memory and fabric work in concert with the processor, machine learning proof-of-concept projects that show promise in the test phase fall over when scaled out to the production phase.

That’s the picture presented by DataDirect Networks, the high performance data storage vendor that is well established on the Top500 list of the world’s most powerful supercomputers and has reported healthy growth in enterprise advanced scale verticals as well.

“Machine learning projects in the commercial space, and we’re mostly talking about deep learning projects, tend to not take into account the immense scale that they’re going to reach very quickly if they’re successful,” Laura Shepard, DDN’s senior director of product marketing, told EnterpriseTech. “So we see these types of projects being prototyped in an environment that’s not sustainably scaled to the performance and capacity that’s required for successful projects of that type in production.”

It’s a common scenario for data scientists to build a prototype application using a few hundred terabytes of test or training data, then put the application into production and scale up to tens of petabytes in a matter of three to six months. Failure to take into account the storage and I/O side of the equation, Shepard said, tends to happen in scenarios driven from a data science perspective.
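To put rough numbers on that ramp, here is a minimal back-of-the-envelope sketch in Python. The starting and ending capacities (300 TB and 30 PB) are illustrative assumptions chosen to match "a few hundred terabytes" and "tens of petabytes"; they are not figures from DDN. The function names are hypothetical as well.

```python
def monthly_growth_factor(start_tb, end_tb, months):
    """Constant month-over-month multiplier needed to grow from start_tb to end_tb."""
    return (end_tb / start_tb) ** (1.0 / months)


def average_ingest_gb_per_s(start_tb, end_tb, months, days_per_month=30):
    """Average sustained rate of net new data, in GB/s (decimal units, 1 TB = 1000 GB)."""
    seconds = months * days_per_month * 24 * 3600
    return (end_tb - start_tb) * 1000 / seconds


if __name__ == "__main__":
    start_tb, end_tb = 300, 30_000  # assumed: 300 TB prototype, 30 PB in production
    for months in (3, 6):
        growth = monthly_growth_factor(start_tb, end_tb, months)
        ingest = average_ingest_gb_per_s(start_tb, end_tb, months)
        print(f"{months} months: x{growth:.2f} per month, ~{ingest:.2f} GB/s net ingest")
```

Under these assumptions, capacity would have to more than double every month even on the six-month timeline. The ingest figure counts only net new data written; it leaves out the repeated reads that training runs generate, so the real I/O load would be higher.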

“The successful machine learning program scales in multi-dimensions very rapidly at the performance level and in terms of the incremental types of data that get brought in to improve the outcomes of the (machine learning) techniques used,” Shepard said. “This means that massive amounts of mixed I/O are required for the success of programs at scale…when one of these projects goes into production and becomes wildly successful.”

Shepard made these comments in connection with DDN’s announcement this week that the company is gaining traction in the machine learning market, participating in large commercial programs in manufacturing, autonomous vehicles, smart cities, medical research and natural-language processing.

For machine learning applications at scale, DDN said it “delivers up to 40X faster performance than competitive enterprise scale out NAS and up to 6X faster performance than enterprise SAN solution.” The company said it also allows machine learning and deep learning programs to start small and scale to production-level performance and petabytes per rack with no additional architecting required.

“With DDN, we can manage all our different applications from one centrally located storage array,” said Joel Zysman, director of advanced computing at the Center for Computational Science at the University of Miami, “which gives us both the speed we need and the ability to share information effectively. Plus, for our industrial partnership projects that each rely on massive amounts of instrument data in areas like smart cities and autonomous vehicles, DDN enables us to do mass transactions on a scale never before deemed possible. These levels of speed and capacity are capabilities that other providers simply can’t match.”

Addison Snell of Intersect360 Research said DDN and other high performance storage vendors are positioned to address the need for storage and data management capabilities that complement the demands of machine learning.

“The big trend we’re seeing is that those (storage vendors) focused on performance are taking share away from the more enterprise storage players,” Snell told EnterpriseTech. “You see a partition of the enterprise storage that’s for HPC use. But high performance applications (such as machine learning) increasingly are demanding the organizations make investment in high performance storage as well.”

Snell said the machine learning market, broadly speaking, is growing beyond the test phase and also beyond deep-pocketed, heavily resourced hyperscalers.

“Coming into this year the majority of deep learning and machine learning deployments have been at the hyperscale providers, who have this infrastructure they can rely on to run things at scale. And to the extent that non-hyperscalers were experimenting with (machine learning) they were doing this predominantly in public cloud environments, so they were leveraging hyperscaler resources.

“This year we’re seeing increasing numbers of organizations starting to do proof-of-concept of on-premises machine learning experimentation, but as you try to scale out that proof-of-concept you either need to move it to the cloud to get those resources, or invest not only in high performance computing but also in high performance data management internally in order to leverage all the data that you have.”

This is particularly true for data-sensitive, high performance vertical markets, such as oil and gas and financial services, Snell said.

“To be successful, machine learning programs need to think big from the start,” said DDN’s Shepard. “Prototypes of programs that start by using mid-range enterprise storage or by adding drives to servers often find that these approaches are not sustainable when they need to ramp to production. With DDN, customers can transition easily with a single high-performance platform that scales massively. Because of this, DDN is experiencing tremendous demand from both research and enterprise organizations looking for high-performance storage solutions to support machine learning applications.”