
Machine Learning Infrastructures Require Scale to Spare 

Used across a wide range of processes to improve or replace human input, machine learning is attractive because it can address problems that were never tackled before due to the prohibitively large volumes of data involved.

But when it comes to managing these often massive data sets, it’s important to think big from the start to ensure long-term success. In fact, the larger the data set, the more potential value machine learning brings.

As an example, scientists at the University of Miami’s Center for Computational Sciences (CCS) are working with the city of Miami on a machine-learning project for Miami’s famous Beach Walk, aimed at driving better strategies for services and maintenance schedules and providing real-time insights that can improve public safety and service responses. The Beach Walk covers a 30-block radius with a variety of terrain and hundreds of ingress points, so the project uses sound and light as proxies for gathering “people-movement” data.

Data samples are gathered 10 times per second from more than 100 sensors, and the data flows into CCS’s shared HPC infrastructure, which is designed to support virtually any big data analysis and management scenario. The volume of data would have been inconceivable a few years ago. CCS uses an analytics cluster for state mapping of population size and movement patterns over time to inform more effective maintenance planning. A second, smaller machine learning cluster uses the same data to monitor how people are moving in real time, to predict what they may do, and to anticipate what real-time support may be needed, such as police support, ambulatory support, or trash collection and general maintenance.
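For a sense of the raw data rates involved, the back-of-the-envelope sketch below uses the figures above (100-plus sensors sampled 10 times per second); the 2 KB per-sample payload is an assumed value for illustration, not a figure from CCS.

```python
# Back-of-the-envelope estimate of the Beach Walk sensor data volume.
# From the article: more than 100 sensors, 10 samples per second each.
# The 2 KB payload per sample is an assumed value for illustration only.

SENSORS = 100              # "more than 100" sensors; 100 used as a floor
SAMPLES_PER_SEC = 10       # 10 samples per second per sensor
PAYLOAD_BYTES = 2 * 1024   # assumed size of one sound/light sample record

samples_per_day = SENSORS * SAMPLES_PER_SEC * 60 * 60 * 24
bytes_per_day = samples_per_day * PAYLOAD_BYTES

print(f"samples per day:   {samples_per_day:,}")                      # 86,400,000
print(f"raw data per day:  {bytes_per_day / 1024**3:.1f} GiB")        # ~164.8 GiB
print(f"raw data per year: {bytes_per_day * 365 / 1024**4:.1f} TiB")  # ~58.7 TiB
```

At more than 86 million samples a day, even a modest per-sample payload pushes the raw footprint into tens of terabytes per year before any derived analytics data is stored.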


With the exception of projects running at shared high performance computing (HPC) centers, most artificial intelligence (AI) projects start as prototypes, built either on white box storage or on existing enterprise storage. These projects usually start small, involve mixed IO workloads and, if successful, get much bigger, much faster than expected. As a result, they run into problems with scale.

Scaling failures can show up in a variety of ways, from the inability to deliver data access at the required speed to the inability to scale data storage in a footprint that’s cost-effective and easy to manage. Any of these failures can derail the overall program, because if the inputs and the depth of the deep learning networks can’t grow, the outputs can’t scale.

Unfortunately, when a project reaches this point, a complete re-tooling is required. Some teams try side-by-side silos as a way to avoid re-tooling: they copy the non-scaling architecture and point half of the computation at the new gear. But this doubles the number of environments to manage, and any shared inputs or outputs now require twice the storage space and twice the management. It also adds significant latency to job completion when results from the first silo are required inputs to the second silo. The cost and complexity only grow when the program needs to add a third or fourth silo.
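To make that cost argument concrete, here is a minimal sketch of how duplicated silos multiply the storage footprint and add cross-silo transfer time; the dataset size, link bandwidth, and silo counts are hypothetical values, not measurements from any program.

```python
# Illustration of how side-by-side silos multiply storage and add transfer delay.
# All figures (dataset size, link bandwidth, silo counts) are hypothetical.

SHARED_DATASET_TB = 50   # assumed size of shared inputs/outputs every silo needs
LINK_GBPS = 10           # assumed network link between silos, in gigabits per second

def silo_overhead(num_silos: int) -> tuple[float, float]:
    """Return (total TB stored, hours to copy the shared data set between silos)."""
    total_tb = SHARED_DATASET_TB * num_silos  # each silo keeps its own copy
    copy_hours = (SHARED_DATASET_TB * 8e12) / (LINK_GBPS * 1e9) / 3600
    return total_tb, copy_hours

for silos in (1, 2, 3, 4):
    tb, hours = silo_overhead(silos)
    print(f"{silos} silo(s): {tb:.0f} TB stored, "
          f"~{hours:.1f} h per cross-silo copy of the shared data")
```

Every additional silo adds another full copy of the shared data, and every cross-silo dependency pays the multi-hour transfer penalty again.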

Successful projects that avoid this re-tooling period are the ones where infrastructure owners take the time — usually near the end of the prototyping phase — to think about various potential data scenarios: low, expected, and high requirements for data access, computation, retention, and protection.

This advice may sound generic, but it is not. Even the “low” data growth curve in production machine learning environments dwarfs the vast majority of data generation in non-machine learning workflows. In machine learning projects we have seen, observed data growth rates for production environments have ranged from 9x to 16x per year, making appropriate sizing and scaling options critical. Data growth is going to be such that 100 percent reliance on a fast tier will not be practical in the vast majority of cases. A good storage strategy should have a strong fast-tier option, along with integration with on-premises archive and hybrid cloud. Proven scaling and performance under load for similar workloads is a good indicator that you are on the right track.
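As a rough planning aid, the sketch below projects capacity under low, expected, and high annual growth multipliers and splits each year’s footprint between a fast tier and archive. The 9x and 16x multipliers come from the growth rates cited above; the 4x “low” case, the 10 TB starting point, and the 20 percent fast-tier share are illustrative assumptions.

```python
# Rough capacity projection for low / expected / high data-growth scenarios.
# 9x and 16x come from the growth rates cited in the article; the 4x "low"
# multiplier, the 10 TB starting point, and the 20% fast-tier share are
# assumptions for illustration only.

START_TB = 10.0           # assumed capacity at the end of the prototyping phase
FAST_TIER_FRACTION = 0.2  # assumed share of data that must live on the fast tier
YEARS = 3

scenarios = {"low": 4, "expected": 9, "high": 16}

for name, growth in scenarios.items():
    capacity = START_TB
    print(f"{name} scenario ({growth}x per year):")
    for year in range(1, YEARS + 1):
        capacity *= growth
        fast = capacity * FAST_TIER_FRACTION
        archive = capacity - fast
        print(f"  year {year}: total {capacity:,.0f} TB "
              f"(fast tier {fast:,.0f} TB, archive {archive:,.0f} TB)")
```

Even the “low” curve outgrows an all-flash footprint within a year or two, which is why the fast tier needs a well-integrated archive behind it.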

When designing an infrastructure for machine learning and AI projects, look for scalable, high-performance storage that has the intelligence to manage flash and active archive. Choose a system that can handle medium- and high-potential outcomes and that can grow performance and capacity without requiring silos. Finally, make sure the suppliers you select offer advanced technology for the flash era and have the expertise to help you plan strategically for long-term success.

Kurt Kuckein is director, product management, at DataDirect Networks.
