Advanced Computing in the Age of AI | Sunday, September 24, 2023

It’s time for radically simplified AI and ML model training 
Sponsored Content by HPE

Here’s the good news: Artificial intelligence (AI) has graduated from being an emerging technology used by large, technically advanced companies to a solution that practically any company can use to drive their business forward.

And here come the challenges: As an ever-increasing number of companies decide to leverage the power of AI across their enterprises, they’re running into significant challenges at critical points in model development and training.

Rewriting model code and managing infrastructure can be real barriers to AI model training at scale – and the reasons are manifold. For one thing, training deep learning models is complex. AI and machine learning (ML) workloads have distinct infrastructure needs and often require specialized hardware. For example, GPUs are usually required to train AI/ML models. Handling all this leaves ML engineers focused more on managing infrastructure than on developing and training models.

What’s more, myriad technology choices with differing enabling software and infrastructure are available in the marketplace today, complicated by the fact that cloud vendors and specialized hardware vendors are often locking companies into their platform options.

What if you could scale AI model training from idea to impact – with minimal code rewrites or infrastructure changes?

There’s more good news: The recently announced HPE Machine Learning Development System is a flexible system that’s purpose-built for AI and created specifically for turnkey model development and training at scale. The system includes an innovative software platform, high-performance computing, networking, accelerators, start-up services, and full solution support. It’s designed to enable companies to achieve their vision of AI as a core competency, enabling rapid development, iteration, and scaling of high-quality models from proof-of-concept to production. A closer look at specific features reveals where the real benefits are found.

#1 Pre-configured, fully installed, and performant out of the box

Out-of-the-box performance means reduced IT complexity with on-site installation, configuration, and standard model setup. This gives ML engineers more time to focus on model development.

#2 Seamless scalability, distributed training, and hyperparameter optimization

Companies can perform deep learning across GPU clusters with minimal code changes. Distributed training delivers near-linear scaling performance across multiple GPUs. As indicated by HPE research, hyperparameter optimization is up to ~100x faster* than existing approaches. What’s more, GPU costs are more manageable.
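To make the hyperparameter optimization point more concrete, here is a minimal Python sketch of successive halving, the early-stopping routine that Hyperband (cited in the footnote) builds on. It is an illustration only and not the HPE software's implementation: the function names, the toy loss function, and the learning-rate values are all assumptions made for this example.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Keep the best 1/eta of the configs each round and give the
    survivors eta times more training budget (lower score is better)."""
    survivors = list(configs)
    if not survivors:
        raise ValueError("need at least one configuration")
    budget = min_budget
    history = []  # one (budget, number of configs evaluated) pair per round
    while True:
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
        history.append((budget, len(survivors)))
        if len(survivors) == 1:
            break
        keep = max(1, len(survivors) // eta)
        survivors = ranked[:keep]
        budget *= eta
    return survivors[0], history


def toy_loss(cfg, budget):
    # Hypothetical stand-in for "train for `budget` epochs, return validation loss".
    # The loss shrinks with more budget and is lowest at lr = 0.1.
    return (cfg["lr"] - 0.1) ** 2 + 1.0 / budget


# Hypothetical search space: nine candidate learning rates.
candidates = [{"lr": lr} for lr in (0.001, 0.01, 0.03, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0)]
best, history = successive_halving(candidates, toy_loss)

# Budget units spent with early stopping vs. training every config to the end.
total_cost = sum(budget * count for budget, count in history)
full_cost = len(candidates) * history[-1][0]
print(best, history, total_cost, full_cost)
```

In this toy run, weak configurations are dropped after a short trial, so most of the budget goes to the promising ones. The size of the saving on a real workload depends on the model, the search space, and how well early results predict final quality.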

#3 Experiment tracking

Experiment tracking improves collaboration among ML engineers, saving engineers an estimated one day per week.

#4 Manageability and observability

Infrastructure and system resource utilization are monitored by workload.

#5 Enterprise-level support and services come from a trusted vendor

With the HPE Machine Learning Development System comes access to an experienced talent pool of AI, HPC, and IT experts. The supply chain is predictable and secure, and the software stack is continuously evolving and improving.

Real use cases demonstrate real benefits

An early customer required a new solution for developing large natural language processing models for both training and inference. The chosen HPE Machine Learning Development System was composed of 64x HPE Apollo A6055 Gen10 Plus Systems with NVIDIA® A100 Tensor Core (80 GB) GPU with NVLink, Mellanox InfiniBand HDR switch, Aruba 6300 1GbE switch, HPE Parallel File Storage System (PFSS) with IBM Spectrum Scale, HPE Performance Cluster Manager, Red Hat® Enterprise Linux® (RHEL), and the HPE Machine Learning Development Environment. Development was done on premises and in the cloud. The resulting solution delivers faster outcomes that matter, including model parallelism, customized hyperparameter optimization, and experiment tracking for collaboration. The solution is also delivering faster time to value.

A company in the pharmaceutical industry was looking to evolve cell research for drug discovery by studying the impact of light wavelengths on various cell components as a way to better understand and then synthesize those components. Modern automation is critical to this technique to accelerate the understanding of what's truly happening in the cell images. The company found that adopting deep learning techniques with the support of the HPE Machine Learning Development Environment helped increase the accuracy of the modeling techniques from an average of 80% to over 99%, according to internal HPE results.*

And in the autonomous vehicle sector, one company is building computer vision models for things like pedestrian detection and stoplight detection. They were at the point where they needed the ability to scale, going from running jobs on a few dozen GPUs to running hundreds of single jobs on hundreds of GPUs in a high-performance fashion. This volume of big jobs would run every day, continually collecting rich data from cars to improve models to meet enhanced safety goals. The company worked with HPE to deliver a balanced, optimized system incorporating the right kind of accelerator, networking fabric, and storage technology, complemented with advisory services. The result is a purpose-built, custom-designed version of the HPE Machine Learning Development Environment, tailored to precisely suit the workload.

HPE brings new breakthrough AI solutions for speeding data-first modernization from edge to cloud, enabling AI to scale up to industrial-sized global applications. We make AI that is data-driven, production-oriented, and cloud-enabled – available anytime, anywhere, and at any scale. Our solutions support today’s enterprises in sectors such as financial services, health and life sciences, and manufacturing. Ideal for almost every industry, the new HPE Machine Learning Development System lets you scale AI model training from idea to impact with minimal code rewrites or infrastructure changes.

Because it arrives preconfigured, the system can take as little as one day to set up and get going. 

*Research results: 2020 MLSys paper; JMLR paper on Hyperband