
Taking AI development and test to the cloud 
Sponsored Content by Dell Technologies | Intel

Many organizations leverage cloud-based HPC resources to develop and test AI models, and then move the production models to on-premises systems — for lots of good reasons.

For enterprises looking to gain a competitive edge with artificial intelligence systems, it’s off to the races. Everyone is trying to get there first, which means that everyone has a need for speed in their software development and testing processes. This sense of urgency is one of the reasons organizations increasingly look to the cloud for fast access to flexible pools of high-performance computing (HPC) resources.

For example, if your organization wants to train a neural network to drive a recommendation engine or a natural language processing application, you might need access to dozens of compute nodes, some accelerators, fast interconnects and high-speed storage. In a case like this, cloud-based HPC systems can give you the resources you need with no upfront capital costs and no requirement to build and maintain an in-house cluster.
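
To make that concrete, here is a minimal sketch of what such a multi-node training job can look like in PyTorch. The tiny linear model is a hypothetical stand-in for a real recommendation or NLP network, and it is the all-reduce of gradients during the backward pass that puts those fast interconnects to work.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every worker process.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Hypothetical stand-in for a real recommendation or NLP model.
        model = DDP(torch.nn.Linear(256, 10).to(local_rank), device_ids=[local_rank])
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

        for _ in range(100):
            features = torch.randn(64, 256, device=local_rank)
            labels = torch.randint(0, 10, (64,), device=local_rank)
            loss = torch.nn.functional.cross_entropy(model(features), labels)
            optimizer.zero_grad()
            loss.backward()   # gradients all-reduced across nodes over the interconnect
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched on each node with something like "torchrun --nnodes=4 --nproc_per_node=8 train.py", a script like this is exactly the kind of workload that consumes dozens of GPU-equipped nodes at once, whether those nodes are rented in the cloud or racked in your own data center.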

That’s all part of the goodness of using the cloud as a starting point for an AI journey. But what happens after you have your deep-learning model trained and tested and your application is ready to go into production? In many cases, organizations move the production AI application to their on-premises data centers. Bringing the application in-house allows them to avoid the high cost of renting cloud-based HPC resources on an ongoing basis, while making better use of their existing HPC resources and staff expertise.

This is a point that is underscored in a report from the research firm Moor Insights & Strategy.

“Most startups begin their AI journey using cloud-hosted services because it’s easier to spin up a GPU-equipped instance, upload training data, and begin to develop the neural network model than it is to plan, procure, and install the necessary hardware and software,” the firm notes. “However, many or even most startups will quickly outgrow this stage and reach the point where renting is no longer more affordable than owning the infrastructure.”[1]

Moor points out that most enterprises already have a substantial IT organization running in on-premises or co-located data centers, and would typically conduct a total cost of ownership analysis to determine the best place to host a production AI system. In many cases, this analysis will hinge on the expected utilization rates of the GPUs and the scope and ramp of the organization’s AI journey, Moor notes.

“Since those are usually unknown factors in the early stages of research and development, many enterprises rightly choose to start their AI journey in the cloud and then move to their own hardware once they have production models and begin to keep the servers and GPUs busy,” the firm says.
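
To see how that utilization threshold falls out of the math, consider a back-of-the-envelope sketch in Python. Every figure below, the hourly rental rate, the amortized cost of owning a GPU and the three-year depreciation window, is an illustrative assumption rather than vendor pricing; the point is the shape of the comparison, not the specific numbers.

    # Back-of-the-envelope cloud-vs-on-premises comparison.
    # All figures are illustrative assumptions, not vendor pricing.

    CLOUD_RATE_PER_GPU_HOUR = 3.00    # assumed on-demand rental rate per GPU
    ONPREM_COST_PER_GPU = 40_000.00   # assumed purchase + power/cooling/ops per GPU
    AMORTIZATION_MONTHS = 36          # assumed three-year depreciation window
    HOURS_PER_MONTH = 730

    def monthly_cloud_cost(gpus, utilization):
        """Cloud: you pay only for the hours the GPUs actually run."""
        return gpus * utilization * HOURS_PER_MONTH * CLOUD_RATE_PER_GPU_HOUR

    def monthly_onprem_cost(gpus):
        """On-premises: the amortized cost accrues whether GPUs are busy or idle."""
        return gpus * ONPREM_COST_PER_GPU / AMORTIZATION_MONTHS

    break_even = monthly_onprem_cost(1) / (HOURS_PER_MONTH * CLOUD_RATE_PER_GPU_HOUR)
    print(f"Break-even utilization: {break_even:.0%}")

    for utilization in (0.10, 0.25, 0.50, 0.75):
        cloud, onprem = monthly_cloud_cost(8, utilization), monthly_onprem_cost(8)
        winner = "cloud" if cloud < onprem else "on-premises"
        print(f"{utilization:4.0%} busy: cloud ${cloud:8,.0f}/mo vs on-prem ${onprem:8,.0f}/mo -> {winner}")

With these assumed numbers the crossover lands at roughly 50 percent utilization, which is exactly the "begin to keep the servers and GPUs busy" point Moor describes. Plug your own rates into the same comparison to find where your crossover sits.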

A few other insights from the analysts at Moor:

  • Spinning up a server or two with GPUs is incredibly easy to do in a public cloud infrastructure, as the sketch after this list illustrates. In addition, the major cloud service providers have assembled impressive suites of software and pre-trained neural networks to further simplify the on-ramp to their cloud data centers.
  • Many organizations will eventually need significant computing infrastructure for AI and HPC as their applications begin to run at scale. This, along with data transfer and throughput fees, begins to tip the cost balance in favor of building on-premises infrastructure as the organization matures in AI.
  • The need for compute, storage and networking speed is further magnified when AI training runs begin to demand tens (or even hundreds) of servers and GPUs. At this point, the benefits of easy cloud startup are dwarfed by the costs of dedicated cloud Infrastructure as a Service (IaaS).
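
As an illustration of that easy on-ramp, here is a minimal sketch using AWS's boto3 SDK to launch a single GPU instance. The AMI ID, key pair name and instance type are hypothetical placeholders you would replace with your own, and the other major cloud providers offer equally short paths to a running GPU server.

    import boto3  # AWS SDK for Python

    # Launch a single GPU instance for AI development and testing.
    # The AMI ID and key pair below are hypothetical placeholders.
    ec2 = boto3.client("ec2", region_name="us-east-1")
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # e.g., a deep learning AMI of your choice
        InstanceType="p3.2xlarge",         # GPU-equipped instance type
        MinCount=1,
        MaxCount=1,
        KeyName="my-dev-keypair",          # hypothetical SSH key pair
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": "ai-dev-gpu"}],
        }],
    )
    print("Launched:", response["Instances"][0]["InstanceId"])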

A boost from the National Science Foundation

For organizations that want to leverage cloud-based HPC resources for research projects, the National Science Foundation is doing its part to help people get into the game. The NSF has provided nearly $30 million in new funding for research in data science and engineering through its “BIGDATA” program. These NSF awards are paired with support from Amazon Web Services, Google Cloud Platform and Microsoft Azure, which each committed up to $3 million in cloud resources for relevant BIGDATA projects over a three-year period.

The BIGDATA program funds novel research in computer science, statistics, computational science and mathematics that seeks to advance the frontiers of data science. The program also supports work on innovative applications that leverage data science advances to enhance knowledge in various domains, including the social and behavioral sciences, education, biology, physical sciences and engineering.

“A key goal of this collaboration is to encourage research projects to focus on large-scale experimentation and scalability studies,” NSF says.[2]

Key takeaways

All things considered, the cloud can be the right place to get started with AI development and test projects. You can explore your concepts and prove the viability of your approaches without making the substantial investments required for the deployment of on-premises HPC clusters.

But as your project grows, chances are you will see compelling arguments for bringing your AI workloads into your on-premises data center, creating a hybrid cloud environment.

To learn more

For a deeper dive:

  • Read the Moor Insights & Strategy paper "AI and HPC: Cloud or On-Premises Hosting."
  • Find out how you can stand out from the crowd by offering high-performance computing as a service with Intel.
  • Explore AI and HPC cloud options from Dell Technologies.
  • Learn how AI solutions from Dell Technologies and Intel can help you unlock the value of your data.


[1] Moor Insights & Strategy, "AI and HPC: Cloud or On-Premises Hosting," February 2019.

[2] National Science Foundation, “Leading cloud providers join with NSF to support data science frontiers,” February 7, 2018.

 
