
AI Turns Up the Heat in the Data Center

Sometimes, the laws of physics can be so annoying, and AI is a case in point.

There is a theoretical, experimentally verified upper limit on how many computations can be performed per kilowatt-hour.[1] It follows from Landauer's Principle: every irreversible operation must dissipate a minimum amount of energy as heat, so more processing inevitably means more heat. Compute-intensive AI is thus a veritable furnace within the data center, and there's only so much that can be done to turn down the temperature. High-density AI infrastructure is outstripping the cooling capabilities of existing facilities, and the trend will continue.
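
For a sense of scale, here is a back-of-the-envelope calculation (an illustration added for this discussion, not a figure from the article) of what Landauer's Principle implies per kilowatt-hour, assuming roughly room-temperature operation:

    import math

    # Landauer's limit: erasing one bit must dissipate at least k*T*ln(2) joules.
    # Real silicon dissipates many orders of magnitude more per operation.
    k_B = 1.380649e-23                 # Boltzmann constant, J/K
    T = 300.0                          # assumed operating temperature (about room temp), K
    e_per_bit = k_B * T * math.log(2)  # roughly 2.9e-21 J per bit erased

    JOULES_PER_KWH = 3.6e6
    ops_per_kwh = JOULES_PER_KWH / e_per_bit
    print(f"Theoretical ceiling at {T:.0f} K: about {ops_per_kwh:.1e} bit erasures per kWh")
    # Prints roughly 1.3e+27; every joule spent above that floor leaves the rack as heat.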

As enterprises move more aggressively into AI—with implementations growing fourfold over the past four years—they will soon face essential questions about where to house their AI hardware so it can keep its cool.

The Cloud Makes It Someone Else’s Problem

The first instinct is often to turn to a public cloud, an understandable but not always optimal solution. It's true that Amazon, Microsoft, Google and other hyperscale providers offer technology that smaller enterprises can more easily afford today, and offloading the cooling problem is a welcome bonus.

For all its advantages, however, the cloud has downsides. Some IT professionals still worry about keeping confidential data off-site, as well as about the potential impact of cloud or connectivity outages. The most troubling issues of all, though, are latency and cost.

Some AI applications, such as facial recognition at airports, should run locally because the time required to send data to a central cloud would compromise their effectiveness. And although this may seem an isolated use case, IBM asserts that “Organizations that are deriving the most value from data are building their data management and AI platforms close to where the data resides.”

Call it the “edge for AI.”

Even for latency-tolerant applications, there remains the issue of cost. Michael Dell, for one, has long argued that a public cloud isn't the place for predictable workloads (i.e., ones that don't require significant elasticity). Renting capacity will almost always cost more than owning the infrastructure, he says, so unless the cloud's elasticity is genuinely needed, on-premises or colocation infrastructure generally makes better financial sense over the long run. AI, he maintains, will cost less to run in your own data centers. Of course, that assumes enterprises are building modern infrastructure, and getting there can be a significant undertaking.
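
As a purely illustrative sketch of that elasticity argument, with placeholder prices standing in for whatever rates a given enterprise actually faces, the comparison reduces to a break-even utilization level:

    # Hypothetical break-even between renting cloud capacity and owning hardware.
    # All figures are placeholders; only the shape of the comparison matters.
    HOURS_PER_MONTH = 730

    owned_monthly_cost = 4000.0   # placeholder: amortized hardware + power + space
    cloud_hourly_rate = 12.0      # placeholder: rate for a comparable GPU instance

    breakeven_hours = owned_monthly_cost / cloud_hourly_rate
    breakeven_utilization = breakeven_hours / HOURS_PER_MONTH
    print(f"Renting wins only below ~{breakeven_utilization:.0%} utilization "
          f"(~{breakeven_hours:.0f} hours/month)")
    # With these placeholder figures, an always-on AI cluster far exceeds the ~46% break-even.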

The Actual Demands of AI

If a public cloud isn’t the answer, what does it take for an enterprise to run AI on its own? Some lower-end applications can still run on CPUs, but more demanding workloads call for GPUs, ASICs and FPGAs.

This transition typically means that companies once provisioning up to 7 kW per rack are now preparing for at least 30 kW, and often 50 kW, per rack. It is commonly understood, however, that fan-based air cooling loses viability somewhere above 15 kW. Google, for example, found that its existing cooling solutions couldn’t keep up with the heat generated by its third-generation Tensor Processing Units (TPU 3.0), and other organizations will eventually run into the same problem.
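
A rough sensible-heat estimate, assuming a 20°F air temperature rise across the rack, illustrates why fans alone struggle at these densities:

    # Rough sensible-heat estimate: CFM is approximately 3.16 * watts / delta_T(°F) at sea level.
    def airflow_cfm(rack_kw: float, delta_t_f: float = 20.0) -> float:
        """Airflow (cubic feet per minute) needed to carry rack_kw of heat away."""
        return 3.16 * (rack_kw * 1000.0) / delta_t_f

    for kw in (7, 15, 30, 50):
        print(f"{kw:>2} kW rack -> ~{airflow_cfm(kw):,.0f} CFM")
    # A 7 kW rack needs roughly 1,100 CFM; a 50 kW rack needs roughly 7,900 CFM,
    # far more air than typical raised-floor delivery and server fans can push through one rack.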

The obvious solution is liquid cooling, a technology not widely seen since the days of room-sized mainframes, apart from the occasional overclocker's custom build. Some “plug and play” liquid cooling systems are being developed to help enterprises update existing data centers on an as-needed basis without significant facilities impacts. But with Gartner predicting that 30 percent of data centers will soon no longer be economical to operate,[2] enterprises with high AI ambitions will want to look at more comprehensive solutions to drive resource and cost efficiencies.

In-House vs. Colocation

This brings us back to the longstanding “in-house vs. colocation” question. AI puts the issue in a different light, but many familiar factors will steer corporate decision-making.

For the on-premises option, the leading concerns will usually be the cost and the expertise required to retrofit or construct a data center that meets AI-oriented specifications. After years of dire predictions from pundits claiming “the data center is dead,” CIOs may be surprised to find themselves once again searching for capital to support such a build-out.

Moreover, as the industry finds its footing with liquid cooling and AI infrastructure in general, the lack of time-tested best practices will make planning more difficult and projects riskier. At present, some of the most up-to-date guidance comes from the purpose-built designs of hyperscale providers, which may or may not translate directly to the enterprise.

Colocation offers an interesting best-of-both-worlds alternative to on-premises and public clouds, because it allows organizations to retain ownership of their data and hardware while handing off the facilities concerns to specialists. The cost savings from sharing high-volume power and internet contracts with other tenants can be attractive. Large colocation providers are also motivated to test and deploy innovative technologies, including new liquid cooling systems, even when they deliver only incremental efficiency gains. Enterprises can then benefit without taking on the risks of early adoption themselves.

Unfortunately, the main barrier to colocation for AI is likely to be the same as for the cloud: latency. Sometimes there is simply no substitute for having systems on-premises. Whereas organizations today are accustomed to loading large data sets into analytics applications in batches, genuine AI will involve a more fluid data fabric to keep models continuously supplied. In cases where large amounts of data are generated on-site and must be integrated immediately, a colocation provider at a distance may not be viable.

There is no one-size-fits-all home for AI. Most smaller companies, and those for which AI is a sideline interest, will continue to use a public cloud. For organizations fully invested in AI, there will often be financial and operational advantages to owned infrastructure. Cost, expertise and latency are the factors most likely to tip the balance between on-premises and colocation.

[1] http://energy.mit.edu/news/energy-efficient-computing/

[2] https://www.gartner.com/en/doc/3746424-100-data-and-analytics-predictions-through-2021

Michael Cantor is chief information officer at Park Place Technologies, a provider of data center maintenance and support services.

EnterpriseAI