SC19: IBM Bets on New HPC-AI Game Plan 

IBM is known for making big bets. Summit supercomputer – a big win. Red Hat acquisition – looking like a big win. OpenPOWER and Power processors – jury’s out. At SC19, long-time IBM’er Dave Turek, vice president technical computing OpenPOWER, sketched out a different kind of bet for Big Blue – a small ball strategy, if you’ll forgive the baseball analogy – in which the idea isn’t to sell big machines or infrastructure packages (those are welcome, of course). Instead, IBM is rolling out an effort to “supercharge” existing installed HPC infrastructures using IBM AI expertise in systems with small footprints (i.e. price tags).

The idea is simple. Deploy small Power-based systems, perhaps as small as a single node, that take in application data and operations data from the client’s infrastructure and run it through IBM’s AI software – Turek emphasizes IBM’s Bayesian engine (IBM Bayesian Optimization, or IBO) – to speed up applications and deliver value fast, at low cost and with little disruption.

“Just to drive home the point of what our ambition is; [it] is to produce solutions that are minimally or zero disruptive on the installed base, okay, just so that we can shrink that sales cycle and make life as easy as possible for whatever the client is and what they’re doing,” Turek told us at an SC19 briefing last week. One can imagine IBM introducing systems optimized for one or another domain (manufacturing, finance, EDA, security, IT systems management, etc.) that use AI to dramatically speed up tasks running on the host infrastructure. None were announced.

Indeed, there were no new major IBM system announcements or major Power10 processor updates at SC19. Summit (148.6 PF Rmax) and Sierra (94.6 PF Rmax), both IBM machines, retained their spots atop the Top500. Instead, there was a short white paper on IBM’s forthcoming Bayesian software, which IBM says can speed up and improve the accuracy of all manner of simulations. IBM, for example, used it to reduce the number of EDA simulations by 79 percent on a Power10 microprocessor development task.

“What I’ve been saying that I’ve not made explicit,” Turek said, “is that the kind of transition I’ve characterized here is also an articulation of the emergence of organic innovation coming from inside IBM. That maybe [in the past] some people looked at AI and said, well, AI is all about Nvidia’s GPU and Mellanox networks and Google’s TensorFlow and stuff like that. Those are like table stakes, if you will. But it’s actually the stuff that we’re working on here, which has all the value. I can’t deploy Nvidia GPUs and give you an eight times speed-up painlessly, right? That’s going to be a big bill. But if I can deploy a small Power cluster, got a few GPUs in it, and make your applications run 20 times faster.”

This idea of taking client data and using various AI capabilities to turn it into actionable insight isn’t new at IBM. That was at least part of the core IBM Watson strategy. At SC18, Turek, whose title just changed to vice president of high performance and cognitive computing, IBM Cognitive Systems, outlined a similar idea. The most recent formulation seems a more concrete plan with specific products to follow. The IBO software, for example, is currently in beta and will be generally available in the June 2020 timeframe, according to Turek.

It’s not the case that IBM is turning its back on the major system business, said Turek. In the last year or so it has, for example, had significant success in academia with “MIT, Princeton University, University of Miami, University of Tennessee, RPI, TU Dresden, etc.,” noted Turek. Summit and Sierra, of course, were giant DOE wins, and there was some surprise IBM didn’t win at least one of the forthcoming exascale machines. Energy giant Total’s Power9-based Pangea (17.9 Linpack petaflops) is among the largest commercial HPC systems in the world.

The challenge from a business perspective is that deals for such large systems are scarce, and OpenPOWER systems and Power microprocessor sales have not caught fire. Building a new ecosystem and dethroning x86 turned out to be a formidable task. But with the new strategy, basically every existing substantial infrastructure is a potential sales target. For starters, wrangling through tedious capital budget cycles won’t be necessary.

“What we’re trying to do strategically is get away from this rip-and-replace phenomenon that’s characterized HPC since the beginning of time, get away from the lag in time of what it takes to get capital approvals and site preparation, all that stuff,” said Turek.

IBM shared few technology details about the new initiative beyond its current Bayesian centricity. AI, of course, encompasses a variety of machine learning, deep learning and analytics technologies. Leveraging these, for example, to improve traditional HPC tasks such as modeling and simulation is an area of intense work right now across the HPC community.

Broadly, Bayesian approaches utilize real-world results and experience to inform models. The approach dates back to its inventor Thomas Bayes (c. 1701-1761). IBM is pitching Bayesian optimization techniques as bringing intelligence to simulation. The excerpt below is from the brief paper IBM promoted at SC19, which discusses examples from EDA, drug discovery, and computational chemistry:

“[To] close the efficiency gaps with traditional search methodology, optimization algorithms are bringing intelligence to the design and deployment of computational experiments. Bayes equation, and the statistics behind that equation, provides guidance on the most probable parameter set to advance the exploration of the response function. Bayesian optimization answers the question ‘based on the limited information I have, what is the best thing for me to do next?’”
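
For readers who want the equation behind that excerpt, a standard textbook statement of Bayes' rule, applied to an unknown response function, is sketched below. The notation is an assumption made here for illustration and does not come from IBM's paper: f is the response function, D is the set of simulation results observed so far, x is a candidate parameter set, and a is an acquisition function that scores how promising x looks under the updated belief.

```latex
p(f \mid D) \;=\; \frac{p(D \mid f)\, p(f)}{p(D)},
\qquad
x_{\text{next}} \;=\; \arg\max_{x} \; a\bigl(x \,;\, p(f \mid D)\bigr)
```

The first expression updates the belief about the response function after each result; the second picks the next experiment, which is the "what is the best thing for me to do next?" step in the excerpt.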

As always, the market will decide based on technology, value, and ease of use. To the HPC community the approach may sound a little too “black boxish,” but perhaps not. Skirting lengthy, often contentious processes to free up capital budgets is always a plus if the cost is sufficiently low. There’s also the hope/expectation for follow-on sales of more nodes and other required infrastructure, storage for example.

Here’s Turek’s boiled-down pitch:

“So the Bayesian approach is much better because you have the sequential information passing back and forth between the simulation and Bayesian methods that give you information to orchestrate the way the simulations are run. So we take a solution. It’s a small Power cluster. It’s got the Bayesian software on it. You have an existing, I don’t know, a Haswell cluster, a few years old, running simulations, and because your last dose of capital from your enterprise was five years ago, you can’t get another bunch of money. What do you do?

“What we would do is bring in our cluster, put it in the data center, and bring up a database and give access to that database from both sides. So [a] simulation runs, it puts output parameters into the database. I know something’s arrived, I go and inspect that, my machine learning algorithms analyze that, and it makes recommendations of the input parameters. So next simulation: rinse and repeat. And as you progress through this, the Bayesian machine learning solution gets smarter and smarter and smarter. And you get to your end game quicker and quicker and quicker. So empirical results: In the chemical formulation problem [that he cited], they reduced the amount of compute by two thirds. In discovery for new drugs for disease, they reduced the compute by 95 percent.

“I put it in a four-node cluster adjacent to my 2,000-node cluster, and I make my 2,000-node cluster behave as if it was an 8,000-node cluster? How long do I have to think about this?”
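
The loop Turek describes (a simulation writes results to a shared database, the optimizer reads them and recommends the next inputs) can be sketched in a few dozen lines. This is only an illustrative sketch under stated assumptions, not IBM's software: `simulate` is a hypothetical stand-in for an expensive simulation, an in-memory SQLite table stands in for the shared database, and the optimizer is a textbook Gaussian-process surrogate with a lower-confidence-bound rule, chosen here because it is a common form of Bayesian optimization.

```python
import math
import sqlite3

def simulate(x):
    """Hypothetical stand-in for an expensive simulation; lower output is better."""
    return (x - 0.3) ** 2 + 0.1 * math.sin(15 * x)

def rbf(a, b, length=0.15):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def posterior(xs, ys, x_new, noise=1e-4):
    """Gaussian-process posterior mean and standard deviation at x_new."""
    mean_y = sum(ys) / len(ys)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, [y - mean_y for y in ys])
    v = solve(K, k_star)
    mu = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    var = max(rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v)), 0.0)
    return mu, math.sqrt(var)

def recommend(xs, ys, candidates, kappa=2.0):
    """Pick the candidate with the lowest lower confidence bound (mu - kappa * sigma)."""
    def lcb(x):
        mu, sigma = posterior(xs, ys, x)
        return mu - kappa * sigma
    return min(candidates, key=lcb)

def run_loop(n_iters=12):
    db = sqlite3.connect(":memory:")  # assumption: stands in for the shared database
    db.execute("CREATE TABLE runs (x REAL, y REAL)")
    candidates = [i / 100 for i in range(101)]
    for x in (0.0, 0.5, 1.0):  # a few seed runs so the model has data to start from
        db.execute("INSERT INTO runs VALUES (?, ?)", (x, simulate(x)))
    for _ in range(n_iters):
        # "AI side": read every result so far and recommend the next input.
        rows = db.execute("SELECT x, y FROM runs").fetchall()
        xs = [r[0] for r in rows]
        ys = [r[1] for r in rows]
        untried = [c for c in candidates if c not in xs]
        x_next = recommend(xs, ys, untried)
        # "HPC side": run the simulation and write its output back.
        db.execute("INSERT INTO runs VALUES (?, ?)", (x_next, simulate(x_next)))
    best = db.execute("SELECT x, y FROM runs ORDER BY y LIMIT 1").fetchone()
    count = db.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
    return best, count

if __name__ == "__main__":
    best, count = run_loop()
    print(f"best input {best[0]:.2f} with output {best[1]:.4f} after {count} runs")
```

In this toy setup the search spends 15 simulation runs rather than sweeping all 101 candidate inputs, which is the kind of saving the pitch is about. As a rough conversion, and assuming every simulation costs about the same, cutting compute by a fraction r is equivalent to a speed-up of 1/(1 - r): two thirds corresponds to 3x, 95 percent to 20x, and a 2,000-node cluster behaving like an 8,000-node one corresponds to a 75 percent reduction.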

Stay tuned.

This article originally appeared in sister publication HPCwire.
