Advanced Computing in the Age of AI | Tuesday, July 23, 2024

NUMA’s Revenge: The Resurgence of Shared Memory Systems 

Hewlett-Packard may be known as the volume player in the X86 server business, which has been dominated by two-socket servers for the past decade and a half. But the company also has lots of expertise in building much heftier shared-memory systems.

With Project Odyssey, HP is taking the engineering from its Superdome 2 systems, which have a history that dates back to the other side of the dot-com boom and also reaches way back into the Convex supercomputer business that HP acquired back in 1995, and applying it to X86-based machines. Suffice it to say, with its HP, Digital Equipment, and Convex roots, the company knows big iron. And it also knows that after decades of distributed computing, some customers are starting to take a hard look at massive shared memory systems again. And they are doing so because their applications are driving them to.

To get a sense of what is happening at the top-end of the systems market, EnterpriseTech sat down with Kirk Bresniker, who is chief technologist for HP's Business Critical Systems division and who has been designing HP's largest systems since 1993, initially based on HP's own PA-RISC processors, then for Itanium processors from Intel, and now, with the future "DragonHawk" systems, for Intel's upcoming "Ivy Bridge-EX" Xeon E7 processors.

Timothy Prickett Morgan: What are the current needs among large enterprises when it comes to shared memory systems? How big is big enough and how are they addressing scalability and reliability needs? How does it work out there in the real world, and how is that changing?

Kirk Bresniker: I think we are still seeing a continuation of the traditional solutions, which are large-scale machines and high availability clusters, which includes products like our HP ServiceGuard in metro clusters or continental clusters to achieve disaster tolerance and high availability. For some of those big databases, we are seeing the traditional model continue on. A lot of that data was accumulated over a long period of time, and the applications above it are tailored to those databases. So there is a little bit of inertia there. We are seeing Superdome 2 systems and the HP-UX operating system carry through. But those are not the greenfield applications.

I think what we are seeing now is something else. If you had asked me a year and a half ago would I see a need for a large system with a shared memory address space on the X86 side, I might have thought there were a couple of use cases. Maybe people would move traditional Unix databases in time over to a new platform, where they wanted to absolutely minimize their transition investment by moving from big scale-up Unix to big scale-up X86. That is really what our "Project Odyssey" effort was about.

TPM: But the funny thing is that I am seeing, somewhat unexpectedly, a resurgence in big NUMA servers. I mean, people still call them SMP systems out of habit, but they are really NUMA boxes.

Kirk Bresniker: Something different has emerged here. We started seeing it first with our ProLiant DL980, which has eight "Westmere" Xeon E7 sockets, and with the plan with our "DragonHawk" product, where we are bringing Superdome scalability to the next-generation Intel Xeon. And it has really been driven by SAP's efforts to push its HANA in-memory database beyond analytics, which has traditionally been a scale-out application, into online transaction processing, which is really calling out for that big memory space.

This is sort of changing the tide. We have seen people realize that they can move from millisecond latency on even moderate-sized clusters to nanosecond latencies by collapsing the cluster down to a single shared memory platform.

With DragonHawk, with hundreds of cores of next-generation Xeon processors and tens of terabytes of memory, that is a huge fat node. And for some clustered applications, whether the process is scientific or a business one, the model maps into a 3D space and a shared memory system is better. The easiest one for me to get my head wrapped around is weather modeling, but it can also be a complex, multidimensional business process.

You can certainly simulate the weather or the business process with a cluster and use an InfiniBand network to spread this over a large number of nodes. But if communication is your limiting factor, and the model is 10 TB or 20 TB in size, these upcoming platforms have that kind of capacity, and you can take MPI or any other message-passing method you want to come up with and pass the messages through memory. That gives you tens to hundreds of nanoseconds of latency across hundreds of cores and tens of terabytes of main memory.

I think this is starting to be interesting to non-HPC users.

TPM: So do you think that financial modeling, risk management, and similar workloads will be moved onto such shared memory machines? I mean, time is money to these companies.

Kirk Bresniker: Really, what it comes down to is that even on the best low-latency Ethernet switch, it is a couple hundred nanoseconds from port to port. And that is with a super-expensive machine. But what that is not counting is that you also have to go through the Ethernet stack on both ends, and that is a substantial piece of software and makes that ping-pong latency more like 2,000 or 3,000 nanoseconds. And if I compare that to being able to use a shared memory operation, where it might be 100 or 200 nanoseconds total, that is a cache-to-cache kind of latency. So if your model fits within the memory and the core count, then you could have a very cost efficient and space efficient collapsing of a cluster down to one of these large memory machines.
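A back-of-envelope sketch makes the gap concrete. The latency figures below are the ones Bresniker cites in the interview; the function name and the arithmetic are purely illustrative:

```python
# Round-trip latency figures quoted in the interview, in nanoseconds.
ETHERNET_PORT_TO_PORT_NS = 200   # best-case low-latency switch, port to port only
ETHERNET_STACK_RTT_NS = 2_500    # realistic ping-pong once the software stack is counted
SHARED_MEMORY_RTT_NS = 150       # cache-to-cache shared memory operation

def speedup(cluster_ns: float, shared_ns: float) -> float:
    """Factor by which a shared memory operation beats the cluster interconnect."""
    return cluster_ns / shared_ns

print(round(speedup(ETHERNET_STACK_RTT_NS, SHARED_MEMORY_RTT_NS), 1))  # → 16.7
```

On these numbers, collapsing a communication-bound model onto one shared memory machine buys roughly an order of magnitude in round-trip latency.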

This is sort of back to the future.

TPM: That was why we wanted to talk about this.

Kirk Bresniker: It is simple. If communication is not your limiter in your application, then go cheap and cheerful. Use Ethernet or InfiniBand. If you need 1,000 nodes of a couple hundred gigabytes apiece, then that model is not going to fit in a DragonHawk. But if the application is dominated by communication and latency, then a shared memory system is something to look at.

The other attractive thing about this, since this is a back-to-the-future thing, is that as people go back through their MPI codes and to the days when these large memory systems ruled the Top 500 list, they are going to find code stubs in there to do efficient message passing on shared memory machines. They may not have used them in the past fifteen years, but chances are that they are still in there. And after the journey they took from shared memory to clusters, I don't think they will find it too hard to turn back the clock and reutilize some of these things.
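The idea behind those old code stubs, passing a message by writing it into memory the receiver can see rather than pushing it through a network stack, can be sketched in a few lines. This is an illustrative Python analogue using the standard library, not actual MPI; the `writer` function and buffer size are invented for the example:

```python
# Sketch: two processes exchange a message through shared memory instead of
# a socket. The "send" is just a memory copy visible to the other process.
from multiprocessing import Process, shared_memory

def writer(name: str) -> None:
    # Attach to the existing segment by name and write the payload.
    shm = shared_memory.SharedMemory(name=name)
    shm.buf[:5] = b"hello"
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=5)
    p = Process(target=writer, args=(shm.name,))
    p.start()
    p.join()
    print(bytes(shm.buf[:5]))  # b'hello' -- received without a network hop
    shm.close()
    shm.unlink()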

HP's current Superdome machine

TPM: How far can you push these shared memory systems architecturally? HP has pretty sophisticated chipsets and you can push up to 32 or 64 sockets and maybe even as far as 128 sockets, although it has been a long time since anyone has even talked about a machine that large. How far can you push the main memory?

Kirk Bresniker: The first thing we bump up against is not the number of CPU sockets we can stitch together on a fabric, but the number of physical address bits that our microprocessor vendors are handing out. Intel is at 46 bits and Advanced Micro Devices is a little bit ahead with 48 bits.
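The capacity ceiling those bit counts imply is simple arithmetic, since a physical address width of n bits can address 2^n bytes. The function name below is illustrative:

```python
# Maximum physical memory addressable with a given physical address width.
TIB = 2 ** 40  # bytes per tebibyte

def max_physical_memory_tib(address_bits: int) -> int:
    return (1 << address_bits) // TIB

print(max_physical_memory_tib(46))  # 64  -- Intel's 46-bit limit: 64 TiB
print(max_physical_memory_tib(48))  # 256 -- AMD's 48-bit limit: 256 TiB
```

So a machine already at "tens of terabytes" of memory sits uncomfortably close to the 46-bit ceiling, which is exactly the doubling problem Bresniker describes next.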

It is not as simple as adding a couple more bit lines on the microprocessor for the address registers. As you cross over that next line, you have to add another level on your TLBs [translation lookaside buffers, which are caches that hold the data for virtual-to-physical memory address translations]. Then you have to redo your memory handler. It actually ends up being an intrusive change.

With the systems that we have already talked about on our roadmap, if we doubled again, we would exceed the physical address space. So we are bumping up against that.

That being said, I don't think anyone is going to let chip makers hold that back. One of the people on my team was at HP when they had 16-bit microprocessors and we needed to get to a 32-bit address space. So there are proven techniques around to deal with this.

What will be interesting is for us to look at a massive expansion of either direct or windowed access memory, and what will be that memory. We are running into the penultimate generation of flash, the penultimate generation of DRAM, and we have all of these new technologies coming on such as spin-transfer torque RAM, phase change memory, and HP's own memristor. We have several of these threads coming together, and we have the potential to see memory scale once again. We are running up against the limits of the current technologies and architectures, but I think that people are going to be clever enough to go beyond those physical memory limits.

To bring this full circle, for the enterprise customer, they will have not only large pools of memory, but large pools of persistent memory. With these customers, the integrity and persistence of the data is a lot more important than with other customers. Not that this is not important in high-performance computing and scientific simulations. But if I am running a simulation, I can always run it again. If I am tracking a live process model of a business, I don't always get a do-over.

With photonics to stitch these large, persistent memory pools together, there may be a synthesis of what we think of as shared memory systems and shared-nothing models. We might be picking and choosing the best of both breeds on the same systems when we apply them to business processes. That is not to say that there will not continue to be massively parallel scale out problems, but this could be a different style.