Advanced Computing in the Age of AI | Friday, August 12, 2022

NUMA’s Revenge: The Resurgence of Shared Memory Systems 

Hewlett-Packard may be known as the volume player in the X86 server business, which has been dominated by two-socket servers for the past decade and a half. But the company also has lots of expertise in building much heftier shared-memory systems.

With Project Odyssey, HP is taking the engineering from its Superdome 2 systems, which have a history that dates back to the other side of the dot-com boom and reaches all the way back to the Convex supercomputer business that HP acquired in 1995, and applying it to X86-based machines. Suffice it to say, with its HP, Digital Equipment, and Convex roots, the company knows big iron. It also knows that after decades of distributed computing, some customers are starting to take a hard look at massive shared memory systems again. And they are doing so because their applications are driving them to.

To get a sense of what is happening at the top-end of the systems market, EnterpriseTech sat down with Kirk Bresniker, who is chief technologist for HP's Business Critical Systems division and who has been designing HP's largest systems since 1993, initially based on HP's own PA-RISC processors, then for Itanium processors from Intel, and now, with the future "DragonHawk" systems, for Intel's upcoming "Ivy Bridge-EX" Xeon E7 processors.

Timothy Prickett Morgan: What are the current needs among large enterprises when it comes to shared memory systems? How big is big enough and how are they addressing scalability and reliability needs? How does it work out there in the real world, and how is that changing?

Kirk Bresniker: I think we are still seeing a continuation of the traditional solutions, which are large-scale machines and high availability clusters, including products like our HP ServiceGuard in metro clusters or continental clusters to achieve disaster tolerance and high availability. For some of those big databases, we are seeing the traditional model continue on. A lot of that data was accumulated over a long period of time, and the applications above it are tailored to those databases. So there is a little bit of inertia there. We are seeing Superdome 2 systems and the HP-UX operating system carry through. But those are not the greenfield applications.

I think what we are seeing now is something else. If you had asked me a year and a half ago whether I would see a need for a large system with a shared memory address space on the X86 side, I might have thought there were a couple of use cases. Maybe people would, in time, move traditional Unix databases over to a new platform, where they wanted to absolutely minimize their transition investment by moving from big scale-up Unix to big scale-up X86. That is really what our "Project Odyssey" effort was about.

TPM: But the funny thing is that I am seeing, somewhat unexpectedly, a resurgence in big NUMA servers. I mean, people still call them SMP systems out of habit, but they are really NUMA boxes.

Kirk Bresniker: Something different has emerged here. We started seeing it first with our ProLiant DL980, which has eight "Westmere" Xeon E7 sockets, and with the plan for our "DragonHawk" product, where we are bringing Superdome scalability to the next-generation Intel Xeon. And it has really been driven by SAP's efforts to push its HANA in-memory database not just into analytics, which has traditionally been a scale-out application, but into online transaction processing, which is really calling out for that big memory space.

This is sort of changing the tide. We have seen people realize that they can move from millisecond latency on even moderate-sized clusters to nanosecond latencies by collapsing it down to a single shared memory platform.

With DragonHawk, with hundreds of cores of next-generation Xeon processors and tens of terabytes of memory, that is a huge fat node. And for some clustered applications that model into a 3D space, whether it is a scientific process or a business process, a shared memory system is better. The easiest one for me to get my head wrapped around is weather modeling, but it can also be a complex, multidimensional business process.

You can certainly simulate the weather or the business process with a cluster, using an InfiniBand network to spread the model over a large number of nodes. But if communication is your limiting factor, and the model is 10 TB or 20 TB in size, and these upcoming platforms have that kind of capacity, then you can take MPI or any other message-passing method you want to come up with and pass the messages through memory. That gives you tens to hundreds of nanoseconds of latency over hundreds of cores and tens of terabytes of main memory.

I think this is starting to be interesting to non-HPC users.

TPM: So do you think that financial modeling, risk management, and similar workloads will be moved onto such shared memory machines? I mean, time is money to these companies.

Kirk Bresniker: Really, what it comes down to is that even the best low-latency Ethernet switch is a couple hundred nanoseconds from port to port, and that is with a super-expensive switch. What that does not count is that you also have to go through the Ethernet stack on both ends, and that is a substantial piece of software that makes the ping-pong latency more like 2,000 or 3,000 nanoseconds. Compare that to a shared memory operation, which might be 100 or 200 nanoseconds total, a cache-to-cache kind of latency. So if your model fits within the memory and the core count, you can have a very cost-efficient and space-efficient collapsing of a cluster down to one of these large memory machines.
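Those figures lend themselves to a quick sanity check. The numbers below are the illustrative latencies quoted in the interview, not measurements:

```python
# Back-of-the-envelope comparison of the round-trip latencies quoted above.
# These are the interview's illustrative figures, not benchmark results.
ethernet_port_to_port_ns = 200   # best-case low-latency switch, port to port
ethernet_stack_rtt_ns = 2500     # ping-pong once both software stacks are counted
shared_memory_rtt_ns = 150       # cache-to-cache operation, roughly 100-200 ns

speedup = ethernet_stack_rtt_ns / shared_memory_rtt_ns
print(f"shared memory is roughly {speedup:.0f}x lower latency per message")
```

On these assumed figures, collapsing the cluster buys better than an order of magnitude per message, before counting any bandwidth effects.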

This is sort of back to the future.

TPM: That was why we wanted to talk about this.

Kirk Bresniker: It is simple. If communication is not your limiter in your application, then go cheap and cheerful. Use Ethernet or InfiniBand. If you need 1,000 nodes of a couple hundred gigabytes apiece, then that model is not going to fit in a DragonHawk. But if the application is dominated by communication and latency, then a shared memory system is something to look at.

The other attractive thing about this, since this is a back-to-the-future thing, is that as people go back through their MPI codes, back to the days when these large memory systems ruled the Top 500 list, they are going to find code stubs in there for doing efficient message passing on shared memory machines. They may not have used them in the past fifteen years, but chances are that they are still in there. And after the journey they took from shared memory to clusters, I don't think they will find it too hard to turn back the clock and reuse some of these things.
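The flavor of those shared-memory stubs can be sketched in miniature. The toy below is not MPI; it is a hypothetical illustration using Python's `multiprocessing.shared_memory` as a stand-in, to show the core idea: the "message" is delivered by an ordinary store into a segment both processes have mapped, with no network stack in the path.

```python
import multiprocessing as mp
from multiprocessing import shared_memory

def producer(name):
    # Attach to the segment the parent created and "send" with a plain store.
    shm = shared_memory.SharedMemory(name=name)
    shm.buf[:5] = b"hello"
    shm.close()

def pass_message():
    """Pass a message through shared memory instead of a network stack."""
    shm = shared_memory.SharedMemory(create=True, size=16)
    try:
        ctx = mp.get_context("fork")  # POSIX-only; keeps the sketch simple
        p = ctx.Process(target=producer, args=(shm.name,))
        p.start()
        p.join()
        return bytes(shm.buf[:5])     # the "receive" is just a load
    finally:
        shm.close()
        shm.unlink()

if __name__ == "__main__":
    print(pass_message())  # b'hello'
```

Real codes would use something like the MPI-3 shared-memory window interface (`MPI_Win_allocate_shared`) or an MPI library's shared-memory transport rather than anything like this; the point is only that the send and receive degenerate into stores and loads.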

HP's current Superdome machine

TPM: How far can you push these shared memory systems architecturally? HP has pretty sophisticated chipsets and you can push up to 32 or 64 sockets and maybe even as far as 128 sockets, although it has been a long time since anyone has even talked about a machine that large. How far can you push the main memory?

Kirk Bresniker: The first thing we bump up against is not the number of CPU sockets we can stitch together on a fabric, but the number of physical address bits that our microprocessor vendors are handing out. Intel is at 46 bits and Advanced Micro Devices is a little bit ahead with 48 bits.

It is not as simple as adding a couple more bit lines on the microprocessor for the address registers. As you cross over that next line, you have to add another level on your TLBs [translation lookaside buffers, which are caches that hold the data for virtual-to-physical memory address translations]. Then you have to redo your memory handler. It actually ends up being an intrusive change.

With the systems that we have already talked about on our roadmap, if we doubled again, we would exceed the physical address space. So we are bumping up against that.
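The ceiling Bresniker describes is easy to work out: 46 physical address bits cap a single coherent memory at 64 TB, and 48 bits at 256 TB, so a machine already holding tens of terabytes cannot double many more times. A quick check:

```python
# Maximum directly addressable memory for a given physical address width.
def max_memory_tib(address_bits):
    return 2 ** address_bits / 2 ** 40  # bytes expressed in TiB

print(max_memory_tib(46))  # Intel at the time: 64.0 TiB
print(max_memory_tib(48))  # AMD at the time:  256.0 TiB
```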

That being said, I don't think anyone is going to let chip makers hold that back. One of the people on my team was at HP when they had 16-bit microprocessors and we needed to get to a 32-bit address space. So there are proven techniques around to deal with this.

What will be interesting is for us to look at a massive expansion of either direct or windowed access memory, and what will be that memory. We are running into the penultimate generation of flash, the penultimate generation of DRAM, and we have all of these new technologies coming on such as spin-transfer torque RAM, phase change memory, and HP's own memristor. We have several of these threads coming together, and we have the potential to see memory scale once again. We are running up against the limits of the current technologies and architectures, but I think that people are going to be clever enough to go beyond those physical memory limits.

To bring this full circle, for the enterprise customer, they will have not only large pools of memory, but large pools of persistent memory. With these customers, the integrity and persistence of the data is a lot more important than with other customers. Not that this is not important in high-performance computing and scientific simulations. But if I am running a simulation, I can always run it again. If I am tracking a live process model of a business, I don't always get a do-over.

With photonics to stitch these large, persistent memory pools together, there may be a synthesis of what we think of as shared memory systems and shared-nothing models. We might be picking and choosing the best of both breeds on the same systems when we apply them to business processes. That is not to say that there will not continue to be massively parallel scale out problems, but this could be a different style.

8 Responses to NUMA’s Revenge: The Resurgence of Shared Memory Systems

  1. TPM: How far can you push these shared memory systems architecturally? HP has pretty sophisticated chipsets and you can push up to 32 or 64 sockets and maybe even as far as 128 sockets, although it has been a long time since anyone has even talked about a machine that large. How far can you push the main memory?

    SGI has been shipping ccNUMA systems in the 4 TB and above range since 2006 beginning with the SGI Altix 4700 series Intel Itanium based servers.

    In 2009 SGI dropped Itanium, transitioned to x86 and introduced the Ultraviolet (UV) family of Intel Xeon based servers. The UV family began with UV 1000 which supported up to 16 TB of globally addressable cache coherent memory.

    SGI’s current UV incarnation, UV 2000, supports up to 64 TB of globally shared DRAM memory in a single system image. The first UV 2000 shipment went to Dr. Stephen Hawking’s UK Computational Cosmology Consortium at the University of Cambridge in July of 2012. That system today consists of 232 Xeon E5-4600 processors, 32 Xeon Phi coprocessors, and 14.5 TB of globally shared memory.

  2. Timothy Prickett Morgan says:

    Mea culpa, Michael. I was thinking of volume IBM, Dell, and HP X86 systems when I did the interview.

  3. dbs says:

    SGI is an important data point. They’ve been doing NUMA the entire time, while all the other HPC vendors have given up. But they haven’t gotten much of a competitive advantage from it so far. Maybe that will change.

    You also forgot Oracle and Fujitsu, which can deliver 10s of sockets and 10s of TBs in a single SPARC server. I’m not sure how many address bits the latest SPARC CPUs have, but they probably have more than the x86_64.

  4. Robin Harker says:

    @Michael Anderson

    If you want the holy grail of very large (i.e. 256 terabytes of shared memory) x86 ccNUMA using commodity hardware, look at NumaScale from Norway, some of whom, when at Norsk Data, invented SCI (scalable coherent interconnect), which is the spec used by Convex all those years ago.

    @dbs Actually it’s incorrect to claim that all vendors apart from SGI have given up on NUMA, as HP’s Superdome was originally from Convex, and it was they, not SGI, who shipped the first ccNUMA platform.

  5. […] to the open source Linux effort and creating scalable NUMA machines based on Xeons. (See EnterpriseTech’s interview with Kirk Bresniker, who is chief technologist for HP’s Business Critical Systems division, about the resurgence of […]

  6. […] Xeon E7 v2 processors and tuned specifically to run SAP’s HANA in-memory database. HP is working on “DragonHawk,” which is based on its own chipset and interconnect, which is based on the Xeon E7 v2 […]

  7. […] like a job for NUMA, whether it is done in hardware or software, and in fact, this is precisely why Hewlett-Packard is working on its “Project Kraken” machine and SGI is working on its “HANA Box” variant of the “UltraViolet” UV 2000 […]

  8. […] HP’s techies have proposed porting the venerable HP-UX Unix variant to Xeon processors a number of times, but each time the effort was shot down by the top brass at the company for reasons it has not explained. It is not hard to figure. For the most part, NonStop customers write their own applications, most often in Java these days, so moving their applications from Itanium to Xeon is not a big deal. The Java stack takes care of most of it, and HP is offering a source code guarantee for those who use other NonStop compilers to write their code. On HP-UX, however, most customers are using an Oracle database and third party applications, generally from Oracle or SAP. It seems very unlikely that any of the system software providers will do ports of their code to HP-UX on Xeon, given the relatively small base. SAP has clearly tied its future to Xeon E7 machines running SUSE Linux and its HANA in-memory database, and Oracle has no interest in helping HP-UX sell against its own Sparc SuperCluster or Xeon Exadata “engineered systems,” which run its own Solaris and Oracle Linux operating systems. If HP had done an X86 port of HP-UX many years ago, long before either SAP or Oracle were interested in systems, there might have been a chance for HP-UX on Xeon and Opteron processors. Knowing this, HP has instead pitched Linux and Windows on its future Xeon E7 “Project Odyssey” machines, which will scale up to sixteen sockets and 16 TB of main memory in a single image using NUMA clustering. […]

Add a Comment