Advanced Computing in the Age of AI | Saturday, June 25, 2022

The Math On Big NUMA Versus Clusters 

Even if you are not thinking of buying a big Sparc/Solaris server, some new math from Oracle might get you thinking about shared memory machines as opposed to clusters.

This is particularly important as Intel is getting ready to revamp its high-end Xeon E7 processor line, and SGI, Hewlett-Packard, and perhaps others are forging shared memory machines based on these Xeon E7 chips that go well beyond the normal two and four sockets. SGI, as EnterpriseTech has reported, is working on a "HANA Box" variant of its UV 2000 system that will span up to 64 TB of shared memory, tricked out with the just-launched "Ivy Bridge-EX" Xeon E7 v2 processors and tuned specifically to run SAP's HANA in-memory database. HP is working on "DragonHawk," a system based on its own chipset and interconnect paired with the Xeon E7 v2 processors, which will span hundreds of cores and tens of terabytes of shared memory. A variant of this system in the works, code-named "Kraken," will also be tuned up for SAP HANA.

In days gone by, during the dot-com boom, when processing capacity was a lot more scarce than it is today, Sun Microsystems had a tidy business selling its "Starfire" E10000 NUMA systems to just about every startup and plenty of enterprises looking to offload work from IBM mainframes or aggregate it from proprietary minicomputers. The Starfire systems were kickers to the Cray CS6400, a line that Sun bought from Cray in 1996 and from which it literally minted billions of dollars. The Starfire was the most scalable RISC server on the market at the time, with its 64 sockets and a whopping 64 GB of main memory across its sixteen system boards.

Oracle's Sparc M6-32, which was announced last fall, is the great-great-grandchild of the Starfire system, and it is also one of the most scalable systems on the market today. The fact that it is not based on an X86 architecture makes it a tough sell in a lot of datacenters, but because Oracle controls its own chipsets and interconnects, it can scale the M6 system up to 96 sockets and 96 TB of shared memory. The company has not done that yet, so SGI's "UltraViolet" UV 2000 system, with 256 sockets and 64 TB of shared memory, still rules when it comes to scalability among general purpose machines. The word on the street is that HP will span sixteen sockets with DragonHawk, but it could do more than that. The chipsets it created for its Itanium-based Superdome 2 machines span up to 32 sockets, and HP has engineered 64-socket machines in the past, too. And, of course, Cray's Eureka massively multithreaded machine, based on its own XMT-2 processors, is the most scalable shared memory system ever created. It can have 8,192 processors, 1.05 million threads, and 512 TB of shared memory in a single system image.

Suffice it to say, there could be a resurgence of shared memory systems, particularly among customers whose workloads are sensitive to bandwidth and latencies between nodes and will run better in a shared memory system than they do across clusters of standard two-socket X86 nodes lashed together with Ethernet or InfiniBand fabrics.

The fun bit of math that Oracle did in a recent whitepaper to try to peddle its new M6-32 system was to compare one of these big bad boxes to a cluster of its own Sparc T5-2 systems. Take a look:


The Sparc M6 processor has twelve cores running at 3.6 GHz, so the 32-socket machine has a total of 384 cores and 3,072 threads in a single system image. The M6 chip has 48 MB of L3 cache shared across those dozen cores and can address up to 1 TB of main memory per socket. EnterpriseTech attempted to get the precise memory configuration used in this comparison, but it was not available at press time. Based on the comment that there was room to expand the memory by a factor of two in this chart, it stands to reason that this M6-32 machine was configured with 16 TB of memory across those 32 sockets.

The shared memory system is enabled by a homegrown interconnect called "Bixby," and as you can see, it provides significant advantages in terms of bandwidth and low latency between the processors compared to a network of sixteen Sparc T5-2 two-socket servers based on the Sparc T5 chip. The Sparc T5 also runs at 3.6 GHz and uses the same "S3" core as the M6 chip. However, the T5 has sixteen cores on the die, only 8 MB of L3 cache shared across those cores, and a maximum of 512 GB of memory per socket. Oracle has normalized the memory across the shared memory system and the cluster so both have 16 TB, but the T5 cluster has 512 cores (33 percent more than the M6-32 system) and 4,096 threads (again, 33 percent more).

The interesting thing is that the M6-32 system has 100X lower latency between processors than the T5 cluster connected by 10 Gb/sec Ethernet switches, and 37X more bandwidth across that interconnect. And, amazingly, Oracle is charging about the same price, a little more than $1.2 million, for the two machines.
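The core, thread, and memory arithmetic behind Oracle's comparison is easy to check. Here is a quick sketch using only the socket, core, thread, and memory figures cited above (nothing else is assumed):

```python
# Back-of-the-envelope comparison: Sparc M6-32 SMP versus a
# sixteen-node Sparc T5-2 cluster, using the figures cited above.

# M6-32: 32 sockets, 12 cores per socket, 8 threads per core (S3 core)
m6_cores = 32 * 12            # 384 cores
m6_threads = m6_cores * 8     # 3,072 threads

# T5-2 cluster: 16 nodes x 2 sockets x 16 cores, 8 threads per core
t5_cores = 16 * 2 * 16        # 512 cores
t5_threads = t5_cores * 8     # 4,096 threads

# The cluster carries a third more cores and threads for about the same money
core_advantage = (t5_cores - m6_cores) / m6_cores   # 0.333... -> "33 percent more"

# Memory is normalized: 16 nodes x 2 sockets x 512 GB = 16 TB on the cluster,
# matching the 16 TB configured on the M6-32 (half its 1 TB/socket maximum)
t5_memory_tb = 16 * 2 * 512 / 1024    # 16.0 TB
```

So on raw compute per dollar the cluster wins; what the whitepaper is really selling is the 100X latency and 37X bandwidth edge of the Bixby interconnect.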

The more interesting comparison perhaps would start with a 96-socket M6-96 with 96 TB of memory, or better still, twice that with even fatter memory sticks when they become available. Against this, it would be intriguing to see what a Sparc T5 cluster with 10 Gb/sec Ethernet would cost, and then perhaps a bare-bones hyperscale system using an InfiniBand interconnect and equipped with the latest twelve-core "Ivy Bridge-EP" Xeon E5 chips from Intel.

6 Responses to The Math On Big NUMA Versus Clusters

  1. Steve Thomas says:

    In my own realm of things (extreme-scale Internet apps), the prevailing doctrine for solving “big” problems has been to break the problem up across thousands of cheap servers in a grid arrangement, adding capacity in one dimension or another (queries per second, data size) depending on what you need to grow.

    This doctrine came about for a variety of reasons, but the essential drivers were that the ultimate limits of a single box would paint you into a corner, and the (probably correct) assumption that the premium charged by big-box vendors made the “lots-of-cheap-boxes” approach much cheaper.

    However, this doctrine adds tremendously to system complexity and either practically or absolutely limits your application’s capabilities. A single-system application is always going to be better and easier, all other things being equal.

    So what if those old assumptions have changed? What if you can buy a single box that’s as big as you can imagine needing for [longer than the next paradigm shift]? What if the big-box vendors are ready to price their offerings competitively with the small-commodity-box approach on a cost-per-performance/capacity basis?

    That 32 TB system alluded to above can easily be yesteryear’s 2,000-node grid of 16 GB commodity systems–except that you don’t have tons of waste communicating over a network layer, 2,000 operating system images’ worth of memory, layers of software and hardware to break up and re-assemble your problem, etc. etc.

    I’d be curious what others think. Is the grid dead? As a developer I sure hope so, as I enjoy adding direct user-visible features to my system a lot more than writing code to coordinate a grid…

  2. Steve Conway says:

    Hi! Actually, YarcData’s Urika big data system scales to 512TB of global memory. Best, Steve Conway

  3. Timothy Prickett Morgan says:

    True on the YarcData, of course. And a crazy number of threads, too.

  4. Mark says:

    The Enterprise 10000, aka “Starfire”, was not a NUMA system. It was a UMA SMP. That was important in the 1990s when operating systems, databases, and applications were not NUMA aware, leading to unpredictable performance on NUMA systems. This is why Sun’s E10K did so well running large OLTP databases compared to the SGI Origin, HP V2500, and Sequent NUMA-Q systems.

    Sun’s 4800/6800 “Serengeti” systems were its first NUMA systems, but Sun liked to call them “nUMA” (small-n NUMA) because the local versus remote latency difference was small. These used a snoopy-based coherence mechanism and had only two main levels of latency (board and system). The 15K “StarCat” was the first “real NUMA” system, which used directory-based coherence and had three main levels of latency (board, directory hit, and non-directory hit).

    At the same time, the AMD Opteron x86 systems came out, which were also small-n snoopy NUMA systems.

    In the early 2000s, with AMD Opteron, Sun Serengeti, HP Superdome, and IBM Regatta, NUMA became the norm. Windows and Linux provided NUMA support to x86 for AMD Opteron, which was perfected by the time Intel’s NUMA Nehalem processors arrived.

    The real question is how big a server you need. The new 4-socket, 60-core, 6 TB E7 v2 systems have been announced. A single system would likely approach 5 million tpmC, which meets many enterprise requirements, and four nodes could be clustered with Oracle RAC for larger workloads.

    Big systems like the IBM Power 795, SPARC M6-32, and HP Superdome 2 are often partitioned into smaller logical systems.

    There is probably some quantum (tpmC, SPECint, GHz, GB) that defines what size of logical server is needed. If that will fit into a 2-socket or 4-socket x86 system, it is hard to justify a logical partition on a RISC/UNIX system.

    • Steve Thomas says:

      For anybody starting a new project–and committing to an architectural approach–the question is not, “how much capacity do you need”, but rather, “what is the absolute max capacity you will ever need”.

      For Internet-based apps, the answer can easily be 1000x from where you started. That’s why many have approached the problem by creating thousands of little servers.

      If you can envision your problem running on a single system even if it becomes something everybody uses, then your scalability path will be in-system expansion, and your application architecture will be much simpler.

      (This brings up another important sea-change that might drive this new doctrine: the cloud. Any growing company [or app within a big company] will use the cloud to grow–it’s a perfect fit, financially. This handles one major downside to a single-system approach, which is start-up costs.)

  5. Renu Raman says:

    Just because you can build one does not mean you should build one. While the M5 -> M7 roadmap is impressive, in the context of practical use cases it is over-designed. Here’s why.

    1) The majority (>95%) of databases in the world are <1 TB, and individual databases are not growing at >60%, i.e. sub-Moore's law. Even where they did, over the past 15 years (as witnessed by the web design point), the programming model, while simpler for shared memory, moved toward message passing.
    3) As another respondent noted, you don't need a gazillion tpmC. While analytics workloads do demand higher transaction rates, they are largely read-only. Update-oriented transaction workloads are not growing in size dramatically. Analytics can be handled much better in scale-out systems.
    4) Eric Brewer's CAP theorem captured the essence of the dichotomy. SMPs address consistency and partitioning at the expense of availability, while clusters address availability and consistency (to some extent) at the expense of partitioning.
    5) Interconnects (like 10G) are slow. When interconnects approach 0.75x of local memory bandwidth (we are getting there), the problem of data placement and partitioning becomes less severe. We are going to see interconnect bandwidth go up 100x in the next 5-7 years. One example: compare the M6/M7 scalability links with PCIe, and consider what happens when we get to 100 Gbit links in the next 2-3 years.
    6) I say the M6/M7 is over-designed because of all the above. Some of the die area would be better spent improving single-thread performance (the M7's S3 core is still slower than both Power8 and x86), while Intel can afford to spend more die area on coherent links and additional PCIe/new interconnects. A fully interconnected Haswell system (8-way) with 64 PCIe lanes per Haswell will swamp the M7-based system design point for 98% of practical deployment cases.

    Thus my comment about the M7: they did the massive directory-based system because they can, rather than because anyone needs it. It was also a way to differentiate from Intel-based systems.

    Even with Starfire, the majority of use cases involved domaining the larger system (disaggregation). With new interconnects, it's cheaper, faster, and simpler to aggregate than to disaggregate.

