Advanced Computing in the Age of AI | Friday, March 29, 2024

Shared Memory Clusters Accelerate Databases 

In-memory databases are going mainstream to accelerate analytics, and if the success of SAP's HANA in-memory database is any guide, then it looks like companies will be looking at deploying in-memory databases for their transaction processing systems, too.

The trouble is that NUMA systems with large memory footprints are not cheap, and neither are the dense main memory sticks used to boost their capacity.

The vSMP Foundation hypervisor from ScaleMP, which glues multiple physical machines together into a virtual shared memory system, gives companies the option to break beyond the confines of a particular machine and build large Linux clusters that behave like a single physical shared memory system. The interesting bit about vSMP is that it allows companies to use a mix of low-cost servers to attain the core counts and memory footprint of a much larger and more expensive X86, RISC, or Itanium system. Or to mix and match skinny and fat nodes as they see fit to get a certain CPU and memory profile.

Shai Fultheim, founder and CEO at ScaleMP, tells EnterpriseTech that the company has more than 4,500 installations of vSMP worldwide now, and is selling in 32 countries. Most of the customers are using vSMP to aggregate the compute and memory capacity of multiple systems for HPC-style workloads, but somewhere on the order of 20 to 25 percent of the paid customers (as distinct from those using the free version of the software) are using vSMP expressly to expand the memory footprint of a single box. And generally, they are doing so to support database and related analytics workloads.

The idea here is to take a relatively expensive server with a fair amount of compute and lash it to a bunch of cheap boxes with cheap chips that are relegated to just running the vSMP hypervisor and providing memory capacity for the primary box. Fultheim says that about a third of current sales for the vSMP memory expansion version of its software are going into enterprise customers rather than into government or academic supercomputing centers. While this is a small part of its current business – somewhere between 7 and 8 percent – it is growing fast.

One popular use of a vSMP platform among enterprises is database acceleration. If you can get more of the database into main memory, then queries and transactions obviously run faster. A flash card plugging into a PCI-Express peripheral port might have an access time on the order of 40 to 60 microseconds, and that is pretty fast compared to a collection of disk drives that a database might be spread over. But thanks to prefetching routines and the fairly predictable nature of database queries, according to Fultheim, vSMP can fetch data from a remote node in a vSMP cluster in under 1 microsecond. This is important for databases like MySQL, which Fultheim says is very sensitive to memory capacity.

The big driver for ScaleMP's sales, however, is the money it can help customers save by doing a virtual NUMA box instead of a real one.

It costs on the order of $3,840 at list prices for a 64 GB memory stick that would plug into systems based on the new "Ivy Bridge-EX" Xeon E7 v2 processors from Intel, or about $60 per GB. A skinnier 32 GB stick costs on the order of $1,280, which is $40 per GB at list price, and you can probably get these for somewhere between $700 and $800 apiece on the street, or roughly between $22 and $25 per GB. So while you can, in theory, get an eight-socket Xeon E7 v2 machine that has 12 TB of capacity across 192 memory slots, there are two problems with trying to buy such a machine to run a big workload inside main memory. First, there is very little discounting on 64 GB DDR3 main memory because of the low volumes at this point in their ramp, and second, that 12 TB is going to cost on the order of $700,000 after a token 5 percent discount. With vSMP, you can build a cluster of smaller machines with similar shared memory for half the price or less of just the memory on this box.
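The list-price arithmetic above can be checked with a few lines of Python. This is only a sketch of the article's figures: the names are illustrative, 1 TB is assumed to be 1,024 GB, and street pricing is deliberately left out.

```python
# List-price figures quoted in the article (DDR3 sticks for Xeon E7 v2 systems).
STICK_64GB_LIST = 3840   # dollars per 64 GB stick
STICK_32GB_LIST = 1280   # dollars per 32 GB stick
MEMORY_SLOTS = 192       # memory slots in the eight-socket machine
DISCOUNT = 0.05          # the "token" 5 percent discount
GB_PER_TB = 1024         # assumption: binary terabytes


def price_per_gb(stick_price, stick_gb):
    """Return dollars per GB for a memory stick."""
    return stick_price / stick_gb


def full_memory_cost(stick_price, slots, discount):
    """Return the cost of filling every slot, after a fractional discount."""
    return stick_price * slots * (1 - discount)


capacity_tb = MEMORY_SLOTS * 64 / GB_PER_TB
total = full_memory_cost(STICK_64GB_LIST, MEMORY_SLOTS, DISCOUNT)

print(f"64 GB stick: ${price_per_gb(STICK_64GB_LIST, 64):.0f} per GB")
print(f"32 GB stick: ${price_per_gb(STICK_32GB_LIST, 32):.0f} per GB")
print(f"capacity with 64 GB sticks: {capacity_tb:.0f} TB")
print(f"memory cost after discount: ${total:,.0f}")
```

Filling all 192 slots with 64 GB sticks works out to a little over $700,000 after the discount, which matches the figure quoted above.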

Another issue is that physical servers have their processor, memory, and I/O capacities set in their chipsets and motherboards – there are only so many sockets, so many different processor variations, numbers of memory slots, and memory stick capacities, after all. Moreover, hardware makers tend to increase these capacities in lockstep, which means that if you need more main memory, you have to take the extra processors as well, even if you don't need the processing capacity, and therefore pay for the software licenses on those cores.

Just ahead of its rollout of vSMP appliances with Cray and IBM last September, ScaleMP launched a freebie version of its memory-lashing hypervisor, called vSMP Foundation Free, which allows machines using InfiniBand links to be hooked together at no charge. The free version of vSMP can span up to eight physical servers and can address up to 1 TB of main memory across those machines. Fultheim says that MySQL shops that cannot afford to buy real NUMA machines from the likes of IBM, Dell, Hewlett-Packard, SGI, and others to scale up their memory are taking whatever servers they have lying around and building a virtual NUMA box.

Fultheim did a little math on a scrap of paper and came to the conclusion that by the end of the year, ScaleMP could have more customers on the freebie version than on the paid version. Each one of those freebie shops is a potential customer for the paid-for vSMP Foundation when they need to bust above 1 TB of memory, and that is the bet Fultheim is willing to make. The important thing is that the freebie version is a leading indicator of sorts for database acceleration and the desire to move more processing into main memory from disk drive or flash storage.

Fultheim can't talk specifically about how the vSMP appliances made by Cray and IBM are selling, except to say "they are doing well" and that the company has just added UK system maker Boston Ltd as an appliance partner. While the Cray and IBM vSMP appliances are based on Intel Xeon E5 and E7 machines, the xScaler-vSMP setup from Boston uses a mix of Opteron and Xeon machines. Specifically, the base compute node is a four-socket machine using sixteen-core Opteron 6386SE processors running at 2.8 GHz. A two-socket Xeon E5-2600 server is used as a memory expansion node, and you can add from four to sixteen of these to get an aggregate main memory capacity of between 2 TB and 8 TB across the cluster. Only the Opteron processors can be used for compute, and that means those are the only ones that need software licenses.
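As a rough sanity check on the Boston configuration, the sketch below backs out how much memory each expansion node would have to contribute. It assumes the 2 TB and 8 TB aggregates count only the expansion nodes and that 1 TB is 1,024 GB; neither detail is stated in the article, and the names are illustrative.

```python
GB_PER_TB = 1024  # assumption: binary terabytes

# Figures quoted for the Boston xScaler-vSMP setup.
COMPUTE_SOCKETS = 4        # four-socket Opteron base node
CORES_PER_SOCKET = 16      # sixteen-core Opteron 6386SE
CONFIGS = {4: 2, 16: 8}    # expansion nodes -> aggregate memory in TB


def memory_per_expansion_node_gb(nodes, aggregate_tb):
    """GB each expansion node supplies if the aggregate is split evenly."""
    return aggregate_tb * GB_PER_TB / nodes


for nodes, aggregate_tb in CONFIGS.items():
    per_node = memory_per_expansion_node_gb(nodes, aggregate_tb)
    print(f"{nodes} expansion nodes: {per_node:.0f} GB each")

# Only the Opteron cores run workloads, so only these need software licenses.
licensed_cores = COMPUTE_SOCKETS * CORES_PER_SOCKET
print(f"cores needing software licenses: {licensed_cores}")
```

Under these assumptions both ends of the range imply 512 GB per expansion node, and the licensed core count stays at 64 no matter how many memory nodes are added.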

Here is a table comparing the vSMP appliances available from IBM, Cray, and Boston with a four-socket ProLiant DL580 Gen8 server and a future sixteen-socket machine called "Project Kraken" from HP, both of which are based on the new Xeon E7 v2 processors from Intel:

[Table: ScaleMP vSMP appliances versus NUMA machines]

As you can see, the vSMP appliances offer more main memory at the same cost as, or a lower cost than, a real NUMA machine from HP.

In conjunction with the release of vSMP Foundation 5.5, launched this week, ScaleMP has rejiggered its pricing scheme to make it easier to deploy for either big compute or big memory workloads. The way it works now, processors are given a rating ranging from 3 to 30 tokens, based on the class of processor, the number of cores, and so on, and vSMP Foundation is licensed at $100 per token. Main memory is licensed at $160 per 32 GB chunk, which is given a rating of one token. If you are using server nodes just for memory expansion, you only pay for the memory portion of the license. This pricing will also allow for licenses to be transferred between different generations of machines, something that was problematic with the prior licensing scheme, according to Fultheim. (ScaleMP has created an online configurator so you can see what license fees for vSMP will be for the current crop of Xeon and Opteron servers for both big compute and big memory workloads.)
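To make the token scheme concrete, here is a minimal sketch of how a license fee might be estimated from the rates quoted above. It is not ScaleMP's configurator: the function name, the per-processor application of the token rating, the rounding up to whole 32 GB chunks, and the 10-token rating in the example are all assumptions for illustration.

```python
import math

CPU_TOKEN_PRICE = 100      # dollars per processor token, as quoted
MEMORY_CHUNK_PRICE = 160   # dollars per 32 GB memory chunk, as quoted
MEMORY_CHUNK_GB = 32


def estimate_license_fee(tokens_per_processor, processors, memory_gb,
                         memory_only=False):
    """Estimate a node's license fee (hypothetical helper).

    Assumes the token rating applies to each processor and that memory
    is billed in whole 32 GB chunks, rounded up.
    """
    chunks = math.ceil(memory_gb / MEMORY_CHUNK_GB)
    memory_fee = chunks * MEMORY_CHUNK_PRICE
    if memory_only:
        # Memory expansion nodes pay only the memory portion.
        return memory_fee
    cpu_fee = tokens_per_processor * processors * CPU_TOKEN_PRICE
    return cpu_fee + memory_fee


# Hypothetical compute node: four processors rated 10 tokens each, 512 GB.
print(estimate_license_fee(10, 4, 512))
# The same memory on an expansion-only node.
print(estimate_license_fee(10, 4, 512, memory_only=True))
```

With these made-up ratings, the compute node comes to $6,560 and the expansion-only node to $2,560, which shows why relegating cheap boxes to memory duty keeps the license bill down.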

The vSMP Foundation 5.5 release supports the new "Ivy Bridge" Xeon E5 v2 and Xeon E7 v2 processors from Intel. The AnyIO feature that ScaleMP was talking about last year, which allows for Ethernet as well as InfiniBand links to be used to create a virtual SMP. The update also includes support for Fusion-io PCI-Express-based flash storage, which is increasingly popular as a means of accelerating database performance as well as data access for other kinds of workloads. And finally, the update includes the ability to have the vSMP hypervisor see Intel Xeon Phi X86 coprocessors and Nvidia Tesla GPU coprocessors in the shared memory system and dispatch work to them.

The basic scalability of the vSMP hypervisor remains the same with the 5.5 update. It can scale across up to 128 nodes with a maximum of 32,768 cores and can address a maximum of 256 TB of shared memory. That is four times the shared memory of SGI's UV 2000 system, which uses NUMAlink interconnects to create the shared memory. No one comes close to the 512 TB of Cray's Urika massively multithreaded machine, of course. ScaleMP has the virtue of running on standard X86 iron, and offers a lower price for NUMA functionality compared to these and other RISC and Itanium alternatives. ScaleMP only supports Linux workloads on top of the vSMP hypervisor, and to be precise, that is Red Hat Enterprise Linux 5 and 6 and SUSE Linux Enterprise Server 11.
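The scaling limits above imply some per-node figures that are easy to back out. The sketch below uses only the numbers quoted in the article; the variable names are illustrative, and the even split across nodes is an assumption made for the sake of the arithmetic.

```python
# vSMP Foundation limits quoted in the article.
MAX_NODES = 128
MAX_CORES = 32_768
MAX_MEMORY_TB = 256

# Cray's massively multithreaded machine, as cited in the text.
CRAY_MEMORY_TB = 512

# Assumption: cores and memory are spread evenly across the nodes.
cores_per_node = MAX_CORES // MAX_NODES
memory_per_node_tb = MAX_MEMORY_TB / MAX_NODES

# "Four times the shared memory" of the UV 2000 implies this ceiling.
implied_uv2000_tb = MAX_MEMORY_TB / 4

print(f"cores per node at the limit: {cores_per_node}")
print(f"memory per node at the limit: {memory_per_node_tb:.0f} TB")
print(f"implied UV 2000 ceiling: {implied_uv2000_tb:.0f} TB")
print(f"Cray ceiling relative to vSMP: {CRAY_MEMORY_TB / MAX_MEMORY_TB:.0f}x")
```

At the limit, each node would carry 256 cores and 2 TB of memory, so the ceilings are well beyond what the two-socket and four-socket boxes discussed here can supply per node.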

The interesting thing to contemplate is what ScaleMP might do to tune vSMP to accelerate in-memory databases. Oracle's TimesTen database has been around for many years and is used in Oracle's Exalytics appliances for zippy analytics. IBM has its new BLU Accelerator add-ons to DB2, and of course there is SAP HANA. HANA is being used for both analytics and for transaction processing these days, and while it is fairly simple to partition data across multiple nodes to do analytics, a complex ERP system with countless interrelated tables cannot be so easily partitioned across a cluster of nodes.

This seems like a job for NUMA, whether it is done in hardware or software, and in fact, this is precisely why Hewlett-Packard is working on its "Project Kraken" machine and SGI is working on its "HANA Box" variant of the "UltraViolet" UV 2000 system. If these system makers are tuning up their iron for SAP HANA, it stands to reason that ScaleMP sees a similar opportunity.

EnterpriseAI