
IBM Takes On Big Workloads With Power8 Enterprise Systems 

Customers running big workloads on IBM's largest Power Systems servers got a preview this week of the high-end Power8 machines that the company will be rolling out over the next several months. The new Power E870 and E880 machines represent a merger of sorts between two classes of Power systems and are designed to bring the modularity and relatively low price of one set of machines and the high availability and resiliency of the other into the same product, thus simplifying IBM's product line and perhaps goosing its profit margins.

Now that IBM has sold off its System x division to Lenovo Group for $2.1 billion, a deal which was completed in major markets last week and which will close in the coming months in the many markets where IBM and Lenovo both do business, IBM is free to go directly after Intel's Xeon machines with its Power alternative and also to continue to push into Oracle Sparc and Hewlett-Packard Itanium accounts as it has been doing for decades.

For about a decade after shipping the Power4 servers back in 2001, IBM had the performance and price/performance advantage over its rivals in the Unix systems market, and used that advantage to go from being a distant third in a Unix server market that represented about half of worldwide server revenues to the undisputed leader of that market. However, economic hardship generally causes platform transitions, and the Unix market has been in steady decline since the Great Recession and is now half its size. Many workloads that might otherwise have run on big iron that uses non-uniform memory access (NUMA) clustering have been rewritten, or written brand new from the ground up, to be distributed across multiple nodes instead of running on a big shared memory system. The upshot is that the Unix market now accounts for about 15 percent of server revenues and Linux iron now accounts for about a third. Mainframes account for around 5 percent, and Windows makes up the bulk of the remaining server platform sales. Linux is on the rise, even for big iron, as in-memory processing takes off, but Unix still has a place running very large databases and applications in the enterprise datacenter.

Oracle has done a good job of making the sales pitch for its Sparc M series of NUMA machines, and indeed all such shared memory clusters, if you want to use that term. The idea is that for applications that are extremely sensitive to network latencies, a NUMA machine with a proprietary interconnect and a shared memory space will do better than a commodity cluster using even the fastest Ethernet or InfiniBand.

In theory, the Oracle Sparc M systems can scale up to 96 processor sockets and up to 96 TB of main memory in a single system image using the "Bixby" interconnect, but thus far Oracle has only pushed up to 32 sockets using the 12-core Sparc M6 processors. That interconnect has a 150 nanosecond latency across the mesh and provides 24 Tb/sec of interconnect bandwidth. Next year's Sparc M7, which Oracle has already previewed, will pack 32 cores on a die and with a 32-socket implementation of a next-generation Bixby interconnect, it will be able to bring 1,024 cores, 8,192 threads, and up to 64 TB of main memory to bear on a single workload.

This is the same argument made by SGI with its UV 2000 machines, which can have up to 256 processor sockets (that's 2,048 cores and 4,096 threads) and 64 TB of shared memory lashed together using the NUMAlink 6 interconnect.

The top-end Power E880 machine from IBM, using Power8 processors and their on-chip NUMA interconnect, will not come close to this, with 16 sockets and 16 TB of memory when it is fully extended. The Power 795, launched in April 2010, had 32 sockets using eight-core Power7 chips and could bring 16 TB of main memory to bear on a single workload. But with the Power E880 design, IBM has added another NUMA port to each chip and flattened the interconnect, which reduces the number of hops data has to take as it moves around the NUMA cluster and thereby boosts performance. The upshot is that IBM can get the same performance in a 128-core Power8 system (sixteen sockets with eight cores per socket) as it used to get with a Power 795 with 32 sockets using eight-core Power7 processors. Next year, the company will deliver a full twelve-core Power8 chip, bringing the core count up to 192 in a single image, which will deliver 1,536 threads across that 16 TB of memory in a fully loaded box.
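
To put a number on that claim, here is a back-of-the-envelope sketch in Python. It uses only the socket and core counts quoted above; the "equal performance" premise is IBM's claim rather than a measurement, and the linear scaling applied to the twelve-core chip is an assumption of this sketch that a real NUMA box will not fully reach.

```python
# Rough arithmetic on the core counts quoted in the article.
# Assumption: "same performance" is taken at face value from IBM's claim.


def total_cores(sockets: int, cores_per_socket: int) -> int:
    """Total cores in a single system image."""
    return sockets * cores_per_socket


power795_cores = total_cores(32, 8)     # Power7, 32 sockets
e880_eight_core = total_cores(16, 8)    # Power8, initial chip
e880_twelve_core = total_cores(16, 12)  # Power8, full chip due next year

# If 128 Power8 cores match 256 Power7 cores, each Power8 core is doing
# roughly twice the work of a Power7 core at the system level.
per_core_uplift = power795_cores / e880_eight_core  # 2.0

# Assumption: performance scales linearly with core count. This is an
# upper bound that ignores clock speed and NUMA overhead.
linear_gain_twelve_core = e880_twelve_core / e880_eight_core  # 1.5

if __name__ == "__main__":
    print(f"Power 795 cores:         {power795_cores}")
    print(f"E880 cores (8-core):     {e880_eight_core}")
    print(f"E880 cores (12-core):    {e880_twelve_core}")
    print(f"Implied per-core uplift: {per_core_uplift:.1f}x")
    print(f"Linear upper bound:      {linear_gain_twelve_core:.1f}x")
```

The 1.5x figure is a ceiling under that linear assumption, not a forecast.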

Even Hewlett-Packard is getting back into the game with its "Project Kraken" Xeon E7 system, which has up to 16 sockets (for 240 cores and 480 threads) across up to 24 TB of main memory. The Kraken system uses a modified version of the sx3000 chipset used in HP's Superdome 2 machines to lash together eight two-socket nodes.
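
For those keeping score, the core and thread counts quoted for these machines can be rebuilt from sockets, cores per socket, and threads per core. The sketch below does that in Python. The per-socket and per-core values are inferred by dividing the totals quoted above, not taken from vendor spec sheets, and the `BigIron` class is a hypothetical name used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BigIron:
    """One shared memory system, described by the figures in the article."""
    name: str
    sockets: int
    cores_per_socket: int   # inferred: total cores / sockets
    threads_per_core: int   # inferred: total threads / total cores
    memory_tb: int

    @property
    def cores(self) -> int:
        return self.sockets * self.cores_per_socket

    @property
    def threads(self) -> int:
        return self.cores * self.threads_per_core


SYSTEMS = [
    BigIron("Oracle Sparc M7 (32 sockets)", 32, 32, 8, 64),
    BigIron("SGI UV 2000", 256, 8, 2, 64),
    BigIron("IBM Power E880 (twelve-core)", 16, 12, 8, 16),
    BigIron("HP Project Kraken", 16, 15, 2, 24),
]


def summarize(systems):
    """Return one formatted line per system."""
    return [
        f"{s.name:<30} {s.sockets:>4} sockets {s.cores:>6,} cores "
        f"{s.threads:>6,} threads {s.memory_tb:>3} TB"
        for s in systems
    ]


if __name__ == "__main__":
    for line in summarize(SYSTEMS):
        print(line)
```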

While many of us love the elegance and craft in the design of such behemoths, even with a resurgence in in-memory processing, it is very unlikely that Oracle, SGI, or IBM will sell more than high hundreds to low thousands of such machines. It would certainly make life easier for programmers if they did, because a shared memory system looks more or less like a big workstation to a developer. (Plenty of tuning has to be done in a NUMA system to get the most performance out of these boxes, and given the cost of such big iron, that tuning is not optional.)

IBM was the aggressive innovator with its Power iron back in the early 2000s, but this time around Oracle and SGI are pushing the limits. IBM seems more inclined to give its current customers some more headroom and leave it at that, and would no doubt argue that most in-memory work can easily fit into its 16-socket systems. And thus it seems that IBM has designed the Power E870 and E880 machines launched today at the company's Enterprise2014 event in Las Vegas with practicality and lower cost in mind. Steve Sibley, director of worldwide product management for IBM's Power Systems division, tells EnterpriseTech that the goal was for the E870 and E880 using the Power8 processors to deliver somewhere between 35 and 40 percent more performance at roughly the same price as the Power 770+ and Power 780+ systems using the Power7+ chips from October 2012.
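
What does 35 to 40 percent more performance at the same price mean for price/performance? A minimal Python sketch, assuming the price really is flat (the `price_ratio` parameter is a hypothetical knob for anyone who wants to test a different assumption):

```python
def cost_per_performance_change(perf_gain: float, price_ratio: float = 1.0) -> float:
    """Fractional change in price per unit of performance.

    perf_gain   -- performance increase, e.g. 0.35 for 35 percent
    price_ratio -- new price divided by old price (1.0 means unchanged)
    """
    return price_ratio / (1.0 + perf_gain) - 1.0


low = cost_per_performance_change(0.35)
high = cost_per_performance_change(0.40)

if __name__ == "__main__":
    print(f"35 percent more performance: {low:+.1%} cost per unit of work")
    print(f"40 percent more performance: {high:+.1%} cost per unit of work")
```

In other words, the cost per unit of work drops by roughly 26 to 29 percent, which is smaller than the headline performance gain because the gain sits in the denominator.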

Since the Power5 machines were launched a decade ago, IBM has put three different kinds of machines into its Power Systems line. The low end was made of machines with one, two, or four sockets that competed more or less directly with X86 iron; these machines used to have external chipsets to glue the processors together, but over time these NUMA circuits were migrated onto the chips. The top end of the line was a NUMA cluster of NUMA nodes, with an external chipset (often called a node controller) linking multiple NUMA systems together in a shared memory hierarchy. The Power 595 (Power5 and Power6) and 795 (Power7) machines did this. Between these two were so-called enterprise class machines, which glued multiple NUMA nodes together using the on-chip NUMA. The idea here is to make a machine that scales above four sockets but doesn't cost as much as the machine that scales to 32 sockets, mainly because it doesn't need a lot of the clock and resiliency features that the big, bad box requires. A company that spends millions of dollars on a shared memory machine wants zero downtime.

[Figure: ibm-power8-numa-1]

With the Power E870 and E880 servers, IBM is taking the modular node approach of the enterprise class machines, using four-socket nodes. But it is beefing up the NUMA clustering with a system control unit that is ripped out of the heart of the Power 795 to provide service processors and centralized clocks and oscillators for the NUMA cluster. These new Power8 enterprise-class machines fit into standard racks and do not require the large custom system enclosures that the Power 795 and the System z mainframe use. (Those custom enclosures were probably an intentional choice by IBM, for effect, given how much dough is spent on these systems.)

Here's the block diagram of the inside of the four-socket nodes at the heart of the Power E870 and E880 systems:

[Figure: block diagram of the four-socket node in the Power E870 and E880 (ibm-power8-numa-2)]

As you can see, the SMP crossbar interconnect (which really should be called a NUMA crossbar, but old habits die hard) has links between the four processors on the card that have an aggregate of 76.8 GB/sec of bandwidth and only one hop between the processors. The SMP bus extends out from each processor to remote enclosures over links that run at 25.6 GB/sec. Each socket has eight memory slots, each with 28.8 GB/sec of bandwidth, and two PCI-Express 3.0 x16 peripheral ports hanging off it, each with 15.75 GB/sec of bandwidth.
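
Those per-slot and per-port figures roll up into per-socket and per-node totals, sketched below in Python. The totals are simple multiplications of the numbers quoted above and represent peak bandwidth, not what an application will see. The PCI-Express function derives the 15.75 GB/sec figure from the PCI-Express 3.0 signaling rate (8 GT/sec per lane with 128b/130b encoding) and counts one direction only.

```python
# Peak bandwidth roll-up for one four-socket node, using the article's figures.
SOCKETS_PER_NODE = 4
MEMORY_SLOTS_PER_SOCKET = 8
MEMORY_GB_PER_SEC_PER_SLOT = 28.8
PCIE_PORTS_PER_SOCKET = 2
PCIE_GB_PER_SEC_PER_PORT = 15.75


def pcie3_x16_gb_per_sec() -> float:
    """One-direction peak of a PCI-Express 3.0 x16 port in GB/sec."""
    gigatransfers_per_lane = 8.0  # 8 GT/sec per lane
    lanes = 16
    encoding_efficiency = 128 / 130  # 128b/130b line code
    bits_per_byte = 8
    return gigatransfers_per_lane * lanes * encoding_efficiency / bits_per_byte


memory_per_socket = MEMORY_SLOTS_PER_SOCKET * MEMORY_GB_PER_SEC_PER_SLOT
pcie_per_socket = PCIE_PORTS_PER_SOCKET * PCIE_GB_PER_SEC_PER_PORT
memory_per_node = SOCKETS_PER_NODE * memory_per_socket
pcie_per_node = SOCKETS_PER_NODE * pcie_per_socket

if __name__ == "__main__":
    print(f"PCIe 3.0 x16 derived peak: {pcie3_x16_gb_per_sec():.2f} GB/sec")
    print(f"Memory per socket: {memory_per_socket:.1f} GB/sec")
    print(f"PCIe per socket:   {pcie_per_socket:.1f} GB/sec")
    print(f"Memory per node:   {memory_per_node:.1f} GB/sec")
    print(f"PCIe per node:     {pcie_per_node:.1f} GB/sec")
```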

This is what a schematic diagram of the four-socket node in the Power E870 and E880 looks like:

[Figure: schematic of the four-socket node in the Power E870 and E880 (ibm-power8-numa-3)]

IBM is packing in the memory at the front of the node, followed by the processors and the PCI-Express peripherals. The four power supplies for the unit ride underneath this complex, and the whole enchilada fits into a 5U enclosure with ports for the NUMA interconnects in the back.

This is how the NUMA links line up across a four-node enclosure with 16 sockets:

[Figure: NUMA links across a four-node, 16-socket system (ibm-power8-numa-4)]
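
The effect of a flattened interconnect on hop counts can be sketched with a small graph model in Python. The in-node wiring (every socket one hop from the other three) comes from the description of the node above. The inter-node wiring is an assumption of this sketch: each socket is given a direct link to the socket in the same position in each of the other three nodes. The function names are hypothetical, and the model says nothing about latency or bandwidth.

```python
from collections import deque
from itertools import product

NODES = 4
SOCKETS_PER_NODE = 4


def build_links(nodes: int = NODES, sockets_per_node: int = SOCKETS_PER_NODE):
    """Return an adjacency map keyed by (node, socket) tuples."""
    sockets = list(product(range(nodes), range(sockets_per_node)))
    links = {s: set() for s in sockets}
    for node, socket in sockets:
        # In-node crossbar: direct link to every other socket in the node.
        for other in range(sockets_per_node):
            if other != socket:
                links[(node, socket)].add((node, other))
        # ASSUMPTION: direct link to the same-position socket in every
        # other node. The article's diagram is not reproduced here.
        for other_node in range(nodes):
            if other_node != node:
                links[(node, socket)].add((other_node, socket))
    return links


def hops_from(start, links):
    """Breadth-first search: hop count from start to every socket."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in links[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    return dist


def max_hops(links) -> int:
    """Worst-case hop count between any two sockets."""
    return max(max(hops_from(s, links).values()) for s in links)


if __name__ == "__main__":
    links = build_links()
    print(f"Sockets: {len(links)}, worst-case hops: {max_hops(links)}")
```

Under that assumed wiring, any socket reaches any other in at most two hops: one across nodes and one inside the destination node.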

IBM could introduce a node controller to further extend this system, should customers need it, but we got the distinct impression from Big Blue that while this is technically possible, it is not probable. The more sockets you add, the harder it is to get work out of a system, and unless customers really do need to bring more performance to bear on a single workload or database, it doesn't make much sense to do a science project. IBM is positioning Power8 machines against X86 iron, not Oracle's high-end Sparc M machines. If Oracle starts selling lots of big NUMA machines, you can bet that IBM will retaliate.

This is how the Power E870 and E880 systems stack up against each other:

[Figure: Power E870 and E880 comparison (ibm-power8-numa-5)]

The E870 comes in two flavors and has either one or two nodes plus the system control unit to keep the processor clocks all synchronized. The machines cannot mix clock speeds in a single system, or more precisely, if you try to put in faster processors the system gears them down to the lowest clock speed, so there is no point in doing it. The less powerful E870 machine uses an eight-core single-chip version of the Power8 processor running at 4.02 GHz. The machine can scale up to eight sockets and currently has a maximum of 512 GB per socket of DDR3 memory running at 1.6 GHz, but early next year this will be doubled to 1 TB per socket with a 128 GB custom memory card that IBM is putting the finishing touches on now. Those who want a little more oomph in their Power E870 machines than a 64-core box running at 4 GHz can deliver can get a system using ten-core Power8 chips that run at a slightly faster 4.19 GHz clock speed. It is not clear how much more performance this will deliver, but the core count goes up by 25 percent and the clock speed is up by 4 percent, so it should be significant. The E870 tops out at two nodes. Why, you ask? Because IBM wants to charge a lower price for compute for this machine, or rather, it wants to charge a premium for four-node scalability and bake it into the iron itself.
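
For a rough feel of what the ten-core option buys, the Python sketch below multiplies cores by clock speed for a fully loaded eight-socket E870. This is a naive proxy, not a benchmark: it assumes throughput scales linearly with both core count and clock, which ignores memory bandwidth and NUMA effects. The `effective_clock_ghz` helper models the gear-down behavior described above.

```python
E870_MAX_SOCKETS = 8


def effective_clock_ghz(installed_clocks_ghz) -> float:
    """Mixed processors all run at the slowest installed clock."""
    return min(installed_clocks_ghz)


def naive_throughput(sockets: int, cores_per_socket: int, clock_ghz: float) -> float:
    """Aggregate core-GHz. ASSUMPTION: a linear proxy for performance."""
    return sockets * cores_per_socket * clock_ghz


eight_core = naive_throughput(E870_MAX_SOCKETS, 8, 4.02)
ten_core = naive_throughput(E870_MAX_SOCKETS, 10, 4.19)
ratio = ten_core / eight_core

if __name__ == "__main__":
    print(f"Eight-core box: {eight_core:.1f} core-GHz")
    print(f"Ten-core box:   {ten_core:.1f} core-GHz")
    print(f"Naive uplift:   {ratio - 1:.0%}")
    print(f"Mixed 4.02/4.19 GHz parts run at {effective_clock_ghz([4.02, 4.19])} GHz")
```

Under those assumptions the ten-core box tops out about 30 percent ahead of the eight-core one, which should be read as a ceiling.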

The Power E870 machines will be available on November 18. The machine supports IBM's own AIX 6.1 and 7.1 versions of Unix as well as its proprietary IBM i 7.1 and 7.2 operating systems. Red Hat Enterprise Linux 6.5 and SUSE Linux Enterprise Server 11 SP3 are also available on the machines.

The Power E880, which replaces the Power 795 in IBM's Power Systems lineup, can scale from one to four nodes and will also come in two flavors. One version will use eight-core Power8 chips running at 4.35 GHz, and a four-node setup will have 128 cores and up to 8 TB of main memory across those nodes. A future version scheduled to ship next year will support the full twelve-core Power8 chip running at an as-yet unannounced clock speed. The word on the street is that the fully loaded Power E880 will have about 40 percent more aggregate performance than the top-end Power 795, but IBM has not provided the feeds and speeds yet. IBM will also boost the main memory capacity to a maximum of 4 TB per node, or 16 TB across the system, next year, when it introduces its own 128 GB memory cards for the enterprise-class systems. The initial one-node and two-node E880 configurations using the eight-core Power8 processor will ship on November 18 and will support the same AIX, IBM i, and Linux operating systems. IBM said that the top-end E880 machines with three or four nodes, or using the twelve-core Power8 chip, would be available in 2015 and was not more precise. Given how long Power 795 customers have waited for an upgrade (it has been over four years now), you can bet IBM is working to do it as soon as possible in the new year.
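
The memory ceilings follow from the slot counts, as the Python sketch below shows. The eight slots per socket and four sockets per node come from the node description earlier in the article. The 64 GB size of the current card is inferred from the 512 GB per socket maximum and is not stated by IBM here.

```python
SOCKETS_PER_NODE = 4
SLOTS_PER_SOCKET = 8
CURRENT_CARD_GB = 64   # inferred: 512 GB per socket / 8 slots
FUTURE_CARD_GB = 128   # the custom card IBM says is coming next year


def max_memory_tb(nodes: int, card_gb: int) -> float:
    """Maximum main memory in TB with every slot filled."""
    slots = nodes * SOCKETS_PER_NODE * SLOTS_PER_SOCKET
    return slots * card_gb / 1024


if __name__ == "__main__":
    for name, nodes in (("Power E870", 2), ("Power E880", 4)):
        now = max_memory_tb(nodes, CURRENT_CARD_GB)
        later = max_memory_tb(nodes, FUTURE_CARD_GB)
        print(f"{name}: {now:.0f} TB today, {later:.0f} TB with 128 GB cards")
```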
