IBM Accelerates Power8 Clusters With GPUs, FPGAs, And Flash

Whether by a lucky stroke of timing or by design, only days after Big Blue sold off its System x X86 server business to Lenovo Group for $2.1 billion, the company is coming out swinging with Power8 servers that augment their performance with a variety of adjunct co-processors and flash storage. Ahead of next week’s Enterprise2014 event in Las Vegas, where it will be talking about its increasing focus on Power Systems and System z mainframes, the company is launching a number of systems that are designed to take workloads away from X86 clusters.

As EnterpriseTech has previously reported, IBM has been telling customers to expect larger Power8-based machines with more than two sockets as well as systems that would use field programmable gate arrays (FPGAs). IBM has also been hinting that it and OpenPower partner and GPU coprocessor maker Nvidia would be working together to get a Power8-Tesla hybrid system into the field before the end of the year.

It turns out that IBM is launching three different systems, each tuned for a different kind of workload, that are based on its “scale-out” Power8 systems. By scale-out, IBM means a system that is designed with one or two sockets and is intended to be used in clusters running distributed applications that scale their capacity by adding nodes in a loosely coupled fashion. This is distinct from “scale-up” machines, which more tightly couple server nodes and their main memory together, usually using non-uniform memory access (NUMA) technology, to create what is in essence a single large processor to run fat applications or their databases. Big Blue is also rolling out scale-up versions of its Power8 systems, which it had promised would come this year, ahead of the Enterprise2014 event. So don’t think the Power8 rollout is only about creating a Power8 alternative to the workhorse two-socket server based on Intel’s Xeon E5-2600 processors.
(We will report on these NUMA machines, which are called the Power Enterprise Systems, in a separate story.)

The new Power S824L is a Linux-only version of the existing Power S824 machine that IBM announced back in April. It is a two-socket machine that comes in a 4U chassis with room for a dozen 2.5-inch disk drives and eleven PCI-Express 3.0 peripheral slots. This is far from the skinniest GPU-accelerated server on the market, but IBM is betting that the memory and I/O bandwidth of the Power8 machines will give it a performance advantage over Xeon E5 servers using the same Tesla accelerators.

The Power S824L can be equipped with two different processor options. The first is a pair of ten-core Power8 dual-chip modules (for a total of 20 cores) that run at 3.42 GHz, and the second is a pair of twelve-core dual-chip modules (for a total of 24 cores) that run at 3.02 GHz.

With the scale-out variants of the Power8 processors, IBM has created a dual-chip module that supports 48 PCI-Express lanes per socket, with up to 32 of these lanes able to be configured as Coherent Accelerator Processor Interface (CAPI) ports on the Power8. With CAPI, an accelerator based on GPUs, DSPs, or FPGAs that resides on a PCI card can link into the Power8 processor and memory complex and look like what is in effect a “hollow core” that has the same access to the memory hierarchy as the actual Power8 cores. What this means is that these accelerators do not have to move data back and forth between the CPU and the accelerator; both devices address the same memory space. IBM created a six-core chip with lots of PCI-Express lanes and CAPI ports and then put two of them in a single socket for the scale-out Power8 machines precisely because it wanted to be able to put lots of accelerators on them.
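For readers who want to sanity-check the core and lane counts, the totals can be tallied in a few lines of Python. This is only a back-of-the-envelope sketch: the constant and function names are made up for illustration, and the only inputs are the figures quoted above (two sockets, ten or twelve cores per dual-chip module, 48 PCI-Express lanes per socket with up to 32 configurable for CAPI).

```python
# Back-of-the-envelope tally of the Power S824L figures quoted above.
# All names are illustrative; only the numbers come from the article.

SOCKETS = 2                   # two-socket machine
LANES_PER_SOCKET = 48         # PCI-Express 3.0 lanes per dual-chip module
CAPI_LANES_PER_SOCKET = 32    # upper bound on lanes configurable for CAPI

# Processor options: cores per dual-chip module -> clock speed in GHz
PROCESSOR_OPTIONS = {10: 3.42, 12: 3.02}


def system_totals(cores_per_module):
    """Return whole-system totals for one processor option."""
    return {
        "cores": cores_per_module * SOCKETS,
        "clock_ghz": PROCESSOR_OPTIONS[cores_per_module],
        "pcie_lanes": LANES_PER_SOCKET * SOCKETS,
        "max_capi_lanes": CAPI_LANES_PER_SOCKET * SOCKETS,
    }


for cores in PROCESSOR_OPTIONS:
    print(system_totals(cores))
```

Either way, the lane budget is the same; the choice between the two options is a trade of clock speed against core count.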
The Power8 chip used in the enterprise-class scale-up NUMA machines puts a dozen cores on a single die; it has a larger memory capacity per socket but fewer PCI-Express 3.0 lanes and therefore fewer CAPI ports (32 PCI-Express lanes, 16 of which can be used by CAPI, to be precise). The Power S824L delivers 96 GB/sec of I/O bandwidth per socket, and has six x8 slots available for other peripheral attachment.

The Power S824L system uses IBM’s high-end, custom memory cards, which carry its “Centaur” memory buffer chip and 16 MB of L4 cache memory that sits between the processor and the main memory. The Power8 chips have 512 KB of L2 cache per core and 96 MB of shared eDRAM L3 cache across the cores on the die. (On the scale-out machines, it is 48 MB per die and two dies per socket.) The system supports up to 16 DDR3 memory slots running at 1.6 GHz and delivering 384 GB/sec of aggregate memory bandwidth across the two sockets. The system supports 16 GB, 32 GB, and 64 GB memory sticks and tops out at 1 TB of capacity. (A whopping 128 GB memory card that IBM is making available for the generic Power S824 server, which can run AIX, IBM i, or Linux, is not available on the Power S824L Linux-only system.)

The GPU-enabled Power8 system comes with one Tesla K40 adapter installed in a PCI-Express 3.0 x16 slot with a …
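The memory and I/O figures quoted above hang together under some simple arithmetic, sketched below in Python. The variable names are illustrative, and one input is an assumption rather than something the article states: that the 96 GB/sec I/O figure counts both directions of all 48 lanes at roughly 1 GB/sec per lane per direction (PCI-Express 3.0 signals at 8 GT/sec with 128b/130b encoding, which works out to about 0.985 GB/sec).

```python
# Rough cross-check of the memory and I/O figures quoted above.
# Names are illustrative; assumptions are flagged in the comments.

SOCKETS = 2
MEMORY_SLOTS = 16              # across the whole system
MAX_CARD_GB = 64               # largest card supported on the S824L
AGGREGATE_MEM_BW_GBS = 384     # GB/sec across both sockets
L3_PER_DIE_MB = 48             # scale-out chip
DIES_PER_SOCKET = 2
LANES_PER_SOCKET = 48          # PCI-Express 3.0 lanes per socket

# ASSUMPTION: about 1 GB/sec per lane per direction, both directions counted.
GBS_PER_LANE_PER_DIRECTION = 1
DIRECTIONS = 2

max_memory_gb = MEMORY_SLOTS * MAX_CARD_GB              # 1,024 GB, i.e. 1 TB
mem_bw_per_socket_gbs = AGGREGATE_MEM_BW_GBS / SOCKETS  # 192 GB/sec
l3_per_socket_mb = L3_PER_DIE_MB * DIES_PER_SOCKET      # 96 MB
io_bw_per_socket_gbs = (
    LANES_PER_SOCKET * GBS_PER_LANE_PER_DIRECTION * DIRECTIONS
)                                                       # 96 GB/sec

print(f"Max memory:         {max_memory_gb} GB")
print(f"Memory bandwidth:   {mem_bw_per_socket_gbs:.0f} GB/sec per socket")
print(f"L3 cache:           {l3_per_socket_mb} MB per socket")
print(f"I/O bandwidth:      {io_bw_per_socket_gbs} GB/sec per socket")
```

Under that reading, two 48 MB dies give each scale-out socket the same 96 MB of L3 cache as the single-die scale-up chip, and sixteen 64 GB cards account for the 1 TB ceiling.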