Calxeda Gearing Up To Push ARM Server Clusters To The Extreme
It is important for a startup to not do too many things at once – to develop a roadmap aimed at an emerging market, and to stick to it.
This is precisely what Calxeda, one of the early makers of ARM server chips, is doing. Calxeda has plenty of competition these days, but is not rattled by any of that. It has plans to deliver chips with lots of ARM cores and on-chip networking that can scale to 100,000 server nodes – and possibly even more – in a single fabric. It will take a few processor and interconnect generations to get there, but when Calxeda reaches its goal, it will have created a platform that can run heavy-duty analytics and enterprise applications.
Calxeda is not about to prelaunch any products, but Karl Freund, vice president of marketing at the company, spoke to EnterpriseTech about the goals for "Midway," its upcoming EnergyCore ARM chip, as well as the long-term processor and interconnect roadmap for Calxeda's chips and how this maps to the particular customers that the company is chasing. Or, more precisely, the customers that it hopes system builders will chase with its chips, because Calxeda has no intention of being in the motherboard or server businesses.
As the name suggests, the Midway chip is a step halfway between the current "High Bank" ECX-1000 processor, which had 32-bit processing and memory addressing and was based on the Cortex-A9 core and ARMv7 specification from ARM Holdings, and the future "Lago" chip, which will be based on the ARMv8 specification and will have 64-bit processing and memory addressing. Not every application requires 64-bit processing or addressing – there are plenty of media streaming applications that don't need the extra bits, for example. That means there is a market for the Midway processor.
Calxeda is expected to deliver Midway to its systems partners before the end of the year. The chip taped out at the end of the first quarter, right on time, and the foundry had chips back in Calxeda's hands in early summer.
"With Midway, we are trying to make sure that the whole system is there for private cloud computing," explains Freund. "We have to enable virtualization, with KVM and Xen. And we have to do it in a way that works well on our fabric. We're working as much on the software as we are on the hardware."
The changes to the Fleet Services fabric switch that is embedded on the Calxeda ARM chips will be implemented in an Ethernet driver for Linux, which will be upstreamed to the Linux community and picked up by the Linux distributors who support ARM processors. (At the moment, that is Canonical with Ubuntu Server and Red Hat with its Fedora development release.) There is no Windows Server version for ARM processors – not yet, but a new Microsoft CEO might feel differently about that – so there is no need to worry about how to get this switch functionality into Windows.
Calxeda is also working with Linaro, the Linux-on-ARM project at the Linux Foundation, to ensure that the Linux kernel supports the Large Physical Address Extensions (LPAE), which allow the 32-bit memory addressing in the Cortex-A15 core to be mapped to a 40-bit address space on a server. And at the operating system level, Calxeda is working very closely with Canonical so that Ubuntu Server, which includes an integrated OpenStack cloud controller, works well on its forthcoming Midway processors.
Conceptually, here is how Calxeda sees its target markets growing as it adds more processors to the EnergyCore family:
That first Calxeda chip, the ECX-1000, had four Cortex-A9 cores running at between 1.1 GHz and 1.4 GHz, with an integrated scalar floating point unit and an ARM Neon multimedia engine for SIMD calculations (it has 64-bit and 128-bit registers and can also do floating point math). Each core on the ECX-1000 has 32 KB of L1 data cache and 32 KB of L1 instruction cache, and the four cores on the die share a 4 MB L2 cache. The chip has a DDR3 memory controller that can support regular 1.5 volt or low-power 1.35 volt main memory, and it topped out at 4 GB per socket. The ECX-1000 can drive four PCI-Express 2.0 slots (x8) and one PCI-Express 1.0 slot (x8 as well), and it has a SATA 2.0 controller that can drive five ports at 3 Gb/sec. It has a management controller, which manages the cores and an integrated switch.
The fabric switch implemented on the chip is what makes the Calxeda design special, and it has an 8x8 crossbar between the cores and the outbound Ethernet ports that has 80 Gb/sec of bandwidth. Specifically, it has three 10 Gb/sec channels that are used to link four sockets together on a single system board so they can talk to each other without having to go to a top-of-rack switch. There are another five 10 Gb/sec channels that come off the system board and allow up to 4,096 nodes to be linked to each other in a number of different topologies – again, without external switches. The Fleet Services fabric can be used in a 2D torus, mesh, butterfly tree, and fat tree network configuration. The bandwidth can be dynamically allocated to each virtual port coming off the processor in 1 Gb/sec, 2.5 Gb/sec, 5 Gb/sec, and 10 Gb/sec chunks.
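To make the fabric numbers above concrete, here is a minimal Python sketch of one of the listed topologies, a 2D torus. It is an illustration only, not Calxeda's routing logic: the 64 x 64 grid, the row-major node numbering, and the function names `torus_neighbors` and `torus_hops` are all assumptions chosen so the grid holds the 4,096 nodes the article cites.

```python
# Illustrative sketch only: a 64 x 64 2D torus with row-major node IDs.
# The grid shape and numbering are assumptions, not Calxeda's design.

CHANNELS_ON_BOARD = 3    # 10 Gb/sec links between sockets on one board
CHANNELS_OFF_BOARD = 5   # 10 Gb/sec links leaving the board
CHANNEL_GBPS = 10

# Aggregate link bandwidth, matching the 80 Gb/sec crossbar figure.
TOTAL_GBPS = (CHANNELS_ON_BOARD + CHANNELS_OFF_BOARD) * CHANNEL_GBPS


def torus_neighbors(node, width, height):
    """Return the north, south, west, and east neighbors of a node."""
    row, col = divmod(node, width)
    return [
        ((row - 1) % height) * width + col,  # north, wrapping at the top
        ((row + 1) % height) * width + col,  # south, wrapping at the bottom
        row * width + (col - 1) % width,     # west, wrapping at the left
        row * width + (col + 1) % width,     # east, wrapping at the right
    ]


def torus_hops(a, b, width, height):
    """Return the minimum hop count between two nodes on the torus."""
    row_a, col_a = divmod(a, width)
    row_b, col_b = divmod(b, width)
    d_row = abs(row_a - row_b)
    d_col = abs(col_a - col_b)
    # On each axis, traffic can go either way around the ring.
    return min(d_row, height - d_row) + min(d_col, width - d_col)


if __name__ == "__main__":
    side = 64  # 64 * 64 = 4,096 nodes
    print("aggregate link bandwidth (Gb/sec):", TOTAL_GBPS)
    print("neighbors of node 0:", torus_neighbors(0, side, side))
    print("hops from node 0 to grid center:",
          torus_hops(0, 32 * side + 32, side, side))
```

Because every row and column wraps around, no node sits on an edge, and the worst-case path in this assumed layout is half the ring length on each axis. A mesh, butterfly tree, or fat tree would trade hop count against link usage differently.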
Details on Midway are a bit thin, but Freund says to expect about 50 percent more integer performance at the same frequency as the ECX-1000, which ran at between 1.1 GHz and 1.4 GHz. Calxeda is not saying what process was used to make the Midway chip, but the ECX-1000 was implemented in a 40 nanometer process and is etched by Taiwan Semiconductor Manufacturing Corp. There could be a process shrink to 28 nanometers, but Calxeda has not confirmed this. It would be wise to use Intel's tick-tock method, not changing both the design and the process at the same time, but then again, it all depends on who the foundry is and what kind of a hurry Calxeda is in. We will probably find out in a month at ARM's TechCon conference, when all of the ARM server chip makers are expected to strut their stuff.
Calxeda has also said that floating point performance will double with Midway, and each thread on the chip will be able to support 4 GB of memory, for a total of 16 GB per socket. That should mean that Midway is a four-core chip, just like its predecessor. The performance per watt should be about the same on Midway as on the ECX-1000, which also seems to indicate there is not a process shrink coming. The next iteration of the Fleet Services fabric will have individual power domains for every fabric link, Freund says, plus a number of other energy efficiency tweaks to make the fabric burn less power.
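The memory arithmetic behind that inference can be checked with a short Python sketch. It assumes one thread per core and treats the quoted gigabytes as binary gibibytes; the helper name `addressable_gib` is hypothetical.

```python
# Back-of-the-envelope check of the Midway memory figures.
# Assumptions: one thread per core, and GB treated as GiB (2**30 bytes).

GIB = 2**30


def addressable_gib(address_bits):
    """Return how many GiB can be addressed with the given address width."""
    return 2**address_bits // GIB


MEMORY_PER_THREAD_GIB = 4    # per-thread figure quoted for Midway
MEMORY_PER_SOCKET_GIB = 16   # per-socket figure quoted for Midway

# 16 GB per socket at 4 GB per thread implies four threads, hence four cores.
implied_cores = MEMORY_PER_SOCKET_GIB // MEMORY_PER_THREAD_GIB

# A plain 32-bit address reaches 4 GiB; the 40-bit LPAE space reaches 1 TiB.
fits_in_32_bits = MEMORY_PER_SOCKET_GIB <= addressable_gib(32)
fits_in_40_bits = MEMORY_PER_SOCKET_GIB <= addressable_gib(40)

if __name__ == "__main__":
    print("implied cores:", implied_cores)
    print("32-bit reach (GiB):", addressable_gib(32), "fits:", fits_in_32_bits)
    print("40-bit reach (GiB):", addressable_gib(40), "fits:", fits_in_40_bits)
```

The 16 GB socket exceeds what a flat 32-bit address can reach, which is why LPAE support in the Linux kernel matters for this chip even though each thread still sees a 32-bit space.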
Calxeda has much bigger plans beyond Midway, so don't think it is just sitting by as Advanced Micro Devices, Applied Micro, Marvell, Cavium, and possibly Samsung all chase the ARM server bandwagon and try to hop on. Here is another roadmap that Calxeda is showing to partners and customers that shows how its future EnergyCore chips will evolve and the target markets for them:
The Lago EnergyCore processor is where things will get interesting in the extreme scale zone where EnterpriseTech lives, and it is also the chip that Calxeda expects to help it move from public and private clouds into real-time analytics and massively parallel applications and databases.
The Lago chip will have more cores on the die – how many, Freund is not saying – and also have faster single-thread performance. (The chart above says 2X performance, but that is vague, and intentionally so.) The 64-bit Neon floating point unit will be added to the 64-bit ARMv8 cores used to make the Lago chip, and interestingly a third generation of the Fleet Services fabric switch will be able to scale to more than 100,000 nodes. That is a very large number of top-of-rack switches that can be eliminated from a cluster. Provided the Linux distribution is good, applications are coded for or ported to the Lago chips, and some server maker supports such a configuration, we could see some very powerful clusters running on this chip.
Of course, if Intel or AMD put their own distributed switch fabrics down on their chips in the next year or so, then Calxeda could lose such a big and interesting advantage.
Calxeda is not saying much about its future "Ratamosa" and "Navarro" chips, except to put them on the roadmap. Ratamosa will be aimed at HPC and enterprise applications, and will continue the push for more performance. Navarro is merely explained as being part of the "enterprise server era," whatever that means. If Calxeda can keep an annual cadence, then Lago should be out around the end of 2014 and Ratamosa should come out towards the end of 2015. By then, Microsoft might even have Windows Server ported to ARM chips. It is a big might.
Not Interested In Being A Server Maker
What Calxeda is not going to do this time around with Midway and with future processors is get bogged down in making system boards that server makers buy to create production products. Calxeda will make development cards as it has been doing for its first generation chips, but it is warning server makers that these cards are not warrantied for production use.
Freund says that Calxeda has over a dozen different Midway motherboard designs in flight, with about half of them coming to market in products in the middle of next year. The important thing is that these are all cost-optimized motherboard designs for very precise systems and workloads, unlike the original quad-node EnergyCard boards, which were only intended for software development.
One of the original design manufacturers (ODMs) that Calxeda is working with is building a storage array to run the open source Ceph network file system, which is compatible with Amazon's S3 and OpenStack's Swift object storage. Ceph also has a block storage overlay called the Rados Block Device that makes it very useful for applications, such as those driven by databases, that require block rather than file or object access. These Ceph file systems typically do a lot of data replication, keeping three copies of data, so if one node in the clustered file system goes down, the file system can just re-route requests to the other two nodes and make a new third copy as it keeps working. The Fleet Services fabric is being optimized for this east-west traffic between the Ceph nodes while also providing acceptable performance out to the servers requesting data.
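The replication behavior described above can be illustrated with a toy Python model. This is not Ceph's placement algorithm or API; the random placement, the node names, and the functions `place` and `fail_node` are all hypothetical, and the point is only to show where the node-to-node (east-west) copy traffic comes from.

```python
# Toy model of three-way replication and recovery after a node failure.
# This does not reproduce Ceph's real placement logic; names are hypothetical.
import random

REPLICAS = 3


def place(objects, nodes, rng):
    """Assign each object to REPLICAS distinct nodes chosen at random."""
    return {obj: set(rng.sample(nodes, REPLICAS)) for obj in objects}


def fail_node(placement, failed, nodes, rng):
    """Drop a failed node and re-create each lost copy on another node.

    Returns the number of copies made, each of which is a transfer
    between two surviving storage nodes.
    """
    survivors = [n for n in nodes if n != failed]
    copies_made = 0
    for holders in placement.values():
        if failed in holders:
            holders.discard(failed)
            # Pick a surviving node that does not already hold this object.
            candidates = [n for n in survivors if n not in holders]
            holders.add(rng.choice(candidates))
            copies_made += 1
    return copies_made


if __name__ == "__main__":
    rng = random.Random(0)
    nodes = ["node%d" % i for i in range(8)]
    objects = ["obj%d" % i for i in range(1000)]
    placement = place(objects, nodes, rng)
    copies = fail_node(placement, "node3", nodes, rng)
    print("copies re-created between surviving nodes:", copies)
```

In this model, losing one of eight nodes forces a re-copy of roughly three-eighths of the objects, all of it traffic between storage nodes rather than to clients, which is the pattern the fabric is being tuned for.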
Another design win involves a company creating a streaming media server, and they have a different fabric design, one that is aimed at getting as many pipes as possible all pointing in the same direction.
"We are finally starting to see people understanding how to exploit fabrics in very specific applications and designing them to be optimized for those workloads," says Freund. "Many of these workloads work fine on 32 bits, but they could use a little more performance, so along comes Midway."
It is not just about having an ARM processor, and Calxeda has understood that from the beginning, way back when it was founded in January 2008 and was operating in stealth mode under the name Smooth-Stone. That is a reference to the rock that David picked up and put in his sling to kill Goliath, by the way.
Of course, Calxeda has lots of competition in the low-power processor market these days, even though it got an early jump on ARM and X86 alternatives. Calxeda was first out the door with a server-class ARM chip when it debuted the ECX-1000 in November 2011. This was the first processor in the "Redstone" Moonshot systems from Hewlett-Packard, which were a development platform to test the idea of using energy efficient processors to pack a lot of small server nodes into a rack.
This year, HP has launched its "Gemini" Moonshot systems, which are the commercial products, not a development platform. Being a commercial product, the Moonshot chassis will have a variety of Atom, Xeon, Opteron, and ARM processors as well as GPU, FPGA, and DSP accelerators. A combination ARM-DSP processor called KeyStone-II from Texas Instruments was sighted in the field at the International Supercomputing event in June, in fact, and so was a Calxeda node with four independent sockets. It was not clear if that Calxeda Moonshot card we saw at ISC 2013 was packing the Midway processor, but it could have been.
HP calls the server nodes inside the Moonshot 1500 enclosure "cartridges," and it can put 45 such nodes into that enclosure, which is 4.3 rack units high. Assuming you use a rack that is a little taller than standard, you can in theory get 450 cartridges into a single rack, and with the four-node cartridges, that comes to 1,800 nodes in a rack. That is a lot of nodes.
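The rack math works out as follows. This Python sketch uses only the figures quoted above, plus the assumption that a standard rack is 42U tall; the function name `rack_density` is hypothetical. Heights are kept in tenths of a rack unit so the arithmetic stays exact.

```python
# Rack density arithmetic for the Moonshot 1500 figures quoted in the article.
# Assumption: a "standard" rack is 42U tall.

CARTRIDGES_PER_ENCLOSURE = 45
NODES_PER_CARTRIDGE = 4            # the four-node Calxeda cartridge
ENCLOSURE_HEIGHT_TENTHS_U = 43     # 4.3U, stored in tenths to avoid float error
STANDARD_RACK_TENTHS_U = 420       # 42U


def rack_density(enclosures):
    """Return (cartridges, nodes, height in U) for a stack of enclosures."""
    cartridges = enclosures * CARTRIDGES_PER_ENCLOSURE
    nodes = cartridges * NODES_PER_CARTRIDGE
    height_u = enclosures * ENCLOSURE_HEIGHT_TENTHS_U / 10
    return cartridges, nodes, height_u


if __name__ == "__main__":
    # Ten enclosures give the 450 cartridges and 1,800 nodes cited.
    print("ten enclosures:", rack_density(10))
    # How many enclosures fit in an assumed 42U rack.
    fit = STANDARD_RACK_TENTHS_U // ENCLOSURE_HEIGHT_TENTHS_U
    print("enclosures in 42U:", fit, "->", rack_density(fit))
```

Ten enclosures need 43U, one unit more than the assumed 42U rack, which is why the 1,800-node figure depends on a slightly taller rack. At 42U, nine enclosures fit, for 405 cartridges and 1,620 nodes.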