
ARM Brings More Cores To The Datacenter War 

Back in May, ARM Holdings, the chip design and licensing company behind the ARM processor collective that underpins smartphones, tablets, and other consumer devices, and that is assaulting the hegemony of the X86 processor in the datacenter, hinted that it would be coming out with a more scalable on-chip interconnect that would boost the processing capacity of devices based on its cores.

This time around, with the CoreLink Cache Coherent Network (CCN) interconnect, ARM is making another big jump, allowing companies that license its cores and interconnect technology to scale a single system-on-chip to a whopping 48 cores. This capability, which will come with the CCN-512 variant of the interconnect, puts the ARM chip core count well above that of Intel's Xeon family, though still shy of its Xeon Phi. (You would need a hypothetical CCN-515 to get close to that.) At the same time, ARM is scaling down the interconnect with the CCN-502, a very high-speed, small-footprint variant that is suitable for those who want to make zippy four-core ARM chip designs.
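
To make this naming arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python (ours, not anything ARM publishes) that assumes the trailing digits of a CCN-5xx part name give its maximum cluster count and that each cluster carries four Cortex cores; the CCN-502, as described below, breaks that rule and handles four clusters:

    # Back-of-the-envelope core-count math, assuming the trailing digits of a
    # CCN-5xx part name give its maximum cluster count and each cluster holds
    # four Cortex-A53/A57 cores. The CCN-502 is the exception noted below.
    CORES_PER_CLUSTER = 4

    def max_cores(part: str) -> int:
        clusters = int(part.split("-5")[1])  # "CCN-512" -> 12 clusters
        if clusters == 2:                    # CCN-502 actually drives four clusters
            clusters = 4
        return clusters * CORES_PER_CLUSTER

    for part in ("CCN-502", "CCN-504", "CCN-512", "CCN-515"):
        print(part, max_cores(part))  # 16, 16, 48, 60 (CCN-515 is hypothetical)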

The CCN-502 interconnect, which by its name should only support two clusters, actually supports up to four clusters of ARM cores, with each cluster having four cores, yielding a maximum of sixteen cores on a single die. That is the same as the CCN-504, but the CCN-502 has optimizations that let it take up 70 percent less area, with 1 MB of cache shared across the cores, compared to the CCN-504 announced earlier this year. This low-end interconnect is important to pair with the Cortex-A53 core, which burns 75 milliwatts per core running at 1 GHz using Taiwan Semiconductor Manufacturing's 16 nanometer FinFET 3D transistor process, for certain workloads.
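
For a rough sense of scale, here is a quick bit of arithmetic using only the 75 milliwatt figure quoted above; it counts the cores alone, not the CCN-502 interconnect, cache, memory controllers, or I/O, all of which add to the power budget:

    # Core power only, from the 75 mW per Cortex-A53 core at 1 GHz figure above.
    # The interconnect, L3 cache, memory controllers, and I/O are all extra.
    MW_PER_CORE = 75
    cores = 4 * 4  # four clusters of four cores on the CCN-502
    print(f"{cores} cores burn ~ {cores * MW_PER_CORE / 1000:.1f} W")  # ~1.2 W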

[Image: ARM CoreLink interconnect stack]

The CCN-512, as the name suggests, has a dozen of these four-core clusters for a maximum of 48 cores. Here is the block diagram for an ARM chip that uses this interconnect with all of the possible features hanging off it:

[Image: Block diagram of an ARM SoC built around the CCN-512 interconnect]

All of the recent CCN interconnects are designed to be used in conjunction with the 64-bit Cortex-A53 and Cortex-A57 cores. Not all 64-bit ARM chip makers use these interconnects or the ARM Holdings designs, of course. Full licensees of the ARM architecture, such as Applied Micro and AMD, can design their own cores and interconnects so long as they maintain instruction set compatibility. Applied Micro has done its own core and interconnect designs from day one with its X-Gene family of chips, while AMD is shifting from using the Cortex-A57 cores in the "Seattle" and "SkyBridge" chips to a homegrown core – and possibly on-chip interconnect and off-chip glue chips for NUMA or otherwise coherent designs – with its future "K12" chip due in 2016. Cavium, also a full ARM architecture licensee, plans to have as many as 48 custom cores on a chip using its own Cavium Coherent Processor Interconnect (CCPI) in its future ThunderX ARM server chips. Cavium is also doing two-socket NUMA configurations with the initial ThunderX processors.

At the moment, ARM has not created its own NUMA or other coherent extensions to the CCN interconnect, but back in May the company hinted to EnterpriseTech that it was investigating such possibilities and that doing so would probably entail extending the Advanced Microcontroller Bus Architecture (AMBA) bus that has been part of the ARM ecosystem since 1996. The current ARM chips support the AMBA 5 Coherent Hub Interface (CHI) specification, and the CCN interconnects are an implementation of that spec, although not the only one. It is this interconnect that allows other types of accelerators, such as DSPs, FPGAs, GPUs, cryptographic engines, and packet processors, to be mixed with the ARM cores on a single die and, possibly at some time in the future, outside of the die.

Ian Forsythe, product marketing manager at ARM, tells EnterpriseTech that ARM is still exploring its options for interconnects that would glue multiple system-on-chip components together, including NUMA and other kinds of coherent extensions. But at the moment, it is relying on partners to do this as their own value add. If there is such a development, Forsythe says that it will be done by extending the AMBA 5 CHI off the die.

But that is the future. Right now, what ARM chip buyers want is a set of devices that can scale from wired and wireless edge networks to the core datacenter networks, all of which are being overloaded by data flying in and out of our various personal and work devices as we conduct business and our lives. "Companies that make devices at the edge of the network have gotten very efficient about how to do that, but this data has to traverse the whole of the network," explains Forsythe. "People are looking for a scalable solution, and a single architecture style that they can then implement at the edge of the network and into the core of the network. And we both know that the core of the network is becoming more and more like servers."

In essence, companies want competition between X86 processors, which are moving from servers down into network and storage devices, and ARM processors, which are moving up from embedded and consumer uses to the core network and servers. As more of the servers, storage, and switching gets defined in software, companies will probably pick one architecture over the other for the long haul. At the moment, it is a toss-up which way people will go between X86 and ARM. We need server-class, 64-bit ARM chips in the field before we can tell, but thus far Intel has done a masterful job of taking advantage of market opportunities outside of servers with its Xeon and Atom processors.

[Image: Comparison of the CoreLink CCN interconnect family]

The ARM collective is bringing its own resources to bear, however, with a flexible architecture that allows the mixing and matching of different kinds of ARM cores with various kinds of accelerators and coprocessors. This mixing and matching is key to supporting a diverse set of workloads. Control plane processing in network devices needs efficient, brawny cores, and so does MAC scheduling, which has the added requirement of low latency. On the data plane of network devices, chips with many smaller cores that deliver lots of I/O, more deterministic performance, and bigtime throughput at higher efficiency are the key. The same is true for various storage and serving devices.
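
Purely as an illustration of that workload-to-core mapping (the attribute names here are our own shorthand, not ARM terminology), the decision roughly boils down to this:

    # Illustrative only: a rough mapping of the workload traits described above
    # onto the two 64-bit cores the CCN family is designed to pair with.
    def pick_cluster(latency_sensitive: bool, single_thread_heavy: bool) -> str:
        if latency_sensitive or single_thread_heavy:
            return "Cortex-A57"  # brawny cores: control plane, MAC scheduling
        return "Cortex-A53"      # many small cores: data plane throughput per watt

    print(pick_cluster(True, True))    # control plane -> Cortex-A57
    print(pick_cluster(False, False))  # data plane    -> Cortex-A53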

The high end of the ARM architecture is interesting for EnterpriseTech readers because this is where ARM will take on the Xeon E5 chips, which are the workhorses of the datacenter these days. (Although for some distributed workloads, the Xeon E3s are finding uses with software that is priced based on core counts, craves high clock speeds, and is very parallel in nature.)

The CCN-512 allows for some truly brawny processors by any measure, sporting up to 48 cores, up to 100 GB/sec of memory bandwidth using DDR4 memory clocking at 3.2 GHz, and up to 32 MB of L3 cache spread across those cores. This setup has four DMC-520 memory controllers, which also support DDR3 memory if system makers want the older and cheaper stuff that is a little less efficient and zippy. The ARM designs have parity protection on data moving around the interconnect and ECC error scrubbing on the RAMs.
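
That 100 GB/sec figure lines up with the simple math below, assuming each of the four memory controllers drives a standard 64-bit DDR4 channel at a 3.2 GT/sec data rate (ECC sidebands and real-world efficiency ignored); the same sketch shows how thinly 32 MB of L3 spreads across 48 cores:

    # Sanity check on the quoted memory bandwidth, assuming four 64-bit (8-byte)
    # DDR4 channels at a 3.2 GT/sec data rate; ECC bits and efficiency ignored.
    channels, bytes_per_transfer, gt_per_sec = 4, 8, 3.2
    peak = channels * bytes_per_transfer * gt_per_sec
    print(f"peak DRAM bandwidth ~ {peak:.1f} GB/sec")      # ~102.4 GB/sec

    l3_mb, cores = 32, 48
    print(f"L3 per core ~ {l3_mb * 1024 / cores:.0f} KB")  # ~683 KB per core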

The maximum sustained – and measured – on-chip bandwidth is up to 1.8 Tb/sec for the I/O subsystems; the aggregate bandwidth of all of the ports on the chip using the CCN-512 interconnect is higher than this, but Forsythe did not have the numbers handy at press time. The point is, ARM measures it with workloads rather than just multiplying out everything. The SoC designs using the CCN family of interconnects also have dynamic voltage and frequency scaling and CPU shutdown down to a core, with partial and full L3 cache shutdown. The chip design also includes retention modes so that data in the cache is preserved when the cores are all turned off.

The new CCN interconnects will be available in products sometime next year, and it will be interesting to see if they get used to build brawny ARM server chips. The capability is there; now someone has to use it.
