Advanced Computing in the Age of AI | Friday, April 19, 2024

Is Amazon Making Its Own ARM Server Chips? 

When computing is all that you do, at a certain scale it makes sense to control all elements of the IT stack, from the operating system kernel all the way to the wall of the datacenter. None of the big hyperscale datacenter operators and cloud providers do so yet, but as they grow and need to wring even more efficiencies from their infrastructure, the need to differentiate at the chip level is growing. That is one reason, perhaps, that cloud giant Amazon Web Services could be looking to build its own ARM processors.

The circumstantial evidence that AWS is looking to make its own ARM chips is pretty strong, as it was a few years back at Google when the search engine giant snapped up a bunch of experts from the PowerPC realm. As far as anyone knows, Google has not created its own Power chip variants and had them etched by one of the several big remaining foundries, but the company is heading up the OpenPower Foundation and it has shown the world its own two-socket Power8 motherboard.

The consensus among the hyperscale players, when you can even get them to talk in generalities, is that they like to have two sources for silicon and, if possible, three. The X86 chips from Intel and, to a lesser extent, AMD dominate the datacenters of Google, Amazon, Facebook, Yahoo, eBay, Twitter, and their peers, but that is as much out of necessity as it is by choice. Up until now, IBM has kept a tight rein on its Power processors and the ARM collective has not fielded a 64-bit processor that is ready to ship in volume with a production-grade software stack. While ARM suppliers are getting closer, despite the demise of upstart Calxeda back in December, no one expects a ramp of products until early 2015.

Amazon, like other hyperscale players, doesn't have that kind of patience and it can afford to do its own chip, systems, and datacenter development if that suits its needs better than waiting for a commercial product to be engineered and manufactured. This could be what is behind a bunch of recent hires by Amazon Web Services from the former Calxeda, which first came to light at GigaOm this week.

The scale of the investment to make an ARM processor aimed at servers is not trivial, as the rise and fall of Calxeda shows, but having a captive customer for the product is something that Calxeda did not have but which Amazon most certainly has.

Amazon, which is particularly secretive about everything it does when it comes to infrastructure, refused to comment on the talk about the company creating its own ARM chips.

But the idea stands to a certain amount of reason. Amazon used to buy its servers from Rackable Systems and then SGI after it bought that venerable supercomputer maker and took its name. Since then, like its peers, Amazon has designed its own servers and storage, farming out the manufacturing to unknown third parties who rack and stack the gear and deliver it in what we presume is a nearly steady state to the company's datacenters. Server chips, as Intel's financials show full well, are still a profit center for Intel, and that means they are probably a place where hyperscale datacenter operators think they may cut some costs.

Moreover, Amazon completely controls its software stack, which includes variants of Linux and the Xen hypervisor, both of which have been ported to work with the ARM architecture in recent years. Microsoft is not yet delivering a version of Windows Server for ARM chips – mainly because there are none on the market that are ready – but it is logical that Microsoft should keep its eye on ARM chips for its own use on its Azure cloud should they turn out to offer a competitive advantage above and beyond the chip malleability and multi-sourcing that ARM Holdings has built into its business model as it licenses ARM instruction sets and whole core and system-on-chip designs. All of the big cloud providers are spending billions of dollars a year on infrastructure, and at that scale tuning an architecture specifically for the job does end up saving money that would otherwise be spent on space, power, and cooling.

Because ARM is a group effort, it is less costly to make variations, and this is one of its key advantages. Andrew Feldman, who is in charge of the Data Center Server Solutions business unit as well as the Server CPU business unit at AMD, put some numbers on this at the ARM TechCon 2013 conference last October.

"I think one of the great advantages of ARM is that it scales down the process of building a CPU," Feldman explained. "It makes it a tractable problem. In the end, we're going to talk about the advantages of power. We're going to talk about the advantages of space. But at the end of the day, the fundamental advantage that ARM brings is this: if I build an X86 CPU today, it takes me four years and $400 million, whereas we can do an ARM server CPU in eighteen months for $40 million. And that's a tough force to beat."

This is precisely the kind of force that the hyperscale operators like to play with. We have to assume that these companies already have the best volume pricing available from Intel and AMD and we know that they can get their own instructions and features added to the chips as well as their own unique mix of clock speeds, voltages, and thermal envelopes. But operating at the megascale – with millions of machines in their fleets – this may not be enough.

What we know for sure is that Amazon Web Services this week posted a job opening for a CPU and System Architect and that if you troll around on LinkedIn, you can see that some of the top people from Calxeda are now getting their paychecks from AWS. David Borland, who was one of the co-founders of Calxeda and vice president of hardware engineering, is now director of silicon optimizations at AWS, according to his LinkedIn profile. Borland did stints in various jobs at Marvell, Intel, and AMD before that. Mark Davis, who was chief architect and CTO at Calxeda, is now principal engineer for silicon optimization at the cloud giant. Davis was a senior kernel engineer at supercomputer maker Convex Computer back in the day, and spent five years designing NUMA machines at Newisys. Danny Marquette, who was director of system-on-chip engineering at Calxeda, is now manager of hardware engineering for silicon optimizations at AWS. Marquette was a chip designer at Motorola before that, working on 68K series chips, and then at Analog Devices on various digital signal processor (DSP) chips. Tom Volpe, who was the leader of the first and second generations of fabric interconnects that were part of the EXC line of ARM chips from Calxeda, is now hardware development manager for silicon optimization at AWS; he also worked previously at Analog Devices and Motorola. Bianca Nagy is now a hardware design engineer for AWS, and at Calxeda she not only designed the company's own EnergyCard reference boards but also helped customers to create their own system boards for the EXC-1000 processors. Prior to Calxeda, Nagy was a system board designer at Dell, working on server and storage products.

One key Calxeda employee who apparently has not jumped to AWS is Prashant Chandra, who was a server architect in charge of Calxeda's third generation of fabric interconnect, which was intended to scale above 100,000 nodes in a single fabric. According to The Register, Chandra has taken a job at Google, but thus far he has not updated his LinkedIn profile to show a job change. Chandra only joined Calxeda in May 2012, and prior to that was the chief architect of Intel's Light Peak server disaggregation interconnect, which was eventually moved over to the Open Compute Project to be known as the RackScale Architecture. Chandra also created the Thunderbolt interconnect for PCs and worked on C compilers and networking products at Intel before that.

The distributed, on-chip Layer 2 interconnect on the EXC processors probably cost on the order of tens of millions of dollars to develop, and it is a bit surprising that Calxeda has not been able to license this technology as yet with so many chip makers and hyperscale customers looking for a way to simplify their infrastructure and cut some costs. The initial Fleet Services fabric from Calxeda was designed to scale up to 4,096 nodes in a mesh or torus interconnect without the need for top-of-rack switches, but as far as we know, no one was pushing those limits in any trials with the EXC-1000 chips. The second generation Fleet Services fabric had some performance improvements, but the third generation was the big step, scaling up to over 100,000 nodes in a single fabric using those embedded L2 switches.

While the forthcoming "Seattle" ARM chips from AMD and "X-Gene" chips from Applied Micro will have onboard Ethernet controllers, this is not necessarily as useful as an on-chip distributed Layer 2 switch. This approach does, however, have the virtue of being easier. AMD has pushed out the integration of its SeaMicro Freedom fabric to its next generation ARM chip, and Applied Micro is relying on its X-Weave gearbox connectivity to link multiple X-Gene server nodes together. Other 64-bit ARM server chip makers have not talked about their plans as yet, and ARM Holdings does not have plans at this time to provide chip-to-chip connectivity across ARM SoCs.

Amazon has experience with ARM chips already, of course. ARM processors from Texas Instruments were used in its early Kindle tablets, and now it is using processors from Qualcomm in the zippiest of the Kindles. Amazon does not design its own processors, but obviously leans on its vendors to get customizations.

The thing to remember is that just because Amazon has hired a bunch of server experts from Calxeda does not mean it is going to design its own processor, no matter how much we might want it to do so. It is equally possible that AWS needs some experts to help it sort through the 64-bit ARM chip options and get the modifications it wants for its own systems. That said, we would be the first to agree that Amazon making its own server chips – no matter what the architecture – would be a fun thing, indeed.

EnterpriseAI