Advanced Computing in the Age of AI | Friday, March 29, 2024

Benefit From Microsoft’s Open Compute Hyperscale Designs 

All large companies, whether they are hyperscale datacenter operators or cloud providers or they are enterprises supporting a diverse mix of workloads, wrestle with the polar opposites of creating unique systems to run particular applications and offering standard systems that can run most applications well enough. Standardization makes purchasing and support easier, while buying a specifically configured machine can result in better performance and bang for the buck.

There is a tricky set of constraints that has to be weighed, and as the datacenter scales, standardization and automation become increasingly important. Something that is a pain in the neck with a hundred servers becomes impossible to bear with a million machines.

Microsoft has learned much about these issues as it has become both a public cloud provider and a hyperscale datacenter operator, and rather than hoarding that knowledge, the company open sourced its server designs back in January and contributed them to the Open Compute Project. And this week, it is letting loose its second generation of homegrown servers through the OCP and has gone the next step and fostered a supply chain and manufacturing base of these machines. This not only helps Microsoft, which wants competitors bidding to build its systems, but anyone else who wants to use minimalist server designs in their datacenters. This was, in fact, the whole point of the Open Compute Project that Facebook started three and a half years ago. The idea is to have many people and organizations hack servers, storage, and networking much as open source software projects have done with operating systems, middleware, and other software and to share the results of that work.

At the Open Compute EU Summit, which is being hosted this week in Paris, Microsoft is bragging a little bit about the size of its cloud and how it has designed its Open Cloud Server (OCS) v2 machines to walk that fine line between providing low-cost, efficient systems and being able to support diverse workloads with different configurations. But the most important thing is that a number of key original design manufacturers (ODMs) are ready to sell variants of the new Open Cloud Server v2 machines, so others can mimic Microsoft and benefit from its engineering work.

The amount of investment that Microsoft has made is staggering. In a blog post, Kushagra Vaid, general manager of server engineering for the Global Foundation Services unit at Microsoft, which runs its datacenters, said that Microsoft has invested more than $15 billion in its global cloud infrastructure, which literally spans the globe. Microsoft supports more than 200 different cloud services that are used by more than 1 billion users and 20 million businesses across more than 90 countries. Earlier this year, Microsoft had more than 8.6 trillion objects stored in the Azure storage cloud and more than 250 million users of its SkyDrive storage service. It had more than 400 million active Outlook.com accounts, processed more than 5.8 billion queries a month on its Bing search engine, and was handling more than 50 billion minutes of connections on Skype every day. Microsoft has never said how much of its capacity is dedicated to its infrastructure and platform services on the Azure public cloud, but it is very likely still a very small portion of the overall infrastructure.

"Over the last six months, the OCS v2 design has been thoroughly tested in our own datacenters, from powering IaaS and PaaS services in Windows Azure, to hosting e-mail and collaboration services in Office 365, to hosting latency-sensitive gaming services in Xbox Live," Vaid explained. "Converging onto a unified, flexible design allows us to optimize the economics of our supply chain, while delivering a diverse array of cloud services from one underlying server platform."

The basic shape of the Open Cloud Server v2 has not changed from the v1 spec that Microsoft announced back in January. The Open Cloud Server chassis is 12U high and has two dozen half-width server or storage trays that slide into the enclosure. Each tray is 1U high, which means it can make use of standard memory, disk, and network components that are designed for normal 1U rack-mounted servers. Because price is important, Microsoft's initial Open Cloud Servers used 3.5-inch SATA disk drives instead of smaller, faster, skinnier, and more expensive 2.5-inch SAS drives. This is consistent with what other hyperscale datacenter operators do. The server trays have four drives and a single two-socket server motherboard, while the storage trays have room for ten drives. Both plug into a passive signal backplane that links the nodes to each other and to the storage and Ethernet modules at the back of the chassis, and that also provides the 12 volt power for each tray. Rather than have a power supply and set of fans for each node, as is common in rack servers, the power supplies and fans are both larger and shared across the chassis, as is commonly done in blade and converged platforms.

Microsoft says the Open Cloud Server is about 40 percent less expensive than prior machines it used (presumably including both capital and operational expenses, which go down because the machines are made to be easily maintained) and provides about 15 percent better power efficiency, too.

With the Open Cloud Server v2 update, Microsoft is shifting to the new "Haswell" Xeon E5-2600 v3 processors from Intel, and significantly the server nodes can support up to 14 cores but cannot support the top-end Haswell Xeon E5s that have sixteen or eighteen cores. The updated system also supports 40 Gb/sec Ethernet as an option to the 10 Gb/sec Ethernet that was in the original chassis. Importantly, the network interfaces in the chassis also support the updated implementation of Remote Direct Memory Access over Converged Ethernet, or RoCE for short. Microsoft has adopted InfiniBand and its low-latency RDMA for selected workloads in its cloud, and RoCE is the implementation that works over Ethernet. (The network interfaces on the nodes support RoCEv2, which was just announced a few weeks ago with some tweaks to improve its performance.) Simply put, RDMA allows nodes in a cluster to bypass the operating system and network driver stack to place data directly into each other's memory for processing, and on parallel clusters, this radically cuts down on the network overhead and therefore boosts the throughput of the machines.

With the v2 rendition of Microsoft's iron, the company is also talking about how it supports a wider variety of add-on cards in mezzanine slots, including the FPGA accelerators that it created to juice portions of its Bing search engine page ranking algorithms. For the first time, Microsoft is also supporting M.2 flash memory modules, which are being increasingly used on blade and hyperscale servers for loading operating systems, application software, and small datasets. The new Open Cloud Server v2 server nodes support larger memory capacities, ranging from 128 GB to 256 GB, within their thermal profile, and the chassis also includes a 1,600 watt power supply that has what is called a high hold-up time of 20 milliseconds. What this means is that the power supply can lose input power for as long as 20 milliseconds and ride it out without losing power to the server and storage nodes it is feeding in the chassis. Finally, the Open Cloud Server v2 chassis supports different power standards that match the voltage and amperage standards in various countries around the globe.
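To put the 20 millisecond figure in perspective, here is a rough back-of-the-envelope sketch of how much energy a supply must hold in reserve to ride through an input dropout. It assumes the supply is running at its full 1,600 watt rating and ignores conversion losses; the function name is made up for illustration and the calculation is not taken from the Microsoft spec.

```python
# Rough estimate only: energy = load power x hold-up time.
# Assumes the supply is at its full rated load and ignores conversion losses.

def ride_through_energy_joules(load_watts: int, holdup_ms: int) -> float:
    """Energy (in joules) needed to keep `load_watts` fed for `holdup_ms` milliseconds."""
    return load_watts * holdup_ms / 1000


if __name__ == "__main__":
    # Figures from the article: 1,600 watt supply, 20 millisecond hold-up time.
    energy = ride_through_energy_joules(load_watts=1600, holdup_ms=20)
    print(f"Stored energy needed at full load: {energy:.0f} J")
```

At lighter loads the same stored energy would stretch over a proportionally longer dropout, which is why hold-up figures are usually quoted at rated load.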

Quanta's Microsoft-inspired Open Cloud Server v2 trays

While the nitty gritty details of the Microsoft server designs are important to the engineers who are picking systems to run workloads, the fact that a number of key suppliers of Open Compute machines are ready to make and ship them now is important to the top brass. Quanta QCT, Wiwynn, ZT Systems, and Hyve Solutions are lined up to take orders. Hewlett-Packard has built actual Open Cloud Servers for Microsoft, as EnterpriseTech has previously reported, but has not yet formally productized them. Dell's Data Center Solutions custom server unit was also showing off its variant of the original Open Cloud Server earlier this year, too.

The Quanta machines are known as the Rackgo M series, and the MC510 compute tray supports sixteen DDR4 memory sticks running at 2.13 GHz and, interestingly, swaps out the 3.5-inch drives in the original Microsoft design for four hot-plug 2.5-inch disk drives that mount in the front of the enclosure and four 2.5-inch solid state drives that are fixed to the inside of the tray. The MS100 storage blade in the Rackgo M series supports ten 3.5-inch disk drives in 6 TB capacities, for a total of 60 TB per tray and a maximum of 1.44 PB in a 12U enclosure that is configured solely as storage. Quanta has options for SAS drives and SSDs in the 3.5-inch form factor.
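The capacity figures above follow directly from the chassis layout described earlier: two dozen trays per 12U enclosure, ten drives per storage tray, and 6 TB per drive. A minimal sketch of that arithmetic, using decimal units (1 PB = 1,000 TB) and constant and function names chosen here purely for illustration:

```python
# Capacity arithmetic for an all-storage Open Cloud Server chassis.
# Tray and drive counts come from the article; names are illustrative only.

TRAYS_PER_CHASSIS = 24        # two dozen half-width 1U trays in a 12U chassis
DRIVES_PER_STORAGE_TRAY = 10  # 3.5-inch drives per storage tray
DRIVE_CAPACITY_TB = 6         # capacity per drive, decimal terabytes


def tray_capacity_tb() -> int:
    """Raw capacity of one storage tray in TB."""
    return DRIVES_PER_STORAGE_TRAY * DRIVE_CAPACITY_TB


def chassis_capacity_pb() -> float:
    """Raw capacity of a chassis filled only with storage trays, in PB."""
    return TRAYS_PER_CHASSIS * tray_capacity_tb() / 1000


if __name__ == "__main__":
    print(f"Per tray:    {tray_capacity_tb()} TB")
    print(f"Per chassis: {chassis_capacity_pb():.2f} PB")
```

These are raw capacities; any RAID or replication overhead would reduce the usable space.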

Wiwynn's Open Cloud Server v2 nodes

Wiwynn, which is the custom system arm of Taiwanese contract manufacturer Wistron, also has its own twist on the new Microsoft-donated system design. Wiwynn's SV5270G2 offers up to 1 TB of main memory on the server tray, more than the Microsoft spec. The Wiwynn alterations also include four 3.5-inch disks and four SSDs on a server tray, and as you can see from the picture above, there is a variant with only two disks and another system component that was not identified. The server trays can also be equipped with a RAID disk controller and two external mini-SAS HD ports to link to external storage arrays.

Wiwynn is also rolling out the SV7220G2, which is a follow-on to the three-node "Winterfell" inspired Open Compute server. This node supports up to six SSDs per node and is aimed at workloads that need high I/O from storage as well as lots of compute. This machine is also based on the latest "Haswell" Xeon E5 processors from Intel and is certified to be compliant with OCP specs.

ZT Systems put out a statement about supporting the Open Cloud Server v2 spec from Microsoft, but has not yet divulged the feeds and speeds of the machines on its site. The company did, however, show off the iron at the Paris event. Hyve Solutions, the custom system arm of contract manufacturer SYNNEX, is also unveiling its own Open Cloud Server v2 machines at the Paris show, but has not posted the specs yet, either.

The important thing is that these companies support the Microsoft-donated OCP specifications and you can buy the machines without having to do a from-scratch engagement with a manufacturer.

The fun bit would be to see if any nodes made by the six manufacturers mentioned above can be interchanged in a single chassis and still work properly. This cross-vendor support within a single chassis is common in telecommunications gear, using the AdvancedTCA form factor and interconnect standards, which the telco industry created because its machines have such a long life in the field. (Decades sometimes instead of three, four, or five years with enterprise servers.) If OCP means anything, it means creating a similar standard for enterprises, and that should mean any server tray or node can plug into any compliant chassis or enclosure.
