Cloud Builders Push 25G Ethernet Standard
Unhappy with the cost per gigabit of bandwidth with current Ethernet switches and adapters, two of the cloud computing giants – Google and Microsoft – have teamed up with two switch chip providers – Broadcom and Mellanox Technologies – and one upstart switch maker – Arista Networks – to create a specification for Ethernet speeds that are different from those officially sanctioned by the IEEE. To get better port density and lower costs, the cloud providers want Ethernet to run at 25 Gb/sec and 50 Gb/sec inside the rack rather than the 10 Gb/sec, 40 Gb/sec, and 100 Gb/sec speeds that are currently available.
With the exception of the initial 3 Mb/sec and 10 Mb/sec speeds set back in the early 1980s, when Ethernet was just starting to be commercialized after running in the Xerox PARC labs for nearly a decade, Ethernet speeds have been created by the group of networking vendors participating in the IEEE. In March, the IEEE had a meeting in China and Microsoft put out what is referred to as a "call for interest" or CFI with the idea of establishing a 25 Gb/sec Ethernet standard with a 50 Gb/sec speed bump for certain applications. This CFI did not get ratified. But cloud builders like Google and Microsoft have compelling reasons to want such speeds for their Ethernet, and so they are going to form a consortium of their own, called, of course, the 25 Gigabit Ethernet Consortium, and make it happen, while at the same time adhering to the IEEE 802.3 specification that governs Ethernet.
Anshul Sadana, senior vice president of customer engineering at Arista Networks, explained the situation to EnterpriseTech. With a 40 Gb/sec switch, the current Ethernet specification calls for four lanes of traffic coming off the serializer/deserializer (SerDes) chips running at 10 Gb/sec speeds. (Because of encoding overhead, Sadana explained, they actually run at 11.25 Gb/sec, but this is not how people talk about it.) On a 100 Gb/sec link, there are two ways to get there: ten lanes running at 10 Gb/sec speeds or four lanes at 25 Gb/sec speeds. The use of these parallel links (running at 10 Gb/sec or 25 Gb/sec) leads to some design choices in both network interface cards and in switches that the cloud providers say do not align with their needs. While they are always happy to have more bandwidth, they are not willing to get it at a higher cost or with lower port density on their switches.
"With 40 Gb/sec, if you need four lanes, there are more elements on the switch chip, there is more power drawn, and this results in a lower port density compared to what could be achieved with single-lane devices," says Sadana. "SerDes have evolved from 1 Gb/sec to 10 Gb/sec to 25 Gb/sec, and they are all aligned to achieve the IEEE speeds. But parallel lanes are not the most cost optimized, and especially for the datacenter where you can put a large amount of servers in a rack and you need the right uplinks for those servers."
So, for instance, in a standard rack today, a top-of-rack switch running at 10 Gb/sec might have 48 downlink ports to servers with maybe four or eight uplinks to the aggregation layer of the network. But if you move to 40 Gb/sec downlinks to the servers, switches typically only have 32 or 36 ports. That is not enough to cover the machines in the rack, so you end up having to buy two 40 Gb/sec switches and leaving some of their ports orphaned.
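The rack arithmetic above can be sketched in a few lines of Python. This is only an illustration: the `switches_needed` helper is hypothetical, the rack is assumed to hold 48 servers (one per 10 Gb/sec downlink port), and uplink ports are ignored to keep the numbers simple.

```python
import math

# Assumption: one server per downlink port on the 48-port 10 Gb/sec switch.
SERVERS_PER_RACK = 48

def switches_needed(servers, ports_per_switch):
    """Hypothetical helper: return (switch count, orphaned ports) for one rack.

    Uplink ports are not modeled here.
    """
    switches = math.ceil(servers / ports_per_switch)
    orphaned = switches * ports_per_switch - servers
    return switches, orphaned

for ports in (48, 32, 36):
    count, orphaned = switches_needed(SERVERS_PER_RACK, ports)
    print(f"{ports}-port switch: {count} needed, {orphaned} ports orphaned")
```

Under these assumptions, a 32-port or 36-port switch forces a second switch into the rack and leaves 16 or 24 ports unused, respectively.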
While the cost per gigabit has come down over time from 1 Gb/sec to 10 Gb/sec to 40 Gb/sec as switches have evolved, Sadana says that a single lane of 25 Gb/sec has one pair of wires and is the sweet spot in terms of the lowest cost per gigabit. There is not a perfect apples-to-apples comparison between the proposed 25 Gb/sec standard and the real 40 Gb/sec switches out there on the market, which Sadana concedes. But if you do some math on the back of an envelope, as all of the members of the 25 Gigabit Ethernet Consortium have done, then a switch using a single lane running at 25 Gb/sec will draw somewhere between one half and one quarter of the power at the device level, and you get somewhere between 2X and 4X the port density on the switch and on the network interface. Over time, as 25 Gb/sec Ethernet switches come to market, Sadana predicts that 25 Gb/sec ports will cost less than twice what a 10 Gb/sec port does. The math the cloud builders are doing is that it will provide 2.5X the bandwidth of 10 Gb/sec Ethernet at 1.5 times the cost in half the power envelope with much higher port density.
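To see what those ratios mean per gigabit, here is a back-of-the-envelope sketch in Python. It takes the quoted figures at face value and assumes the "half the power" figure uses the same 10 Gb/sec port as its baseline, which the article does not state explicitly. The `per_gigabit` helper is hypothetical.

```python
# Ratios quoted in the article, read as relative to a 10 Gb/sec port.
BANDWIDTH_RATIO = 25 / 10   # 2.5X the bandwidth
COST_RATIO = 1.5            # 1.5 times the cost
POWER_RATIO = 0.5           # half the power (baseline is an assumption)

def per_gigabit(ratio, bandwidth_ratio=BANDWIDTH_RATIO):
    """Hypothetical helper: turn a per-port ratio into a per-gigabit ratio."""
    return ratio / bandwidth_ratio

cost_per_gbit = per_gigabit(COST_RATIO)
power_per_gbit = per_gigabit(POWER_RATIO)

print(f"Cost per gigabit:  {cost_per_gbit:.0%} of a 10 Gb/sec port")
print(f"Power per gigabit: {power_per_gbit:.0%} of a 10 Gb/sec port")
```

On those assumptions, each gigabit of 25 Gb/sec bandwidth would cost about 60 percent of what it does at 10 Gb/sec and draw about 20 percent of the power.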
Just about any Ethernet switch offers sub-microsecond latency these days, with somewhere between 500 nanoseconds and 1 microsecond being typical, and for Web-style, cloud-hosted applications, this port-to-port latency is fine. "If you are in this range with a search engine or analytics application, for instance, everything is good," says Sadana, making obvious exceptions for high frequency trading and other similar workloads where low latency (and the consistency of that low latency) is more important than raw bandwidth. The point is this: because of the relatively high cost of 40 Gb/sec switches and adapters, enterprises and cloud builders alike are looking for an alternative.
"If you choose 40 Gb/sec, you have to pay a premium, and as a result, many of the large cloud providers and large enterprises would not move to 40 Gb/sec until it is more cost effective – and that may not happen for many, many years to come," says Sadana.
While the proposed 25 Gb/sec spec put forth by the consortium is good for linking servers to each other with faster uplinks to the aggregation layer, some applications need even more oomph. And to that end, the consortium is proposing using a pair of 25 Gb/sec links to create a 50 Gb/sec Ethernet speed. While this doubling up of the lanes does require beefier chips in the switches and on the network interface cards, this supplies 25 percent more bandwidth than a 40 Gb/sec switch with half the number of lanes. These 50 Gb/sec ports are aligned to the bandwidth and cost profile of cloud storage and any other application that needs higher bandwidth.
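The lane arithmetic behind these speeds is simple enough to sketch in Python. The table below uses the nominal per-lane rates mentioned in the article and ignores encoding overhead. The `LINKS` table and `link_gbps` helper are names made up for this illustration.

```python
# (lanes, nominal Gb/sec per lane) for each link speed named in the article.
LINKS = {
    "25 GbE": (1, 25),
    "40 GbE": (4, 10),
    "50 GbE": (2, 25),
    "100 GbE (10 lanes)": (10, 10),
    "100 GbE (4 lanes)": (4, 25),
}

def link_gbps(name):
    """Hypothetical helper: nominal link speed is lanes times per-lane rate."""
    lanes, per_lane = LINKS[name]
    return lanes * per_lane

for name, (lanes, per_lane) in LINKS.items():
    print(f"{name}: {lanes} x {per_lane} Gb/sec = {link_gbps(name)} Gb/sec")

# Compare the proposed 50 Gb/sec link with today's 40 Gb/sec link.
extra_bandwidth = link_gbps("50 GbE") / link_gbps("40 GbE") - 1
lane_ratio = LINKS["50 GbE"][0] / LINKS["40 GbE"][0]
print(f"50 GbE vs 40 GbE: {extra_bandwidth:.0%} more bandwidth, "
      f"{lane_ratio:.0%} of the lanes")
```

The last line reproduces the comparison in the text: two 25 Gb/sec lanes deliver 25 percent more bandwidth than four 10 Gb/sec lanes while using half as many lanes.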
Sadana says that the consortium is not trying to recreate Ethernet, but is merely adding a few tweaks on top to support 25 Gb/sec and 50 Gb/sec and the autonegotiation that will be required to make the new speeds interoperable with existing Ethernet speeds. The consortium members are going to make the specs for the 25 Gb/sec and 50 Gb/sec variants available to any vendor, datacenter operator, or end user that joins the consortium. The members are working on a standard definition of the physical layer (PHY) and media access control (MAC) for these two speeds, and this also includes specs for virtual lane alignment and forward error correction as well as the autonegotiation mentioned above. It will take at least six months to get this spec hammered out, says Sadana, and it takes six months to a year to get an ASIC design out the door. So don't expect products compliant with 25 GbE or 50 GbE any time before the summer of 2015 or early 2016 – and it could take a little bit longer.
Kevin Deierling, vice president of marketing at Mellanox, tells EnterpriseTech that because the company demonstrated its first 100 Gb/sec InfiniBand switch just last week, laying the groundwork for switch and adapter ASICs, as well as cabling, that support the proposed 25 GbE and 50 GbE specs is not a big deal. "In terms of the core, underlying technologies we have sort of blazed the trail and the cabling, the SerDes technology, and the core process technology is all the same. Yes, it is a new set of ASICs, but 25 GbE is really just a subset of 100 Gb/sec."
Mellanox is working on an end-to-end solution it can bring to market in one fell swoop, and believes that customers should be looking to 100 Gb/sec at the core switch layer at this point, with 25 Gb/sec, 40 Gb/sec, and 50 Gb/sec at the top-of-rack, depending on their needs. The important thing, says Deierling, is to get the advantages of 25 Gb/sec signaling wherever it makes sense. As for timing, Mellanox is not saying when 25 GbE and 50 GbE products will come to market, but reiterated that it is slated to get 100 Gb/sec InfiniBand to market in late 2014 or early 2015, with Ethernet "shortly to follow" running at that rate. These new Ethernet offerings will come to market after that.
With Broadcom and Mellanox behind the effort and Google and Microsoft in line to be buyers, it will be interesting to see if Intel, Cisco Systems, Hewlett-Packard, and others who make their own switch ASICs will back the effort. It also seems likely that Amazon Web Services and Facebook will join the fray, and perhaps even some supercomputing centers that are wrestling with the same networking issues. Once switches and adapters that support these new speeds are out and are shown to be Ethernet through and through and interoperable with other Ethernet switches, and if cloud builders and large enterprises start adopting them, you can bet that other networking companies will join up and may even call for the IEEE to bring these speeds into the Ethernet fold. At the moment, the IEEE is focused on looking ahead, with the development of a 400 Gb/sec Ethernet standard.