Advanced Computing in the Age of AI | Tuesday, October 3, 2023

Stratus Moves Fault Tolerance From Hardware To Software 

In the modern hyperscale datacenter, companies such as Google and Amazon code their applications so they span multiple servers and replicate work and data sufficiently that the crash of a single machine does not bring down the application. In enterprise datacenters, this may be the way applications will work in the future, but for now, many applications span only one machine and need some kind of fault tolerance if they absolutely must remain available.

Stratus Technologies, which had its start as a maker of fault tolerant server clusters several decades ago, still peddles such machines today. Called ftServers, they lockstep applications running across two Xeon servers, perfectly mirroring each transaction down to the bit so that if one node fails, the other can just keep going. Such hardware-based lockstepping is expensive. Knowing this, Stratus worked for several years to create a virtual machine clustering program called Avance, which allowed the automatic failover of virtual machine partitions between two physically distinct servers using the XenServer hypervisor from Citrix Systems. It did not require special hardware lockstepping chipsets or precisely mirrored systems, as fault tolerant machines do.

A competing product from Marathon Technologies, called everRun MX, also embedded clustering, fault tolerance, and data replication inside the XenServer hypervisor, and it offered higher availability than Avance. Getting its hands on everRun MX is one of the reasons why Stratus acquired Marathon for an undisclosed sum in September 2012.

The new everRun Enterprise that Stratus will start shipping at the end of February shifts away from the XenServer hypervisor. It is based instead on the KVM hypervisor, which is backed by Red Hat and is increasingly associated with the OpenStack cloud controller.

The key thing about everRun Enterprise is that it offers availability levels that compete with hardware-based fault tolerant servers. Avance provided something on the order of 99.99 percent system availability, which works out to about 52 minutes of downtime each year. It was relatively inexpensive at $5,000 per server pair and had an easy-to-use interface, Nigel Dessau, chief marketing officer at Stratus, tells EnterpriseTech. The ftServers push availability up to 99.999 percent, which is a little more than five minutes of downtime per year; this is the expensive option, which can cost as much as $100,000 for a pair of heavily configured server nodes.
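
The downtime figures follow directly from the availability percentages. A minimal sketch of that arithmetic, assuming a 365-day year (the function name is made up for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected downtime per year, in minutes, for a given availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for label, pct in [("four nines", 99.99), ("five nines", 99.999)]:
    minutes = downtime_minutes_per_year(pct)
    print(f"{label} ({pct}%): {minutes:.2f} minutes of downtime per year")
```

Four nines comes out at roughly 52.6 minutes a year and five nines at roughly 5.3 minutes, matching the figures quoted for Avance and the ftServer.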

"We can achieve five nines or more in hardware and the same through everRun Enterprise," Dessau explains. "The thing that is different is that we think of the ftServer, which we develop in conjunction with NEC and which will be updated soon, as a hardware appliance. It lets companies roll in the appliance, put on their software, and have fault tolerance. It is a hardware appliance, and it comes with a hardware appliance support model. The thing that has always been a struggle for us is when customers do not want to buy hardware from us. They have relationships with Dell, IBM, or Hewlett-Packard. With everRun Enterprise, for the first time that is what we are doing."

With the ftServer, the chipset linking two nodes together gets them to lockstep all of their work, bit by bit, as data moves through the paired systems and gets chewed on. With everRun Enterprise, the software takes a checkpoint every few hundred milliseconds so that two different systems can keep memory, I/O, and applications synchronized; if an application instance fails in a virtual machine on the primary machine, processing continues on the secondary machine. It is not an HA clustering solution, where the backup node waits for the primary to fail before doing anything. It is always ready, instantly, for a crash. (This is obviously not aimed at high frequency trading systems, given the relatively high latency for replication between the two nodes.)
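
To make the checkpointing idea concrete, here is a toy model in Python. It is not how everRun Enterprise is implemented, which the article does not detail. All class and method names are hypothetical, and a count of operations stands in for the few-hundred-millisecond timer.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    state: dict = field(default_factory=dict)


class CheckpointedPair:
    """Toy model: the primary does the work, the secondary gets periodic state copies."""

    def __init__(self, checkpoint_every: int):
        self.primary = Node("primary")
        self.secondary = Node("secondary")
        # Operations between checkpoints (a stand-in for a time interval).
        self.checkpoint_every = checkpoint_every
        self.ops_since_checkpoint = 0

    def apply(self, key, value):
        """Run one operation on the primary and checkpoint when the interval is reached."""
        self.primary.state[key] = value
        self.ops_since_checkpoint += 1
        if self.ops_since_checkpoint >= self.checkpoint_every:
            self.checkpoint()

    def checkpoint(self):
        """Copy the primary's full state to the secondary."""
        self.secondary.state = copy.deepcopy(self.primary.state)
        self.ops_since_checkpoint = 0

    def fail_primary(self) -> Node:
        """Simulate a primary crash: the secondary takes over from the last checkpoint."""
        self.primary, self.secondary = self.secondary, Node("replacement")
        self.ops_since_checkpoint = 0
        return self.primary


pair = CheckpointedPair(checkpoint_every=3)
for i in range(7):
    pair.apply(f"txn{i}", i)

survivor = pair.fail_primary()
print(survivor.name, sorted(survivor.state))
```

In this run the secondary takes over holding txn0 through txn5. The seventh operation arrived after the last checkpoint and exists only on the failed node. The toy ignores how a real product deals with such in-flight work, but it does show why the checkpoint interval sets a floor on replication latency.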

The earlier everRun MX software from Marathon was difficult to use, says Dessau, so everRun Enterprise adopts the Avance front end.

In a few months, says Dessau, Stratus will offer a feature called SplitSite, which will allow the software checkpointing to be done between mirrored systems over campus distances and still be considered fault tolerant. It will also be stretched over longer distances, of say up to 100 miles or more, where the latencies will be too great for it to be considered fault tolerant but where it will still be an acceptable disaster recovery option. The exception is a telecommunications firm, as one everRun Enterprise early adopter is: it can get very low-latency links on the cheap for machines that are over 100 miles apart and still see latencies between nodes low enough to offer fault tolerance.
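
For a sense of scale, the sketch below computes the physical floor on round-trip time between two sites. It assumes light travels through optical fiber at roughly 200,000 km per second, about two-thirds of its speed in a vacuum, and it counts propagation delay only. The function name is hypothetical, and none of these figures come from Stratus.

```python
KM_PER_MILE = 1.609344
# Assumption: light in optical fiber covers roughly 200,000 km per second.
FIBER_KM_PER_SECOND = 200_000.0


def min_round_trip_ms(distance_miles: float) -> float:
    """Lower bound on round-trip time over fiber, counting propagation delay only."""
    one_way_seconds = distance_miles * KM_PER_MILE / FIBER_KM_PER_SECOND
    return 2 * one_way_seconds * 1000.0


for miles in (1, 10, 100):
    print(f"{miles:>3} miles: at least {min_round_trip_ms(miles):.3f} ms round trip")
```

At 100 miles the floor is about 1.6 milliseconds. Routed networks add switching, queuing, and indirect paths on top of that, which is in keeping with the point that a firm with its own direct, low-latency links has more room to work with than one buying ordinary connectivity.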


At the moment, everRun Enterprise is based on the KVM hypervisor that is embedded in Red Hat Enterprise Linux 6.5. It can be used on "Sandy Bridge" Xeon E5-2600 v1 or "Ivy Bridge" Xeon E5-2600 v2 servers with up to 256 GB of memory per instance and up to 24 virtual machines supported. (KVM can address more memory and run more VMs than this, and it is not clear why these caps are so low when the latent capacity of the underlying hardware is much larger.) Those VMs can run RHEL 6.5 or its freebie clone CentOS 6.5 as well as Microsoft Windows Server 2008 or Windows Server 2012. Technically, any operating system and application that is supported by KVM will work, says Dessau, but these are the ones that have been certified by Stratus.

For those who want to have fatter servers with software-based fault tolerance, the older everRun MX based on the XenServer hypervisor and with the less-slick interface could span four-socket servers. A four-processor version of everRun Enterprise will be available a bit later. Stratus intends to support both Xeon E5 and Xeon E7 processors, and it is possible that even eight-way servers could be lashed together and mirrored.

"Over the course of the year we will be moving into larger systems, and we work with customers to do this all the time," says Dessau. "These tend to be complicated to configure to ensure for good performance, so we tend to work those one-on-one with customers."

Stratus is also working on an n-node setup, which will do fault tolerance locally on a server pair and out to the cloud at the same time. This cloud version will put everRun Enterprise onto OpenStack clouds with the KVM hypervisor on each node in the cloud. Stratus is running a beta program for the everRun cloud version.

A license for everRun Enterprise costs $12,000 for a two-node pair. The SplitSite feature for campus-wide fault tolerance or long-range disaster recovery will be priced separately as an add-on when it is available. Ditto for an application monitoring module due in the second quarter.