Advanced Computing in the Age of AI | Saturday, December 9, 2023

Intel Software To Guarantee Virtual Machine Performance 

Virtual machines do not always behave themselves and Intel wants to change that through a combination of circuitry in its processors and software that it develops in-house and plugs into hypervisors and cloud controllers.

This software effort by Intel is yet another example of the company trying to get out in front of problems it sees in the datacenter to help shape the way the problems get solved, explains Billy Cox, who has been tapped to be general manager of its Service Assurance Administrator products. The Datacenter Group at Intel has created management software to help companies monitor and cap the power consumption and thermal dissipation of servers at the node and rack levels. Intel has already used its Trusted Execution Technology, or TXT for short, to secure hypervisors and the virtual machines that run atop them.

This low-level system management software is not the only kind of software that Intel develops, of course. Intel bought Whamcloud a few years back because Lustre needed a big company standing behind it to make sure it would be updated and thrive in commercial datacenters as well as in the supercomputing labs where Lustre is a popular alternative to IBM's General Parallel File System. For a while, at the behest largely of companies in China who wanted Intel to stand behind Hadoop, Intel created its own commercial Hadoop distribution, too. The company recently mothballed this effort, invested a huge sum of money in Hadoop juggernaut Cloudera, and anointed Cloudera as the preferred Hadoop distribution that will get help from Intel getting it tuned up on Xeon servers and, presumably, various Intel interconnects. Intel was pitching the combination of its Lustre and Hadoop distributions as a big data platform. Intel is also a big contributor to the Linux kernel and to projects like OpenStack. Intel wants such open source programs to run best on Intel iron, and it pays software engineers to work on that every day.

The new Datacenter Manager Service Assurance Administrator code – abbreviated DCM:SAA in the Intel lingo – aims to solve a few problems associated with server virtualization. The big one is meeting service level agreements on private and public clouds based on Xeon servers.

"The idea is to create a VM and give it a performance target and then use the deep platform instrumentation to ensure that performance no matter what CPU generation, clock frequency, cache size, and other variations are," explains Cox. "You want to have a service level agreement, and you want to know that you can hit that service level agreement. All you can do today is best effort."


Best effort is not the same as an SLA, as anyone who runs applications on any virtualized infrastructure will tell you. Today, in a KVM, Xen, ESXi, or Hyper-V environment, a cloud provider sizes up a set of virtual machines and benchmarks how many of each size of virtual machine a particular physical server configuration can host atop a particular hypervisor. Then the cloud controller is instructed to put no more than that number of VMs of that size on a particular host server, but a cap on the VM count does not guarantee a particular level of performance.

"We have seen studies that show a variation of between 17 and 40 percent for VMs within the same region inside of Amazon Web Services over the course of a week," Cox explains. "Some of this is processor generation differences within the regions and some of it is noisy neighbors on the servers. Amazon doesn't even try to commit to a performance level – that is not part of its value proposition. But if you go into an enterprise and you can't get an SLA, that is a problem."

As it turns out, Intel has lots of telemetry coming out of the processors that it can collect and monitor to help identify the noisy virtual machines within a particular server socket. On the current generation of "Ivy Bridge" Xeons, Intel can detect when there is a noisy VM on the socket, and it can do the same on three prior generations of Xeons (that is "Nehalem," "Westmere," and "Sandy Bridge" if you are playing code-name Bingo with me). The circuitry that would allow Intel's software to use the telemetry to identify which VM is misbehaving is not there yet. But with the "Haswell" Xeons coming this fall, Intel has figured out a way to look at cache contention in the processors to figure out which VM is eating more resources than it should within a processor socket. And in the generation beyond that – Cox didn't name the chip family, but it is the "Skylake" Xeons – Intel will be able to put a sandbox around a VM and contain its performance by associating a specific number of cache lines per VM through a set of registers in the chip.

To be able to guarantee performance, you have to be able to set a performance level.

"Most clouds talk about virtual CPUs, but one vCPU on Haswell is very different from one vCPU on Westmere," says Cox. "So we had to figure out a metric that is consistent across those generations and will be consistent going forward."

To that end, Intel has designated two billion instructions per second – or two GIPS, and no, you cannot make the good stuff up – as what it calls a service compute unit, or SCU. This is the base unit of cloud performance that the DCM:SAA software thinks in. The capacity of a physical system is mostly a function of the instructions per clock (IPC) that the architecture can deliver, since clock speeds have largely stalled for the past several years. Over the past several generations, the IPC has, on average, doubled, but the clock speeds have not. Rising core counts per processor also drive up performance per socket, and therefore per system; these have gone from four cores in the Nehalem generation to twelve cores in the Ivy Bridge Xeon E5s and fifteen cores in the Ivy Bridge Xeon E7s. (We are hoping to get a chart showing the GIPS per processor for generations of Xeon chips, but Intel did not have it ready at press time.) A recent two-socket Dell server – we are not sure what generation of chip was in it, or what clock speed – was rated at 146 SCUs, or 292 GIPS if you want to think of it that way.
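
The unit arithmetic is simple enough to sketch in a few lines of Python. The conversion factor comes straight from Intel's definition of two GIPS per SCU; the function names and the 4-SCU VM size are our own illustrations and are not part of DCM:SAA.

```python
# One service compute unit (SCU) is two billion instructions per second
# (2 GIPS), per Intel's definition. All names here are illustrative only.
GIPS_PER_SCU = 2.0


def scu_to_gips(scu: float) -> float:
    """Convert a rating in SCUs to GIPS."""
    return scu * GIPS_PER_SCU


def gips_to_scu(gips: float) -> float:
    """Convert a rating in GIPS to SCUs."""
    return gips / GIPS_PER_SCU


def max_vms(host_scu: float, vm_scu: float) -> int:
    """Count how many VMs with a given SCU target fit on a host.

    This ignores hypervisor overhead and contention, so it is an upper bound.
    """
    return int(host_scu // vm_scu)


print(scu_to_gips(146))  # the two-socket Dell server: 292.0
print(max_vms(146, 4))   # hypothetical 4-SCU VMs on that server: 36
```

The count of 36 is only a ceiling. Whether each of those VMs actually gets its SCUs is the problem the rest of the software tries to solve.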

The DCM:SAA software has some low-level telemetry tracking and counting software written in C++ that talks up to an agent for the KVM hypervisor that is written in Python, just like the rest of the OpenStack cloud controller is. There is also an extension to the Nova compute scheduler in OpenStack, and it is a filter driver that takes requests for the placement of VMs on a cloud. The DCM:SAA software has a contention score for each socket – a measure of how noisy each socket's VMs are – and based on the workload that is going to be deployed, it reckons whether or not the new VM will play well with the noisy socket. Intel keeps a time-series performance analysis of all of the sockets and their VMs in a Graphite real-time database. This database is, for the early testers of the software, as important as the sandboxing of VMs.
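
To make the placement idea concrete, here is a minimal sketch of contention-aware filtering in plain Python. It does not use the Nova filter API or any DCM:SAA code, neither of which Intel has described in detail. The 0-to-1 contention scale, the per-workload tolerance, the tie-breaking rule, and every name and number below are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Socket:
    name: str
    capacity_scu: float   # total SCUs the socket can deliver
    allocated_scu: float  # SCUs already promised to running VMs
    contention: float     # assumed scale: 0.0 (quiet) to 1.0 (very noisy)


@dataclass
class VMRequest:
    name: str
    target_scu: float            # performance target for the new VM
    contention_tolerance: float  # noisiest socket this workload can accept


def passes(socket: Socket, request: VMRequest) -> bool:
    """Filter step: the socket needs spare SCUs and must be quiet enough."""
    has_room = socket.capacity_scu - socket.allocated_scu >= request.target_scu
    quiet_enough = socket.contention <= request.contention_tolerance
    return has_room and quiet_enough


def place(sockets: List[Socket], request: VMRequest) -> Optional[Socket]:
    """Pick the quietest socket that passes the filter, or None if none do."""
    candidates = [s for s in sockets if passes(s, request)]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.contention)


# Made-up inventory: 73 SCUs per socket is half of the 146-SCU server.
sockets = [
    Socket("host1-s0", capacity_scu=73, allocated_scu=60, contention=0.2),
    Socket("host1-s1", capacity_scu=73, allocated_scu=20, contention=0.8),
    Socket("host2-s0", capacity_scu=73, allocated_scu=40, contention=0.3),
]
vm = VMRequest("db01", target_scu=16, contention_tolerance=0.5)

chosen = place(sockets, vm)
print(chosen.name if chosen else "no fit")  # host2-s0
```

In this example the first socket is quiet but full and the second has room but is too noisy, so the request lands on the third. A count-based cap alone would have treated the second socket as a perfectly good home.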

"If you have an application in a shared environment, and it is behaving poorly, most often it has done something unexpected. But you get no help from the infrastructure, no clues. Folks want to know what changed."

The DCM:SAA software will be used to monitor and manage compute capacity on virtualized servers first. Cox says that quality of service for storage and networking will come next, and that it will probably add quality of service controls for memory bandwidth further down the line. "Most cloud apps that we have tested never come anywhere near saturating the memory channels," says Cox. "But HPC applications, absolutely they do."

Intel has not open sourced the DCM:SAA code, but parts of it have been put into the upstream OpenStack and KVM schedulers. Intel has prototyped the code running in conjunction with Xen, which is important for the big cloud operators – Amazon Web Services, Rackspace Hosting, and IBM SoftLayer all use a variant of Xen. Intel is not talking about when DCM:SAA software will be available for Xen, though. (Google Compute Engine uses KVM as its server slicer, and Microsoft Azure uses Hyper-V.) Intel is not saying when or if support for Hyper-V or VMware's ESXi hypervisor might appear, but both seem like logical additions. Intel started with KVM because the combination of OpenStack and KVM seems to be appealing to enterprise customers, even if many of the big clouds are still using Xen. The DCM:SAA console plugs into the OpenStack console, and exposes the usual RESTful APIs so it can be driven programmatically instead of by pointing and clicking.

To test the DCM:SAA software, Intel had to come up with a noisy VM, and video transcoding is a very noisy workload. Transcoding apps bring bits of data into memory, do a little work on them, and then get more data, and as such, the L2 and L3 caches tend to get in the way, as Cox put it. When you run other workloads beside them, you might see a 4 or 5 percent performance hit, which is not a big deal, but in situations where there is memory contention, you might see a 25 percent performance hit because of the noisiness of the video transcoding apps. On a series of UnixBench tests run in the lab across four different generations of hardware, the DCM:SAA tool was able to keep the UnixBench performance of VMs with a set GIPS rating to within 3 percent of each other.

We are hoping to get some data on these benchmarks as well as a GIPS table for various Intel CPUs, and if we can lay our hands on it, we will bootnote this story with that information.