Advanced Computing in the Age of AI | Monday, July 15, 2024

Lustre Running Better Than Expected On AWS 

A few weeks ago, Intel made its variant of the Lustre file system, popular at supercomputing centers and seeing more and more use among enterprises that need zippy access to files, available on the Amazon Web Services Marketplace. This Intel Cloud Edition for Lustre is performing a bit better than expected on the Amazon cloud, and even Intel is pleasantly puzzled as to why.

Putting Lustre up on the AWS cloud is one way to try to get early adopters to give it a whirl and perhaps eventually use it for application development and then production, either in their own datacenters or on AWS. Many AWS customers use Amazon's public cloud for development but still run their key applications back on their own iron. The point is, customers can now spin up a scale-out, POSIX-compliant file system with a couple of mouse clicks and a credit card number.

"The interest in Lustre on AWS has been significant, and better than I had hoped for," Brent Gorda, general manager of the high performance data division at Intel, tells EnterpriseTech. "The efficiency and scalability of Lustre will give you performance numbers much better than you would imagine. The one interesting figure that I like to trot out is that a year ago, before the multiple metadata server work we had done, any Lustre file system would peak out at 20,000 file creates per second. This is one of the areas that Lustre users, especially at the high end, had complained about. A few months ago, we demonstrated 160,000 file creates per second on AWS on their plain vanilla hardware – there was nothing special about it. On applications that are heavy on metadata, you can run faster on AWS than you could have on any file system last year. And on dedicated hardware in your own datacenter, you can run even faster because you can control that environment and maybe use SSDs."

In fact, says Gorda, one of the big Lustre users in Europe (who wishes to remain anonymous) is putting flash storage into its Lustre metadata servers to try to push the performance up even higher. Pharmaceutical and media and entertainment companies were the first tire kickers, using Lustre on AWS before Intel made its announcement. Financial services firms, which are often on the cutting edge of technology, are also looking increasingly to Lustre to speed up their systems and the applications that run atop them, but they tend to do this on their own systems. Intel is working with the OpenStack community to make Lustre a file system option for clouds based on that cloud controller, too.

Intel has already done such integration work for AWS, which has its own cloud controller (called CloudFormation) and which automatically fires up metadata servers and object storage servers for a Lustre cluster atop EC2. Intel had to do a lot of work to orchestrate how the servers are loaded up on EC2, but Gorda says that it is still standard Lustre under the covers. In this case, it is Intel Enterprise Edition for Lustre 1.01, which is supported atop Red Hat Enterprise Linux 6.4 or its clone, CentOS 6.4. The Amazon Machine Image that Intel has wrapped up specifically for EC2 puts Lustre on CentOS and includes the Lustre Monitoring Tool and the open source Ganglia system monitor. There is a community edition of the service that is free, but you still have to pay for EC2 and EBS services. If you want Intel tech support, you have to pay extra for that.

According to the AWS Marketplace price list, on an M3 Extra Large EC2 instance, which has four virtual CPUs and 15 GB of virtual memory, EC2 costs 45 cents per hour and the Lustre software and support costs an additional 8 cents per hour. For an M3 Double Extra Large instance, which has twice the vCPUs and vRAM, the EC2 instance costs 90 cents per hour, but the Lustre support costs the same 8 cents per hour as it does on the half-sized instance.
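To make that arithmetic concrete, here is a minimal Python sketch that uses only the hourly prices quoted above. The function names and the four-server cluster in the usage note are illustrative assumptions, and EBS storage charges, which are billed separately, are left out.

```python
# Hourly prices quoted in the article, in US dollars.
# EBS charges are billed separately and are NOT included here.
PRICES = {
    "M3 Extra Large": {"ec2": 0.45, "lustre": 0.08},
    "M3 Double Extra Large": {"ec2": 0.90, "lustre": 0.08},
}


def hourly_cost(instance_type, servers=1):
    """Combined EC2 plus Lustre charge per hour for `servers` instances."""
    price = PRICES[instance_type]
    return round((price["ec2"] + price["lustre"]) * servers, 2)


def lustre_share(instance_type):
    """Fraction of the combined hourly charge that goes to Lustre software and support."""
    price = PRICES[instance_type]
    return price["lustre"] / (price["ec2"] + price["lustre"])


if __name__ == "__main__":
    for name in PRICES:
        print(f"{name}: ${hourly_cost(name):.2f}/hour, "
              f"Lustre share {lustre_share(name):.1%}")
```

Because the Lustre charge is flat, its share of the bill shrinks as the instance grows: roughly 15 percent of the combined charge on the smaller instance and about 8 percent on the larger one. A hypothetical cluster of four of the larger instances would come to `hourly_cost("M3 Double Extra Large", servers=4)`, or $3.92 per hour before storage.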

In some cases, Lustre is performing better than expected on AWS, and Gorda and his team are still puzzling out why.

"Amazon is very secretive about what hardware it is using," Gorda says, and this is still true even though the company has been giving a few details here and there about underlying processors so customers can tune their applications better. "They don't actually tell you what the networking is, for instance. You just sign up for CPU capacity and performance. We have been finding instances where we are running Lustre faster than we would expect. This is an interesting signal that they have some good hardware behind EC2."

Gorda says that Intel figured Amazon had 10 Gb/sec Ethernet links between its EC2 server instances, and speculates that Amazon could be dual-bonding ports coming out of the physical servers to boost the bandwidth. It is possible that Amazon is using InfiniBand links, too. "What I can tell you is that we have seen performance that is better than one 10 Gb/sec Ethernet port can do," says Gorda.
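Gorda's reasoning can be sketched in a few lines of Python. A 10 Gb/sec port tops out at 1.25 GB/sec before any protocol overhead, so any sustained rate above that implies more than one port or a faster fabric. The function names and the 1.6 GB/sec reading in the usage note are hypothetical and are not figures from Intel or Amazon.

```python
import math


def link_capacity_gbytes(gbits_per_sec=10.0, ports=1):
    """Raw capacity in gigabytes per second, ignoring protocol overhead."""
    return gbits_per_sec * ports / 8.0


def min_ports_needed(observed_gbytes_per_sec, gbits_per_sec=10.0):
    """Smallest number of bonded ports whose raw capacity covers the observed rate."""
    per_port = link_capacity_gbytes(gbits_per_sec)
    return max(1, math.ceil(observed_gbytes_per_sec / per_port))


if __name__ == "__main__":
    observed = 1.6  # hypothetical measured throughput in GB/sec
    print(f"One 10 Gb/sec port: {link_capacity_gbytes():.2f} GB/sec")
    print(f"Ports needed for {observed} GB/sec: {min_ports_needed(observed)}")
```

With the hypothetical 1.6 GB/sec reading, the sketch reports that at least two bonded 10 Gb/sec ports would be needed, which is consistent with the dual-bonding guess. It cannot distinguish bonded Ethernet from a faster interconnect such as InfiniBand.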

On the AWS cloud, files are stored on the Elastic Block Store (EBS) service, not on local disks inside the physical server, as is the case with a real Lustre cluster. The object storage server looks like it has local disk as far as Lustre is concerned, but it is really reaching out to the EBS service.

This is an important distinction between physical and AWS Lustre file systems. Because the object storage server doesn't really have the disks, you can migrate data from the file system (in EBS) to the S3 object store on the Amazon cloud. Moreover, you could also hibernate the entire Lustre file system, fire up a different number of object storage server nodes to boost performance, and "rehydrate" the file system, as Gorda put it. This latter bit is something that Intel's software engineers are working on now. This would allow users to tailor performance and capacity.

Lustre and Hadoop Mashup

In a related development, Intel is working with early adopters to put the combination of its versions of the Lustre file system and the Hadoop analytics platform through its paces. This so-called HPC Edition of Hadoop is not restricted to traditional high performance computing shops and, given the benefits of the combination of the two pieces of software, should see uptake among enterprise customers who are looking for something a bit faster than the Hadoop Distributed File System that was created to underpin the MapReduce layer in Hadoop.

"We are inviting beta users to get some testing at scale and to address some of the interest in benchmarking. The reality is that we do not have systems that are large enough to benchmark at the altitudes that we would like to do. We have users who are running on a hundred nodes, that kind of thing. We want to do bigger configurations of the combination to see how it will perform. Of course, on the Lustre side, I am not that concerned. Lustre is reasonably well known at high altitudes, and is very resilient as well. Lustre went through a trough of disillusionment back in 2005 and 2006, but it has staged quite a comeback in the last few years. There's less /temp and more /home."