Advanced Computing in the Age of AI | Friday, March 29, 2024

Hadoop Speed Up From Lustre Will Attract Enterprises 

Intel wants Hadoop analytics and the Lustre high-performance file system to work well together for enterprises. The idea is to ease the integration woes of customers who run both simulations and analytics on the same data, such as those in the financial services and oil and gas industries, and to boost the performance of Hadoop at the same time.

Back in June, Intel announced the integration of its versions of Hadoop and the Lustre file system. This software started shipping in late August. EnterpriseTech checked in with Intel's Data Center Software Division to see what benefits customers can expect from the combination of the two pieces of software and what early adopter customers are most likely to give the Hadoop-Lustre stack a whirl on their clusters – and why.

There are a number of reasons why enterprises should consider running Hadoop alongside of or on top of Lustre, explains Girish Juneja, general manager of big data software and services at Intel and also the CTO of the company's Data Center Software division.

The first reason – and always the big one with enterprise customers – is performance.

"With the Lustre file system you have to ability to use a very large and very fast file system, which was designed for high performance compute environments," says Juneja. "And that performance element, especially when you go into the shuffle phase of Hadoop, where you are moving files around, that is where you see that using a file system like Lustre, with fast file I/O, tends to accelerate your overall performance."

A number of big supercomputing labs are putting the Hadoop-Lustre combination through the benchmark paces at the moment, and they expect to announce their results along with Intel at the SC13 supercomputing conference in Denver in November. But Juneja is the sporting type, and said that the performance improvements on Hadoop jobs would be significant. We are not talking about a few tens of percent here, but much larger bumps. He cautioned that not all of the benchmark data is in yet. "Hadoop use models are so varied, I don't want to be trite by throwing out a number," Juneja said.

So we are going to have to wait a bit longer for the benchmark data.

The other reason to slide Lustre underneath Hadoop is that it is a POSIX-compliant file system, which means you can mount it with NFS and other file systems and either access the data that Hadoop creates directly and stores in Lustre (if you run it that way) or you can move data from other file systems into Lustre so Hadoop can see it.
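To make the POSIX point concrete, here is a minimal sketch of what direct access could look like: a plain Python script reading a job's output with ordinary file calls and no Hadoop client at all. It assumes the job wrote tab-separated text into `part-*` files, which is the usual MapReduce text output layout. The function name `read_job_output` and the mount path mentioned in the comments are hypothetical, and a temporary directory stands in for the Lustre mount so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def read_job_output(output_dir):
    """Collect (key, value) pairs from the part-* files in a job output directory.

    Only ordinary POSIX file operations are used; no Hadoop client is involved.
    """
    records = []
    # Marker files such as _SUCCESS do not match the part-* pattern and are skipped.
    for part in sorted(Path(output_dir).glob("part-*")):
        with part.open() as handle:
            for line in handle:
                key, _, value = line.rstrip("\n").partition("\t")
                records.append((key, value))
    return records


if __name__ == "__main__":
    # Stand-in for a mounted directory such as /mnt/lustre/job-output (hypothetical path).
    with tempfile.TemporaryDirectory() as mount:
        Path(mount, "part-r-00000").write_text("alpha\t3\nbeta\t5\n")
        Path(mount, "part-r-00001").write_text("gamma\t1\n")
        for key, value in read_job_output(mount):
            print(key, value)
```

The same idea works in the other direction: a simulation code that writes files into the mounted directory with normal file calls leaves them where a Hadoop job configured against that file system can pick them up.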

The initial use cases for the Hadoop-Lustre combination are right where EnterpriseTech expects them: the supercomputing labs that already have Lustre feeding their simulation clusters and their enterprise analogs in the oil and gas sector or in big pharma that also have Lustre clusters because of their need for high performance.

"In either case, you are primarily running a simulation cluster, but with a typical 80-20 spread," says Juneja. "About 20 percent of the time you need to run analytics workloads on the cluster using the datasets you already have collected in the Lustre file system. So how do you do it? Today's options are to setup a separate cluster using HDFS and move the data around to do that analysis, or if you use the Intel Hadoop distribution on top of Lustre, you can do the analytics right on top of that Lustre file system and you don't have to set up a separate cluster and bear the cost of managing and maintaining it."

One of the drivers of the Intel Hadoop-Lustre combo is Cray, which partnered with Intel back in February to make Intel's Hadoop distribution available on top of its CS300 clusters and Sonexion storage arrays.

"Cray obviously works with both the university and HPC environments, but also they are pushing more and more into the high-end enterprise HPC environment. That is where we are seeing some interesting customer interest in using the combination of Hadoop and Lustre as well. Our target is not just labs, but the high-end enterprise HPC customers and then, over time, it will bleed into more mainline use of Lustre for enterprise storage above and beyond the enterprise HPC environment."

The early commercial targets for the Hadoop-Lustre combination are the usual suspects. Financial services firms do a lot of simulation, such as Monte Carlo and other options pricing simulations, on HPC clusters and they want to be able to do other analytics on that data. Ditto for oil and gas companies, which want to do analytics on their seismic data separate from the simulations themselves. Manufacturers will also see some benefits, said Juneja, and he gave the example of an airplane manufacturer doing analytics on a cluster that houses all of the design information for the aircraft.

"Think about the number of data points you are collecting in that simulation," explains Juneja. "There are hundreds of variables that you are collecting in that entire simulation at a very high speed that you are storing in a high-performance file system like Lustre. If you want to run a simple kind of warehouse-type query on that data set, how would you do it using standalone Hadoop? You can't do it unless you move that data off Lustre. It is as simple as that. And if you want to run more complex queries and run a machine learning algorithm on top of that data set using Mahout to predict machine failure, that would be another example of running that Hadoop analytics on top of the data, sitting where it is inside Lustre rather than having to move it."

By the way, Intel doesn't expect to charge customers an arm and a leg to mix Hadoop and Lustre together, and has come up with a pricing scheme that reflects usage.

"Intel's objective is to make sure that the usage of these two technologies expand," says Juneja. "Our intent is not to maximize price, but expand usage models. So when we combine Hadoop and Lustre, we charge a small premium above baseline costs."

So, if you are primarily a Lustre shop and you want to do some Hadoop processing on your data, the price will be based on the mix of workloads running on the cluster, not on the number of nodes multiplied by the list price per node for Hadoop and Lustre sold separately. Conversely, if you are a Hadoop shop and you have HDFS underpinning your cluster but have some Lustre nodes off to the side to speed up that shuffle phase between mapping and reducing, then you will pay a small premium over and above the cost of the Hadoop support licenses.

It is going to take some time for Hadoop and Lustre to take off together among large enterprises, Juneja says, but Intel is in it for the long haul. Intel has some of the largest Lustre customers in the world, and has a few very large Hadoop shops as well, he says. But the combination of the two is just getting rolling among enterprise users, and will take time. That is not just because Intel is working through partners like Cray, which peddle the systems and storage to those shops, instead of selling directly, but also because enterprises are more risk averse and take more time to absorb new technology.

Intel has had its own Lustre file system distribution since acquiring Whamcloud back in June 2012. The latest release, called Intel Enterprise Edition for Lustre, has plug-ins that make Lustre appear to be the native HDFS file system that Hadoop expects even though it is not. What that means is that you can run Hadoop on the same physical clusters that make up a Lustre clustered file system.

Intel has been shipping its own variant of the Hadoop big data muncher since February of this year, after quietly shipping two releases to customers in China for nearly two years. Intel's Hadoop distribution is based on Apache Hadoop 2.0 and includes the Yarn MapReduce v2 distributed processing framework as well as the HDFS 2.0.3 distributed file system. Intel also grabs the HBase columnar data store and Hive SQL query tools that ride on top of the batch processing system from Apache. Intel has tweaked a bunch of the other parts of the Hadoop stack, including R connectors for statistical analysis of data stored in HDFS, Mahout machine learning, Pig scripting, Oozie workflow management, Flume log collection, Sqoop data exchange, and Zookeeper configuration management. Intel has contributed all of its changes to these Hadoop-related projects back to the open source community, and it has also created its own Intel Manager for Apache Hadoop controller to deploy, configure, and monitor a Hadoop cluster. This management tool is not open source.

Juneja revealed one last thing: Intel has created a plug-in for the SLURM job scheduler commonly used on HPC clusters that lets it reach into the Yarn scheduler for Hadoop. Up until now, if you had a mixed HPC simulation and Hadoop cluster, you would have to manually manage taking jobs offline in one scheduler and starting new ones up in the other. Now, you can have SLURM, or any other HPC job scheduler for that matter, tell Yarn what to do and when to do it on a hybrid cluster.

"The idea is that the framework is extensible and it is open," says Juneja. "So if you have your own favorite HPC scheduler, you can build an easy extension from Yarn into it so you have one mechanism of control."

The SLURM integration with Yarn is also expected to debut at the SC13 show.
