Advanced Computing in the Age of AI | Saturday, September 23, 2023

Cluster Sizes Reveal Hadoop Maturity Curve 

If you want to get a rough sense of how mature a Hadoop installation is, all you need to do is count the server nodes.

The largest Hadoop clusters in the world are at Internet companies and have maybe 5,000 nodes in them today, according to Mike Olson, chairman and chief strategy officer at Cloudera, the largest commercial Hadoop distributor. He told EnterpriseTech in a recent interview that these clusters have grown from around 3,500 nodes last year, so they are not quite doubling in size. That is because disk drives are fatter and X86 processors have more cores, which means aggregate storage and compute capacity is probably still growing faster than the node count in many cases.

As other industries adopt Hadoop, they seem to follow a pattern similar to that of the early Internet companies, which ramped up their use of the system for storing and processing data four or five years ago. EnterpriseTech talked to two other Hadoop distributors, MapR Technologies and Pivotal, to get a read on the cluster configurations in industries outside of the Internet giants.

Companies start out slow with Hadoop for a number of different reasons, and not just because they are conservative by nature.

"The adoption of Hadoop is driven by the availability of use cases for that vertical," explains Susheel Kaushik, senior director of technical product marketing at Pivotal. "The larger the number of publicized use cases, the larger the adoption of Hadoop within the business. If there are few known use cases, then customers are a little tentative in making significant investments in Hadoop. The usual challenge is the lack of creativity among the business users to understand how to leverage the big data platforms within their business, as well as their inherent nature of being a technology follower instead of an adoption leader."

Kaushik says that the average Hadoop cluster size follows a fairly predictable curve. "Our observation is that companies typically experiment with cluster size of under 100 nodes and expand to 200 or more nodes in the production stages. Some of the advanced adopters' cluster sizes are over 1,000 nodes."

The size of the Hadoop cluster is often driven more by the storage capacity and I/O required for the applications riding atop it than by the compute capacity of the processors inside the box. "We expect that considering that most of the Hadoop workload is I/O bound, the sizing will continue to remain storage bound," says Kaushik.
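To make the storage-bound sizing argument concrete, here is a rough sketch of how a node count might be estimated from data volume alone. The numbers are illustrative assumptions, not figures from Pivotal: 48 TB of raw disk per node (sixteen 3 TB drives), three copies of each block (the HDFS default replication factor), and a hypothetical `fill_fraction` that leaves headroom on each node.

```python
import math


def nodes_needed(data_tb, raw_tb_per_node, replication=3, fill_fraction=0.75):
    """Estimate node count for a storage-bound cluster.

    data_tb          -- logical data to store, in terabytes
    raw_tb_per_node  -- raw disk capacity per node, in terabytes
    replication      -- copies kept of each block (assumed: 3)
    fill_fraction    -- share of raw disk we allow to fill (assumed: 0.75)
    """
    if not 0 < fill_fraction <= 1:
        raise ValueError("fill_fraction must be in (0, 1]")
    usable_tb_per_node = raw_tb_per_node * fill_fraction
    return math.ceil(data_tb * replication / usable_tb_per_node)


# One petabyte of logical data on nodes with 48 TB of raw disk each.
with_headroom = nodes_needed(1000, 48)
filled_to_the_brim = nodes_needed(1000, 48, fill_fraction=1.0)
print(with_headroom, filled_to_the_brim)
```

Compute capacity does not appear anywhere in the estimate, which is the point of calling the sizing storage bound: under these assumptions, the node count moves only when the data volume, replication, or drive capacity does.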

The clusters also grow as the number of use cases within the company rises. "The use cases are getting pretty diverse," says Jack Norris, chief marketing officer at MapR. And that is driving up cluster sizes.

For instance, MapR has one retail customer with 2,000 nodes, with Hadoop driving eight separate applications. One application drills into social media data to try to understand what regional differences exist among its customers to determine what will and will not sell in those regions. Hadoop is also the engine for sentiment analysis. This retailer is also chewing on data to see the differences between in-store behavior and online behavior by shoppers. The retailer tracks customers as they move through the store, seeing where shoppers pause and browse, much as they track customers online, watching what they look at in their Web browsers. This retailer is also using Hadoop as part of a system to offer in-store pickup to compete against other online retailers.

Another MapR customer in the financial services sector has a 1,200-node Hadoop cluster that, among other tasks, is used in fraud detection.

Like Pivotal, MapR sees storage capacity as a key factor in determining the size of the cluster.

"We're seeing close to 50 terabytes per node," says Norris of recent installations. For customers who are deploying on Cisco UCS machines, for instance, Norris says they can get sixteen drives in a node, and they tend to use 3 TB drives. Some customers are using quad-core processors, so in a two-socket machine they get two disks for each core.

In the MapR reference architectures from IBM and Hewlett-Packard, the disk-to-core ratio is still low. In the System x reference architecture that MapR has put together with IBM, the Hadoop compute nodes have two eight-core Xeon E5 processors, two drives mirrored for the operating system and a dozen for data storage; SATA drives come in 3 TB or 4 TB capacities. The SL4500 hyperscale setup tuned for Hadoop from Hewlett-Packard has three two-socket server nodes, each with eight-core processors, and 45 fat SATA drives in the chassis. That works out to 15 disks against 16 cores in each node, one disk per node shy of a one-to-one ratio between cores and disks.
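The disk and core counts quoted for these machines are easy to check with a little arithmetic. The sketch below tabulates the three configurations mentioned above; the `NodeConfig` class and its field names are made up for illustration, and the Cisco UCS entry assumes all sixteen drives hold data.

```python
from dataclasses import dataclass


@dataclass
class NodeConfig:
    """One enclosure as described in the article (hypothetical helper)."""
    name: str
    nodes: int             # server nodes in the enclosure
    sockets_per_node: int
    cores_per_socket: int
    data_disks: int        # data drives across the whole enclosure

    @property
    def cores(self) -> int:
        return self.nodes * self.sockets_per_node * self.cores_per_socket

    @property
    def disks_per_core(self) -> float:
        return self.data_disks / self.cores


configs = [
    NodeConfig("Cisco UCS node (quad-core)", 1, 2, 4, 16),
    NodeConfig("IBM System x node", 1, 2, 8, 12),
    NodeConfig("HP SL4500 chassis", 3, 2, 8, 45),
]

for c in configs:
    print(f"{c.name}: {c.cores} cores, {c.data_disks} data disks, "
          f"{c.disks_per_core:.2f} disks per core")

# Raw capacity of the sixteen-drive node when fitted with 3 TB drives.
ucs_raw_tb = 16 * 3
print(f"Raw capacity per UCS node: {ucs_raw_tb} TB")
```

The sixteen-drive node comes to 48 TB raw, in line with the "close to 50 terabytes per node" that Norris cites. The SL4500 chassis holds 45 disks against 48 cores, or 15 disks for the 16 cores in each of its three nodes.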

"There is a time element to all of this," says Norris. "What I have been surprised by is the appetite once they prove it works. It tends to expand quite rapidly. The proofs of concept start out with ten or twenty nodes, and then the initial deployment is maybe 50 to 100 nodes. Again, it depends on what they are doing, how much data they need to store. And as the use cases for Hadoop grow, the cluster can grow to 1,000 to 2,000 nodes within a year."

Here's another interesting trend that MapR has noticed: If the Hadoop cluster starts with the IT department, it tends to be a broader deployment, so the growth is faster than if it is a departmental machine. "There is an organic pull across departments, too," says Norris. "So, for instance, at a big bank that implemented fraud detection, the marketing department found out about the Hadoop cluster and started using it because the cluster already had the customer data in it."

If the Hadoop cluster is used as a data warehouse offload, it can also grow very fast. One reason is that offloading some of the data from the warehouse is relatively easy, and all of the reporting that people are used to can be made to work. The economics also play in favor of Hadoop, which Norris says costs on the order of a few hundred dollars per terabyte to store raw data compared to something like $16,000 per terabyte for a data warehouse based on relational databases.
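A back-of-the-envelope comparison shows how wide that gap is. In the sketch below, the $16,000 figure comes from Norris; the $300 per terabyte for Hadoop is an assumed stand-in for "a few hundred dollars," and the 100 TB offload is a hypothetical workload.

```python
WAREHOUSE_USD_PER_TB = 16_000  # figure cited by Norris
HADOOP_USD_PER_TB = 300        # assumption standing in for "a few hundred dollars"


def storage_cost(data_tb, usd_per_tb):
    """Cost in dollars to store data_tb terabytes at a flat per-terabyte rate."""
    return data_tb * usd_per_tb


offload_tb = 100  # hypothetical amount of data moved off the warehouse

hadoop_cost = storage_cost(offload_tb, HADOOP_USD_PER_TB)
warehouse_cost = storage_cost(offload_tb, WAREHOUSE_USD_PER_TB)
cost_ratio = warehouse_cost / hadoop_cost

print(f"Hadoop:    ${hadoop_cost:,}")
print(f"Warehouse: ${warehouse_cost:,}")
print(f"Warehouse costs about {cost_ratio:.0f}x as much per terabyte")
```

Even if the real Hadoop figure is two or three times the assumed rate, the warehouse remains more than an order of magnitude more expensive per terabyte.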

Cluster size is also a proxy of sorts for the industry in which the Hadoop user is categorized. Yahoo, Facebook, Twitter, and other hyperscale Internet companies were Hadoop innovators, and they have massive amounts of data as well, so you expect them to have large clusters, says Kaushik. Generally speaking, the early adopters in the financial services, telecommunications, and biotech industries are further along in their use of Hadoop than are those in the manufacturing, retail, education, agriculture, and energy industries. So the former will tend to have larger clusters. There are exceptions to every rule, of course, as the retailer cited by MapR above most certainly is.

The thing to keep in mind is that once Hadoop comes into the IT shop and people figure out how to use it, the cluster size can grow rapidly. This is the sort of thing companies have to plan for rather than be surprised by.