Advanced Computing in the Age of AI | Saturday, September 23, 2023

It’s 3 a.m. — Do You Know What Your Cluster’s Doing? 

Performance challenges in Hadoop environments are par for the course as organizations attempt to capture the benefits of big data. The growing ecosystem of tools and applications (including new analytics platforms) for Hadoop is becoming increasingly distributed, making the challenges of using Hadoop in production exponentially more complicated. Which begs the question, do you know how your cluster will react to this kind of growth and change in usage? It might be operating just fine today, but what happens two months from now once hundreds of new workloads have been added?

Distributed systems make optimal performance a particularly hard goal to achieve. When your cluster has hundreds of nodes, and each node has dozens of jobs running independently and using up CPU, RAM, disk and network — and the levels of resource consumption are constantly changing – it quickly becomes an extremely chaotic system. Businesses need a way to bring order to the distributed systems chaos, especially when usages are constantly in flux and dynamically changing, and therefore impact performance. So where does one start?

Finding a way to measure Hadoop performance is the first step, and the best way to do that is by establishing Quality of Service (QoS) for Hadoop. QoS provides the ability to ensure performance service levels for Hadoop applications by enabling prioritization of critical jobs. It means that multiple jobs can run side-by-side, safely and effectively, since bottlenecks and contention can be a averted. The types of workloads that typically take priority on the cluster include transactional systems like HBase, data ingest/ETL jobs that carry strict service level agreements (SLAs), and daily runs of critical data products such as direct mailings sent to customers. These types of jobs take priority because they need to maintain consistent performance in the face of other less-critical jobs like ad hoc or analytic workloads.

Today’s multi-tenant, multi-workload clusters face major performance issues caused by the fundamental limitations of Hadoop, especially on large scale big data implementations, because after a certain threshold contention is inevitable. The implications of these problems are massive. From a business perspective, companies are wasting time and money trying to fix cluster performance issues that prevent them from realizing the full benefits and ROI of their big data efforts, or gaining any sort of competitive advantage linked to big data initiatives. On the other hand, unreliable Hadoop performance has important technological implications such as late jobs, missed SLAs, overbuilt clusters, and underutilized hardware.

Workarounds don’t work

Organizations without QoS for Hadoop inevitably face problems with resource contention, missed deadlines, and sluggish cluster performance — and ultimately wind up spending large amounts of time trying to identify rogue users and jobs, as well as which hardware resources are being most heavily used. Once the cluster has hit a performance wall, Hadoop admins are pressured to find a resolution. Yet current best practices often fall short. Buying additional nodes when hardware utilization is well below 100% is an expensive effort that can alleviate performance symptoms — but doesn’t address the fundamental performance limitations of Hadoop, and so is only a temporary stop gap. Cluster isolation, similarly, is expensive and doubles environment complexity — and it isn’t a viable solution if you want to protect more than two workloads. Finally, tuning can’t solve performance problems because, by definition, tuning is always a response to problems that have already occurred (it’s like trying to drive a car by only looking in the rear-view mirror), and also because there is no way a human can make the thousands of decisions a second necessary to tweak settings to adjust to the ever-changing conditions of a cluster. Ultimately, these workarounds — overprovisioning, siloing, and tuning — that businesses currently rely on don’t work long term, are very expensive, and ultimately create unnecessary overhead.

Mike Matchett, senior analyst at Taneja Group, notes: “The real trick is performance management, the key to which is knowing who’s doing what, and when. At a minimum, there are standard tools that can generate reports out of the (often prodigious) log files collected across a cluster. But this approach gets harder as log files grow. And when it comes to operational performance, what you really need is to optimize QoS and runtimes for mixed-tenant and mixed-workload environments.”

Many companies are currently taking a bunch of workloads and throwing them at a bundle of servers, hoping for the best, but soon they discover that workarounds — in reality — are just more work. We already have complicated frameworks like YARN (Hadoop 2) placing performance pressure on systems, and if you look into the future at new compute platforms like Mesos, OpenStack and Docker, they will all run into this same class of performance problem eventually. Enterprises need to quickly get ahead of this broadly applicable problem if they want to make the most out of their distributed computing investments. By implementing a system that prioritizes QoS for Hadoop, organizations can be more productive overall and focus on things that matter for the business, rather than wasting time and money trying to troubleshoot and tune cluster performance to no avail.

--Sean Suchter is cofounder and CEO of Pepperdata

About the author: George Leopold

George Leopold has written about science and technology for more than 30 years, focusing on electronics and aerospace technology. He previously served as executive editor of Electronic Engineering Times. Leopold is the author of "Calculated Risk: The Supersonic Life and Times of Gus Grissom" (Purdue University Press, 2016).