Advanced Computing in the Age of AI | Tuesday, July 23, 2024

Spreading Spark Enterprise-wide 

Spark is in the spotlight. Companies with big data analytics needs are increasingly looking at the open source framework for lightning quick in-memory performance – reputedly up to 100X faster than Hadoop MapReduce. As the data tsunami rolls on and quintillions of bytes of data are generated every day, Spark is one of the answers to the daunting task of pulling insight and value out of oceanic data sets.

But it’s also often the case that business analysts and data scientists in the enterprise are so eager to get their hands on Spark that they stray off the IT reservation and set up ad hoc Spark clusters, causing resource strains, siloed data, security risks and other management challenges.

The launch of IBM’s Platform Conductor for Spark is intended to keep Spark under the big IT tent, enabling IT to manage multiple production-ready, IT-approved Spark instances across the enterprise. IBM calls it a hyperconverged, multi-tenant offering that uses Spectrum Scale (formerly GPFS) File Placement Optimizer to add the Spark environment to massive data sets.

Nick Werstiuk of IBM

“We’re delivering the ability to have a common file system across the nodes in a Spark cluster that provides both GPFS and POSIX access to the data,” Nick Werstiuk, product line executive, software defined infrastructure, IBM Systems, told EnterpriseTech. “So it gives clients the ability to move the data in and out of the Spark environment according to data life cycle management needs.”

“Users facing the challenge of running Spark in a production environment need an end-to-end, enterprise-grade management solution,” said Carl Olofson, research vice president, application development and deployment at industry watcher International Data Corp. “IBM has made a major commitment to supporting organizations’ Spark needs, and is offering IBM Platform Conductor for Spark as such a solution.”

IBM Platform Conductor for Spark is the third offering in IBM’s software-defined Platform Conductor shared infrastructure portfolio, the others being Platform LSF, for HPC design, simulation and modeling workloads; and Platform Symphony, for high-performance risk analytics. Platform Conductor sits atop the IBM Platform Computing common resource management layer that can be implemented in a distributed environment on a variety of on-prem (OpenPOWER, x86) hardware platforms or hybrid cloud infrastructure.

IBM said the product aims to achieve faster time-to-results and simplified deployments via lifecycle management capabilities, including resource scheduling, data management, monitoring, alerting, reporting and diagnostics, that allow multiple instances of Spark jobs to run on resources that would otherwise be idle.

“We see Spark as one of the critical new sets of workloads evolving out of the Big Data ecosystem, the next foundational middleware for Big Data analytics,” Werstiuk said. “Our vision is for these workloads coming together on a shared infrastructure. Here’s a set of software capabilities you can deploy on your own hardware – scale-out x86 or PowerLinux environments – and essentially start up an all-inclusive Spark infrastructure for a multi-tenant, shared service Spark capability to multiple users or data scientists or lines of business within an enterprise.”