IBM Goes All-In With Apache Spark 

IBM has jumped on the Apache Spark bandwagon, revealing it would throw its considerable weight behind the open source big data project that has been gaining momentum over the last year.

IBM said Monday (June 15) it would integrate Spark software into the "core" of its analytics and commerce platforms while offering Apache Spark as a service on its Bluemix cloud application development platform.

Along with advancing Spark's machine learning capabilities through collaboration with Databricks, the company behind the versatile in-memory analytics framework, IBM also said it would open a Spark Technology Center in San Francisco while committing more than 3,500 developers and researchers to focus on Spark-related projects.

Backing for Apache Spark also includes the donation of IBM's SystemML machine learning technology to the Spark open source project. IBM also said it would leverage current partnerships to train as many as 1 million data scientists and engineers on Apache Spark.

It also plans to host Spark applications on its Power and Z Systems infrastructure.

IBM's full-throated endorsement of Apache Spark reflects the growing momentum of what has emerged as one of Hadoop's most popular open-source projects. Last fall, Hortonworks outlined a similar investment in Spark aimed at moving the platform to the enterprise.

In a statement, IBM said it is "fully committed to Spark as a foundational technology platform for accelerating innovation and driving analytics across every business in a fundamental way."

Developed by AMPLab ("Algorithms, Machines and People") at the University of California at Berkeley in 2009, Spark was open-sourced in 2010 and donated to the Apache Software Foundation in 2013, the same year its creators founded the startup Databricks. It is described as a general-purpose data processing engine packaged to handle SQL queries and advanced analytics like machine learning. The cluster-computing framework with in-memory processing quickly gained traction in the analytics market, with hyper-scale deployments by Internet giants like Yahoo and Baidu.

Spark's creators said their intent was to forge a new generation of analytics tools to derive insights from heterogeneous data by combining machine learning, hyper-scale computing and "human computation."

IBM said its data scientists would begin working over the next few months with the Apache Spark open-source community to advance machine-learning capabilities. The initial goal is development of "smart business apps," the company said.

As part of its plan to integrate Spark into its analytics and commerce platforms, IBM said it would begin offering a beta version of its "Spark-as-a-Service" on its Bluemix cloud platform.

In a blog post, Fred Reiss of IBM's Spark Technology Center said several hundred data scientists, developers and designers would begin working at the San Francisco center over the next several months. The center was formed to speed IBM's adoption of new Spark technologies. For example, it integrated an earlier version of Spark (version 1.3.1) into IBM's Open Platform for Apache Hadoop.

IBM said developers have been steadily reducing Spark's backlog of bug fixes while working to improve its performance. Reiss said the next step would be contributing new features and components to Apache Spark, with special emphasis on machine learning as the company shifts its technology to the open-source community.

It also expects to begin demonstrating business applications based on Spark in the coming weeks.

The company said more than 300 IBM engineers are already working on Hadoop and Spark open source development efforts.

The Internet of Things is a likely target for Spark-based applications. IBM said it has already built an IoT application for urban traffic planning. The application uses Spark to process cellular data and then visualizes the analytics in real time.

IBM said it has also launched genomics and disaster relief efforts based on Apache Spark.

About the author: George Leopold

George Leopold has written about science and technology for more than 30 years, focusing on electronics and aerospace technology. He previously served as executive editor of Electronic Engineering Times. Leopold is the author of "Calculated Risk: The Supersonic Life and Times of Gus Grissom" (Purdue University Press, 2016).