
IBM Shifts Spark Development to its Cloud 

IBM upped its investment in Apache Spark this week via a cloud-based development platform that would expand data science and developer access to various flavors of Spark, including a machine-learning version.

IBM (NYSE: IBM) said Tuesday (June 7) it is expanding access to the data analytics development tools available on its Bluemix cloud platform, giving data scientists working in the R programming language faster access to more data along with new contributions to SparkR, SparkSQL and Apache SparkML.

IBM, which last year announced a $300 million investment in the Spark real-time data analysis platform as an emerging "analytics operating system," said its cloud-based development environment would speed data ingestion and analysis by combining resources from IBM and others on a managed Spark environment.

"We see an opportunity to significantly transform the role of the data scientist by providing access to curated data sets, open source tools and a collaborative platform to accelerate innovation," Bob Picciano, senior vice president of IBM Analytics, noted in a statement.

IBM announced last June it would integrate Spark software into the "core" of its analytics and commerce platforms while offering Apache Spark as a service on its Bluemix cloud application development platform.

Along with advancing Spark's machine learning capabilities through collaboration with Databricks, the company behind the in-memory analytics framework, IBM also said it would open a Spark Technology Center while committing more than 3,500 developers and researchers to focus on Spark-related projects.

The company said this week it has built Spark into its Watson, commerce, analytics, systems and cloud platforms, as well as its Apache Hadoop and Spark platforms. It also turned over its SystemML machine learning technology to the open source community last year as another way to boost Spark’s machine learning capabilities.

IBM is also billing its foray into Spark development as a way to push data science into the mainstream, much as it did for computer science with the introduction of the PC in the early 1980s. "With data science, the major roadblock is having access to large data sets and having the ability to work with so much data," Picciano noted.

The cloud-based development environment for Spark is positioned as a means of expanding enterprise use of the technology while making it easier to develop business and scientific applications that harness data analytics insights.

Since announcing its investment in Spark, IBM said it has begun working with a range of commercial and scientific customers to help them get a handle on huge datasets. For example, it is collaborating with NASA and the SETI Institute to analyze more than six terabytes of deep space radio signals to hunt for patterns that might identify the presence of intelligent extraterrestrial life.

SETI is using IBM analytics running on Apache Spark for an initiative designed to search for potential communications between planets that might be orbiting in double star systems. The partners said that by extracting new features from millions of observations, researchers can use machine-learning techniques to classify signals. Once signals are classified, researchers can zero in on clusters of anomalous ones for further analysis, IBM said.
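The workflow described here (reduce each observation to a few features, then single out the unusual ones) can be illustrated with a minimal sketch. It is not the SETI or IBM pipeline, which the announcement does not detail. The synthetic data, the two features (average power and peak-to-average power ratio), the function names and the z-score threshold are all assumptions made for illustration, and a simple statistical outlier rule stands in for the machine-learning classifier. Plain Python is used rather than Spark so the example runs on its own.

```python
import random
from statistics import mean, pstdev


def make_observations(count, length, spike_index, rng):
    """Build synthetic observations of Gaussian noise (hypothetical data).

    The observation at spike_index gets one strong narrow spike added,
    standing in for an unusual signal.
    """
    observations = []
    for i in range(count):
        samples = [rng.gauss(0.0, 1.0) for _ in range(length)]
        if i == spike_index:
            samples[length // 2] += 50.0
        observations.append(samples)
    return observations


def extract_features(samples):
    """Reduce one observation to (average power, peak-to-average power ratio)."""
    power = [s * s for s in samples]
    avg_power = mean(power)
    peak_ratio = max(power) / avg_power if avg_power > 0 else 0.0
    return (avg_power, peak_ratio)


def flag_anomalies(features, threshold=3.0):
    """Return indices whose peak ratio lies more than `threshold`
    standard deviations above the mean peak ratio of all observations."""
    ratios = [f[1] for f in features]
    mu = mean(ratios)
    sigma = pstdev(ratios)
    if sigma == 0:
        return []  # all observations look alike, so nothing stands out
    return [i for i, r in enumerate(ratios) if (r - mu) / sigma > threshold]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    observations = make_observations(200, 256, spike_index=42, rng=rng)
    features = [extract_features(o) for o in observations]
    print("Observations flagged for follow-up:", flag_anomalies(features))
```

In a real deployment the feature extraction step would run in parallel across the full data set, which is the kind of work Spark is designed for, and the outlier rule would be replaced by a trained classifier or clustering model.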

About the author: George Leopold

George Leopold has written about science and technology for more than 30 years, focusing on electronics and aerospace technology. He previously served as executive editor of Electronic Engineering Times. Leopold is the author of "Calculated Risk: The Supersonic Life and Times of Gus Grissom" (Purdue University Press, 2016).