Advanced Computing in the Age of AI | Saturday, December 3, 2022

IBM Goes All-In With Apache Spark 

IBM has jumped on the Apache Spark bandwagon, revealing it would throw its considerable weight behind the open source big data project that has been gaining momentum over the last year.

IBM said Monday (June 15) it would integrate Spark software into the "core" of its analytics and commerce platforms while offering Apache Spark as a service on its Bluemix cloud application development platform.

Along with advancing Spark's machine learning capabilities through collaboration with Databricks, the company behind the versatile in-memory analytics framework, IBM also said it would open a Spark Technology Center in San Francisco while committing more than 3,500 developers and researchers to focus on Spark-related projects.

Backing for Apache Spark also includes the donation of IBM's SystemML machine learning technology to the Spark open source project. IBM also said it would leverage current partnerships to train as many as 1 million data scientists and engineers on Apache Spark.

It also plans to host Spark applications on its Power and Z Systems infrastructure.

IBM's full-throated endorsement of Apache Spark reflects the growing momentum of what has emerged as one of the most popular open-source projects in the Hadoop ecosystem. Last fall, Hortonworks outlined a similar investment in Spark aimed at moving the platform to the enterprise.

In a statement, IBM said it is "fully committed to Spark as a foundational technology platform for accelerating innovation and driving analytics across every business in a fundamental way."

Developed beginning in 2009 at AMPLab ("Algorithms, Machines, People") at the University of California at Berkeley, Spark was open-sourced in 2010 and donated to the Apache Software Foundation in 2013, the same year its creators founded the startup Databricks. It is described as a general-purpose data processing engine packaged to handle SQL queries and advanced analytics like machine learning. The cluster-computing framework with in-memory processing quickly gained traction in the analytics market, with hyper-scale deployments by Internet giants like Yahoo and Baidu.

Spark's creators said their intent was to forge a new generation of analytics tools to derive insights from heterogeneous data by combining machine learning, hyper-scale computing and "human computation."

IBM said its data scientists would begin working over the next few months with the Apache Spark open-source community to advance machine-learning capabilities. The initial goal is development of "smart business apps," the company said.

As part of its plan to integrate Spark into its analytics and commerce platforms, IBM said it would begin offering a beta version of its "Spark-as-a-Service" on its Bluemix cloud platform.

In a blog post, Fred Reiss of IBM's Spark Technology Center said several hundred data scientists, developers and designers would begin working at the San Francisco center over the next several months. The center was formed to speed IBM's adoption of new Spark technologies. For example, it integrated an earlier version of Spark (version 1.3.1) into IBM's Open Platform for Apache Hadoop.

IBM said developers have been steadily reducing Spark's backlog of bug fixes while working to improve its performance. Reiss said the next step would be contributing new features and components to Apache Spark, with special emphasis on machine learning as the company shifts its technology to the open-source community.

It also expects to begin demonstrating business applications based on Spark in the coming weeks.

The company said more than 300 IBM engineers are already working on Hadoop and Spark open source development efforts.

The Internet of Things is a likely target for Spark-based applications. IBM said it has already built an IoT application for urban traffic planning. The application uses Spark to process cellular data and then visualizes the analytics in real time.

IBM said it has also launched genomics and disaster relief efforts based on Apache Spark.

About the author: George Leopold

George Leopold has written about science and technology for more than 30 years, focusing on electronics and aerospace technology. He previously served as executive editor of Electronic Engineering Times. Leopold is the author of "Calculated Risk: The Supersonic Life and Times of Gus Grissom" (Purdue University Press, 2016).

One Response to IBM Goes All-In With Apache Spark

  1. Ily Geller says:

    IBM is making a mistake: Apache Spark uses SQL, which is obsolete.

    For the past 70 years SQL (generic name for whatever IBM has done) dominated search for electronic information. It’s external to data technology, which helps to distill patterns and statistics based on queries, from outside to data, externally. SQL technology emanates from External Relations theory of Analytic Philosophy: students of Moore, Russell and Wittgenstein established IBM and everybody followed their path.
    However, there is Internal Relations theory, which is based on Bradley, Poincare and my ideas. In this theory patterns and statistics are found within structured data.
    I discovered and patented how to structure any data: Language has its own Internal parsing, indexing and statistics. For instance, there are two sentences:

    a) ‘Sam!’
    b) ‘A loud ringing of one of the bells was followed by the appearance of a smart chambermaid in the upper sleeping gallery, who, after tapping at one of the doors, and receiving a request from within, called over the balustrades -‘Sam!’.’

    Evidently, ‘Sam’ has a different importance in each sentence, given the extra information in each. This distinction is reflected in the weights of the phrases that contain ‘Sam’: the first has 1, the second 0.08; the greater weight signifies stronger emotional ‘acuteness’.
    First you need to parse sentences and paragraphs, obtaining phrases from clauses and restoring omitted words.
    Next, you calculate Internal statistics, or weights, where the weight refers to the frequency with which a phrase occurs in relation to other phrases.
    After that, the data is indexed by a common dictionary, like Webster, and annotated by subtexts.
    This is a small sample of the structured data:
    this – signify – : 333333
    both – are – once : 333333
    confusion – signify – : 333321
    speaking – done – once : 333112
    speaking – was – both : 333109
    place – is – in : 250000
    To see the validity of the technology, pick any sentence.

    Do you have a pencil?
    Do you see numbers on the right? These are weights. Internal weights.
    My technology on Internal Relations has no gaps, as the External approach based on queries has; it describes ALL data.
