Advanced Computing in the Age of AI | Tuesday, July 23, 2024

Lifting the Fog of Spark Adoption 


If you’re stymied by the prospect of deploying Spark for the first time, Craig Lukasic, senior solution architect at Zaloni, offers tips and insights for overcoming some of the common misperceptions about, and barriers to, successful Spark implementations. Below is an excerpt from an article that appears in full at our sister publication Datanami.

Clients are often confused about Apache Spark, and this confusion sometimes hinders its adoption. The confusion is not about the features of Spark per se, but about installing and running the big data framework.

One client was convinced that they needed MapR M5 just to use Spark, and they were confused about how Spark runs on the cluster, believing that multiple Spark jobs interacted directly with one another. To illustrate how flexible Spark deployment can be, I explained the following to the client.

First, consider two users. One has been working with “SparkSQL” via the spark-sql command line for some time; the other wants to use the latest MLlib features and submit Spark jobs to execute on the cluster.

The first user is a business analyst who is knowledgeable in SQL and wants to run sanity checks on the data. Spark is “installed” at opt/sparks/spark-1.3.1, on the edge node of a Hadoop cluster, and he is happy with the older version of Spark (version 1.3.1) because he just wants to write SQL and get results faster than Hive with MapReduce would provide. He runs the spark-sql script that resides in the Spark installation’s bin directory, which gives him a command-line interface similar to Hive’s CLI in which to run SQL queries.
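
As a rough illustration of the analyst's setup, the Python sketch below builds the command line for the spark-sql script that lives under one particular Spark installation. It only assembles the command and does not launch Spark. The absolute root /opt/sparks, the helper name spark_sql_command, and the table name events are assumptions made for the example; the -e option, which runs a single query non-interactively, is the one spark-sql shares with Hive's CLI.

```python
import posixpath
import shlex

# Assumption: the article gives the location as "opt/sparks/spark-1.3.1";
# an absolute root is used here only for illustration.
SPARK_ROOT = "/opt/sparks"


def spark_sql_command(version, query, root=SPARK_ROOT):
    """Build the argv for the spark-sql script of one specific Spark install.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    spark_home = posixpath.join(root, "spark-" + version)
    script = posixpath.join(spark_home, "bin", "spark-sql")
    # -e executes one SQL statement and exits, as with Hive's CLI.
    return [script, "-e", query]


# "events" is a made-up table name for the analyst's sanity check.
cmd = spark_sql_command("1.3.1", "SELECT COUNT(*) FROM events")
print(" ".join(shlex.quote(part) for part in cmd))
```

Running the script with no -e argument would instead open the interactive prompt the analyst uses day to day.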

The second user is a savvy data scientist looking for the newest way to get insights out of the data. This user wants the latest, greatest machine learning library in Spark version 1.6.0 to run fancy statistical models. Again, Spark is “installed” at opt/sparks/spark-1.6.0, on the edge node. The data scientist writes Scala and compiles her code into jars that are submitted to the cluster and run for some time. She submits her job to the YARN cluster using the spark-submit script…
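
The data scientist's submission might look roughly like the following Python sketch, which assembles a spark-submit command for the 1.6.0 installation. It is a minimal illustration and does not start a job. The absolute root /opt/sparks, the helper name spark_submit_command, the class name com.example.TrainModel, the jar name, and the HDFS path are all hypothetical; the --master, --deploy-mode, and --class options are standard spark-submit options.

```python
import posixpath
import shlex

# Assumption: the article gives the location as "opt/sparks/spark-1.6.0";
# an absolute root is used here only for illustration.
SPARK_ROOT = "/opt/sparks"


def spark_submit_command(version, main_class, jar, app_args=(), root=SPARK_ROOT):
    """Build the argv for spark-submit from one specific Spark install.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    spark_home = posixpath.join(root, "spark-" + version)
    script = posixpath.join(spark_home, "bin", "spark-submit")
    return [
        script,
        "--master", "yarn",          # hand the job to the YARN resource manager
        "--deploy-mode", "cluster",  # run the driver inside the cluster
        "--class", main_class,       # entry point inside the compiled jar
        jar,
        *app_args,                   # arguments passed through to the application
    ]


# Class, jar, and input path are made-up names for the example.
cmd = spark_submit_command(
    "1.6.0", "com.example.TrainModel", "train-model.jar", ["hdfs:///data/input"]
)
print(" ".join(shlex.quote(part) for part in cmd))
```

In this sketch the Spark version is selected purely by which installation directory the script path points to, so the 1.3.1 and 1.6.0 users never need to share a single Spark installation on the edge node.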

For the rest of the article, please go here