Hello Spark - First Experiments with Apache Spark, the Cluster Computing System

TL;DR Heads-up: This post is not an introduction to Spark itself. Instead, it walks you through the setup of a local, Docker-based Spark cluster. In the first section we describe how to set up a Spark standalone cluster with Docker Compose. We will prepare a docker-compose.yml file and, in the end, fire up the cluster with a single command:

$ docker-compose up

To test the setup we will connect to the running cluster with the Spark Shell (running inside a Docker container, too).

Let's get going - Hello Spark!

Apache Spark™ is a fast and general engine for large-scale data processing.

This post covers the setup of a standalone Spark cluster. In particular, our initial setup does not include a Hadoop cluster. In future posts we will probably dive into integrating with other Big Data technologies.

Setting up a standalone Spark Cluster with Docker Compose

Since there seem to be no official prepackaged Spark Docker images available, we'll need to build one of our own. We started from wongnai/spark-standalone and made some small changes to reduce the number of files needed.

We stick to the original idea of using the same image for both master and worker nodes.

# inspired by https://hub.docker.com/r/wongnai/spark-standalone/
FROM openjdk:8u141-slim

ARG APACHE_MIRROR_SERVER=http://apache.mirror.digionline.de
ARG SPARK_VERSION=2.1.1
ARG HADOOP_VERSION=2.7

RUN apt-get update \
 && apt-get -y install wget \
 && rm -rf /var/lib/apt/lists/*

RUN mkdir -p /opt \
 && wget -q -O - ${APACHE_MIRROR_SERVER}/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz | tar -xzf - -C /opt \
 && mv /opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} /opt/spark

ENV PATH=/opt/spark/bin:/opt/spark/sbin:$PATH

# Ports: 6066 (master REST), 7077 (master), 7078 (worker), 8080/8081 (web UIs)
EXPOSE 6066 7077 7078 8080 8081

WORKDIR /opt/spark

ENTRYPOINT ["/opt/spark/bin/spark-class"]
CMD ["org.apache.spark.deploy.master.Master", "--ip", "spark-master", "--port", "7077", "--webui-port", "8080"]

A default CMD is provided for convenience: without additional arguments, docker run will fire up a Spark master. Note that in the exec form used here every argument has to be its own array element.

Nothing special is required to build the image:

$ docker build -t sandbox/spark .
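
If you want to give the freshly built image a quick spin before wiring up Compose, the following optional sanity checks should do. This is just a sketch and not required for the cluster setup; Ctrl-C stops the throw-away master again.

# Print the Spark version baked into the image
$ docker run --rm --entrypoint spark-submit sandbox/spark --version

# Fire up a throw-away master via the default CMD; --hostname spark-master is
# needed because the default CMD binds the master to exactly that name
$ docker run --rm -it -p 8080:8080 --hostname spark-master sandbox/spark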

No surprises in the service sections of docker-compose.yml either. It is pretty straightforward when you look at Installing Spark Standalone to a Cluster.

  version: '3'

  services:
    spark-master:
      image: 'sandbox/spark'
      command: ["org.apache.spark.deploy.master.Master", "--ip", "spark-master", "--port", "7077", "--webui-port", "8080"]
      ports:
        - '7077:7077'
        - '8080:8080'

    spark-worker:
      image: 'sandbox/spark'
      command: ["org.apache.spark.deploy.worker.Worker", "spark://spark-master:7077", "--webui-port", "8081"]
      depends_on:
        - spark-master
      ports:
        - '8081:8081'

  networks:
    default:
      external:
        name: sandbox-cluster

There is one special thing inside this YAML file (you might have encountered it already in Apache Hadoop - Setting up a local Test): the reference to the external network sandbox-cluster. Without this reference the cluster would live inside a default network created by Docker Compose. In future posts other services will join this network alongside our Spark cluster, which is why we include this section already.

To create such a Docker network issue the following command:

$ docker network create sandbox-cluster
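
In case you want to verify that the network exists (and, later on, which containers have joined it), the standard Docker commands will do:

$ docker network ls --filter name=sandbox-cluster
$ docker network inspect sandbox-cluster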

Both spark-master and spark-worker will join the external Docker network sandbox-cluster. Besides the Spark port 7077, we map the web UI ports 1:1 to the host for our testing purposes. This is something we might reconsider when leaving the sandbox scenario.

Finally, fire up the cluster!

$ docker-compose -p spark-cluster up -d
Starting sparkcluster_spark-master_1 ...
Starting sparkcluster_spark-master_1 ... done
Starting sparkcluster_spark-worker_1 ...
Starting sparkcluster_spark-worker_1 ... done
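
To check that both containers are actually up, and to peek at the master's log output, you can ask Compose itself; note that the project name has to match the one passed to up:

$ docker-compose -p spark-cluster ps
$ docker-compose -p spark-cluster logs spark-master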

Both the Spark master web UI (port 8080) and the worker web UI (port 8081) will be available on the host.
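
For a quick reachability check from the host (assuming curl is installed) something like the following will do; alternatively just open both addresses in a browser:

$ curl -s http://localhost:8080 | grep -i "<title>"
$ curl -s http://localhost:8081 | grep -i "<title>"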

Running the Spark Shell

The Spark shell is at our fingertips within the image built above. We need to override the default Docker entrypoint spark-class with --entrypoint spark-shell and provide the coordinates of the Spark master: --master spark://spark-master:7077. If you are interested in the Spark shell's web UI, you'll also need --publish 4040:4040; the UI will then be available on your host at port 4040.

We join the external Docker network sandbox-cluster created earlier (within this network the hostname spark-master is resolvable).

docker run -it --rm \
  --name spark-shell \
  --entrypoint spark-shell \
  --network sandbox-cluster \
  --publish 4040:4040 \
  sandbox/spark --master spark://spark-master:7077

Note: You can pass other arguments, like --help instead of --master, to get more information about the Spark shell's available options. Either way you should see the Spark shell connecting to the master.

Spark context available as 'sc' (master = spark://spark-master:7077, app id = app-20170927073036-0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_141)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.version
res0: String = 2.1.1

Exit
:q

The shell provides a :help command to get you started; use :q to quit the shell.

With the basic setup tested, it's about time to tackle a first Hello World example. Let's run through the Quick Start sample:

scala> val textFile = sc.textFile("README.md")
textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:58

scala> textFile.count()
res3: Long = 104

scala> textFile.first()
res4: String = # Apache Spark

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:60

scala> linesWithSpark.count()
res5: Long = 20

scala> textFile.filter(line => line.contains("Spark")).count()
res6: Long = 20

scala> textFile.filter(line => line.contains("Apache")).count()
res7: Long = 2

If you got this far, you most probably have a local, Docker-based standalone Spark cluster up and running.

Congratulations!
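
When you are done experimenting, shutting the cluster down is a one-liner as well. The project name has to match the one used for starting it; we deliberately leave the external network in place for later experiments:

$ docker-compose -p spark-cluster down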

PS: In future posts we'll look into integrating the cluster with HDFS and into scaling the Spark cluster (maybe with Docker Swarm?).