Hello Spark - First Experiments with Apache Spark, the Cluster Computing System

TL;DR Heads-up: This post is not an introduction to Spark itself. Instead, it walks you through the setup of a local, Docker-based Spark cluster. In the first section we describe how to set up a Spark standalone cluster with Docker Compose. We will prepare a docker-compose.yml file and, in the end, fire up the cluster with a single command:

$ docker-compose up

To test the setup we will connect to the running cluster with the Spark Shell (running inside a Docker container, too).

Let's get going - Hello Spark!

Apache Spark™ is a fast and general engine for large-scale data processing.

This post covers the setup of a standalone Spark cluster. In particular, our initial setup does not include a Hadoop cluster. In future posts we will probably dive into integrating with other Big Data technologies.

Setting up a standalone Spark Cluster with Docker Compose

Since there seem to be no official prepackaged Spark Docker images available, we'll need to build one of our own. We started from wongnai/spark-standalone and made some small changes to reduce the number of files needed.

We stick to the original idea of using the same image for both master and worker nodes.

# inspired by https://hub.docker.com/r/wongnai/spark-standalone/
FROM openjdk:8u141-slim

ARG APACHE_MIRROR_SERVER=http://apache.mirror.digionline.de
ARG SPARK_VERSION=2.1.1
ARG HADOOP_VERSION=2.7

RUN apt-get update \
 && apt-get -y install wget \
 && rm -rf /var/lib/apt/lists/*

RUN mkdir -p /opt \
 && wget -q -O - ${APACHE_MIRROR_SERVER}/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz | tar -xzf - -C /opt \
 && mv /opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} /opt/spark

ENV PATH=/opt/spark/bin:/opt/spark/sbin:$PATH

# Ports: 6066 (master REST), 7077 (master), 7078 (worker), 8080/8081 (web UIs)
EXPOSE 6066 7077 7078 8080 8081

WORKDIR /opt/spark

ENTRYPOINT ["/opt/spark/bin/spark-class"]
CMD ["org.apache.spark.deploy.master.Master", "--ip", "spark-master", "--port", "7077", "--webui-port", "8080"]

A default CMD is provided for convenience: without additional arguments, docker run will fire up a Spark master. Note that in the exec form used here every argument has to be its own array element.

Nothing special is required to build the image:

$ docker build -t sandbox/spark .
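
If you want to give the freshly built image a quick spin before wiring up Compose, the following optional sanity checks should do. This is just a sketch and not required for the cluster setup; Ctrl-C stops the throw-away master again.

# Print the Spark version baked into the image
$ docker run --rm --entrypoint spark-submit sandbox/spark --version

# Fire up a throw-away master via the default CMD; --hostname spark-master is
# needed because the default CMD binds the master to exactly that name
$ docker run --rm -it -p 8080:8080 --hostname spark-master sandbox/spark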

No surprises in the service sections of docker-compose.yml either. It is pretty straightforward when you look at Installing Spark Standalone to a Cluster.

  version: '3'

  services:
    spark-master:
      image: 'sandbox/spark'
      command: ["org.apache.spark.deploy.master.Master", "--ip", "spark-master", "--port", "7077", "--webui-port", "8080"]
      ports:
        - '7077:7077'
        - '8080:8080'

    spark-worker:
      image: 'sandbox/spark'
      command: ["org.apache.spark.deploy.worker.Worker", "spark://spark-master:7077", "--webui-port", "8081"]
      depends_on:
        - spark-master
      ports:
        - '8081:8081'

  networks:
    default:
      external:
        name: sandbox-cluster

There is one special thing inside this YAML file (you might have encountered it already in Apache Hadoop - Setting up a local Test): the reference to the external network sandbox-cluster. Without this reference the cluster would live inside a default network created by Docker Compose. In future posts other services will join this network alongside our Spark cluster, which is why we include this section already.

To create such a Docker network issue the following command:

$ docker network create sandbox-cluster
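
In case you want to verify that the network exists (and, later on, which containers have joined it), the standard Docker commands will do:

$ docker network ls --filter name=sandbox-cluster
$ docker network inspect sandbox-cluster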

Both spark-master and spark-worker will join the external Docker network sandbox-cluster. Besides the Spark port 7077, we map the web UI ports 1:1 to the host for our testing purposes. This is something we might reconsider when leaving the sandbox scenario.

Finally, fire up the cluster!

$ docker-compose -p spark-cluster up -d
Starting sparkcluster_spark-master_1 ...
Starting sparkcluster_spark-master_1 ... done
Starting sparkcluster_spark-worker_1 ...
Starting sparkcluster_spark-worker_1 ... done
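
To check that both containers are actually up, and to peek at the master's log output, you can ask Compose itself; note that the project name has to match the one passed to up:

$ docker-compose -p spark-cluster ps
$ docker-compose -p spark-cluster logs spark-master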

Both the Spark master web UI (port 8080) and the worker web UI (port 8081) will be available on the host.
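
For a quick reachability check from the host (assuming curl is installed) something like the following will do; alternatively just open both addresses in a browser:

$ curl -s http://localhost:8080 | grep -i "<title>"
$ curl -s http://localhost:8081 | grep -i "<title>"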

Running the Spark Shell

The Spark shell is at our fingertips within the image built above. We need to override the default Docker entrypoint spark-class with --entrypoint spark-shell and provide the coordinates of the Spark master: --master spark://spark-master:7077. If you are interested in the Spark shell's web UI, you'll also need --publish 4040:4040; the UI will then be available on your host at port 4040.

We join the external Docker network sandbox-cluster created earlier (within this network the hostname spark-master is resolvable).

docker run -it --rm \
  --name spark-shell \
  --entrypoint spark-shell \
  --network sandbox-cluster \
  --publish 4040:4040 \
  sandbox/spark --master spark://spark-master:7077

Note: You can pass other arguments, like --help instead of --master, to get more information about the Spark shell's available options. Either way you should see the Spark shell connecting to the master.

Spark context available as 'sc' (master = spark://spark-master:7077, app id = app-20170927073036-0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_141)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.version
res0: String = 2.1.1

Exit
:q

The shell provides a :help command to get you started; use :q to quit the shell.

With the basic setup tested, it's about time to tackle a first Hello World example. Let's run through the Quick Start sample:

scala> val textFile = sc.textFile("README.md")
textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:58

scala> textFile.count()
res3: Long = 104

scala> textFile.first()
res4: String = # Apache Spark

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:60

scala> linesWithSpark.count()
res5: Long = 20

scala> textFile.filter(line => line.contains("Spark")).count()
res6: Long = 20

scala> textFile.filter(line => line.contains("Apache")).count()
res7: Long = 2

If you got this far, you most probably have a local, Docker-based standalone Spark cluster up and running.

Congratulations!
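
When you are done experimenting, shutting the cluster down is a one-liner as well. The project name has to match the one used for starting it; we deliberately leave the external network in place for later experiments:

$ docker-compose -p spark-cluster down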

PS: In future posts we'll look into integrating the cluster with HDFS and into scaling the Spark cluster (maybe with Docker Swarm?).