TL;DR Heads-up: This post isn't an introduction to Spark itself.
This write-up walks you through the setup of a local Docker-based Spark cluster.
In the first section we describe how to set up a standalone Spark cluster with Docker Compose.
We will prepare a docker-compose.yml file and, in the end, fire up the cluster with a single command:
$ docker-compose up
To test the setup we will connect to the running cluster with the Spark shell (also running inside a Docker container).
Let's get going - Hello Spark!
Apache Spark™ is a fast and general engine for large-scale data processing.
This post covers the setup of a standalone Spark cluster. In particular, our initial setup doesn't include a Hadoop cluster. In future posts we will probably dive into integrating with other Big Data technologies.
Setting up a standalone Spark Cluster with Docker Compose
Since there seem to be no official prepackaged Spark Docker images available, we'll need to build our own. We started from wongnai/spark-standalone and made some small changes to reduce the number of files needed.
We stick with the original idea of using the same image for both master and worker nodes.
# inspired by https://hub.docker.com/r/wongnai/spark-standalone/
FROM openjdk:8u141-slim
ARG APACHE_MIRROR_SERVER=http://apache.mirror.digionline.de
ARG SPARK_VERSION=2.1.1
ARG HADOOP_VERSION=2.7
RUN apt-get update \
&& apt-get -y install wget \
&& rm -rf /var/lib/apt/lists/*
RUN mkdir -p /opt \
&& wget -q -O - ${APACHE_MIRROR_SERVER}/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz | tar -xzf - -C /opt \
&& mv /opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} /opt/spark
ENV PATH=/opt/spark/bin:/opt/spark/sbin:$PATH
EXPOSE 6066 7077 7078 8080 8081
WORKDIR /opt/spark
ENTRYPOINT ["/opt/spark/bin/spark-class"]
CMD ["org.apache.spark.deploy.master.Master", "--ip spark-master", "--port 7077", "--webui-port 8080"]
A default CMD is provided for convenience: without additional arguments, docker run will fire up a Spark master.
Nothing special is required to build the image:
$ docker build -t sandbox/spark .
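As an optional quick check (not part of the cluster setup itself): since the default CMD passes --ip spark-master, the container needs to know that hostname, so we set it explicitly with --hostname. The lone master's Web UI should then show up on http://localhost:8080/. Stop the container again (Ctrl+C) before firing up the Compose cluster, otherwise port 8080 is already taken.
# optional sanity check: run a single master straight from the image
$ docker run --rm --hostname spark-master -p 8080:8080 sandbox/spark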
There are no surprises in the service sections of docker-compose.yml either.
It is pretty straightforward when you look at Installing Spark Standalone to a Cluster.
version: '3'
services:
  spark-master:
    image: 'sandbox/spark'
    command: ["org.apache.spark.deploy.master.Master", "--ip", "spark-master", "--port", "7077", "--webui-port", "8080"]
    ports:
      - '7077:7077'
      - '8080:8080'
  spark-worker:
    image: 'sandbox/spark'
    command: ["org.apache.spark.deploy.worker.Worker", "spark://spark-master:7077", "--webui-port", "8081"]
    depends_on:
      - spark-master
    ports:
      - '8081:8081'
networks:
  default:
    external:
      name: sandbox-cluster
There is one special thing inside this yaml file (you might have encountered it already in Apache Hadoop - Setting up a local Test): the reference to the external network sandbox-cluster.
Without this reference the cluster would live inside a default network created by Docker Compose.
In future posts we want other services to join the same network as our Spark cluster, which is why we already include this section now.
To create such a Docker network, issue the following command:
$ docker network create sandbox-cluster
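To double-check that the network exists (purely optional), list it with a name filter:
$ docker network ls --filter name=sandbox-cluster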
Both spark-master and spark-worker will join the external Docker network sandbox-cluster.
Besides the Spark port 7077, we map the --webui-ports 1:1 to the host for our testing purposes.
This is something we might reconsider when leaving the sandbox scenario.
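Before starting anything you can let Compose validate the file and print the resolved configuration (an optional step, not required for the setup):
$ docker-compose config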
Finally, fire up the cluster!
$ docker-compose -p spark-cluster up -d
Starting sparkcluster_spark-master_1 ...
Starting sparkcluster_spark-master_1 ... done
Starting sparkcluster_spark-worker_1 ...
Starting sparkcluster_spark-worker_1 ... done
Both the Spark master Web UI and worker Web UI will be available on the host.
- Spark master: http://localhost:8080/
- Spark worker: http://localhost:8081/
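In case one of the UIs doesn't come up, a quick look at the container state and the master's logs usually tells you why (output omitted here):
$ docker-compose -p spark-cluster ps
$ docker-compose -p spark-cluster logs spark-master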
Running the Spark Shell
The Spark shell is at our fingertips within the image built above.
We need to override the default Docker entrypoint spark-class with --entrypoint spark-shell and provide the coordinates of the Spark master: --master spark://spark-master:7077.
If you are interested in the Spark shell's Web UI you'll need an additional --publish 4040:4040; the UI will then be available on your host at port 4040.
We also join the external Docker network sandbox-cluster (within this network the hostname spark-master is resolvable).
$ docker run -it --rm \
    --name spark-shell \
    --entrypoint spark-shell \
    --network sandbox-cluster \
    --publish 4040:4040 \
    sandbox/spark --master spark://spark-master:7077
Note: you can pass other arguments like --help instead of --master to get more information about the available options of the Spark shell.
Either way, you should see the Spark shell connecting to the master:
Spark context available as 'sc' (master = spark://spark-master:7077, app id = app-20170927073036-0001).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.1
/_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_141)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.version
res0: String = 2.1.1
Exit
:q
The shell provides a :help command to get you started; use :q to quit/exit the shell.
With the basic setup tested, it's about time to tackle the first Hello World example. Let's run the Quick Start sample:
scala> val textFile = sc.textFile("README.md")
textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:58
scala> textFile.count()
res3: Long = 104
scala> textFile.first()
res4: String = # Apache Spark
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:60
scala> linesWithSpark.count()
res5: Long = 20
scala> textFile.filter(line => line.contains("Spark")).count()
res6: Long = 20
scala> textFile.filter(line => line.contains("Apache")).count()
res7: Long = 2
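As a small variation (just a sketch, not part of the official Quick Start), the same filtering can be done with the Dataset API via the spark session instead of the RDD-based sc; the counts should match the results above:
scala> val textDS = spark.read.textFile("README.md")
scala> textDS.filter(line => line.contains("Spark")).count()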
If you got this far you most probably have a local Docker-based standalone Spark cluster up and running.
Congratulations!
PS: In future posts we'll look into integrating the cluster with HDFS and into scaling the Spark cluster (maybe with Docker Swarm?).