In this post we dive into a big data ecosystem:
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
In particular we highlight the HDFS module of Hadoop:
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
With our strong Docker background we’ll show you how to set up and access a local HDFS cluster.
Setup
Let’s jump straight into the details: the following docker-compose.yaml
utilises the Docker images uhopper/hadoop-namenode and uhopper/hadoop-datanode.
version: '3'
services:
  hadoop-namenode:
    image: 'uhopper/hadoop-namenode'
    hostname: hadoop-namenode
    ports:
      - '8020:8020'
      - '50070:50070'
    environment:
      - CLUSTER_NAME=hadoop-sandbox
      - HDFS_CONF_dfs_replication=1
  hadoop-datanode:
    image: 'uhopper/hadoop-datanode'
    environment:
      - CORE_CONF_fs_defaultFS=hdfs://hadoop-namenode:8020
      - CLUSTER_NAME=hadoop-sandbox
      - HDFS_CONF_dfs_replication=1
    depends_on:
      - hadoop-namenode
networks:
  default:
    external:
      name: sandbox-cluster
There is one special thing inside this YAML file: the reference to the external network sandbox-cluster.
Without this reference the cluster would live inside a default network created by Docker Compose.
Since we will later join this network with our Spark cluster, we have already included this section.
To create this Docker network, issue the following command:
$ docker network create sandbox-cluster
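If you want to double-check that the network is in place, you can inspect it:
$ docker network inspect sandbox-cluster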
Now we can fire up the HDFS cluster itself with a single docker-compose command:
$ docker-compose -p hadoop-cluster up -d
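To check that both containers actually came up, you can ask Compose for the status (the -p project name has to match):
$ docker-compose -p hadoop-cluster ps
Since the compose file maps port 50070, the NameNode web UI should now also be reachable at http://localhost:50070.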
Examples
Getting in touch with HDFS from inside a container:
$ docker run -it --rm --name hdfs-shell --network sandbox-cluster -e "CORE_CONF_fs_defaultFS=hdfs://hadoop-namenode:8020" -e "CLUSTER_NAME=hadoop-sandbox" -t uhopper/hadoop:2.7.2 /bin/bash
...
#
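A good first sanity check from inside this shell is hdfs dfsadmin -report, which prints the overall capacity and the list of live datanodes; with the setup above you should see exactly one datanode reporting in:
$ hdfs dfsadmin -report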
Next we present some of the commands available in the Hadoop FileSystem Shell.
Create a directory in HDFS with mkdir
- hdfs dfs -mkdir [-p] <paths>
$ hdfs dfs -mkdir /tmp
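The next example uploads a local file named sample.dat. If you don’t have one at hand, a throwaway test file can be created right inside the container, e.g. with dd (the size is arbitrary):
$ dd if=/dev/urandom of=sample.dat bs=1M count=32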
Copy a file into HDFS with put
- hdfs dfs -put <localsrc> ... <dst>
$ hdfs dfs -put sample.dat /tmp
List directory contents with ls
- hdfs dfs -ls <args>
$ hdfs dfs -ls /tmp
Found 1 items
-rw-r--r-- 3 root supergroup 34212441 2017-10-01 21:04 /tmp/sample.dat
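To complete the round trip, get copies a file back out of HDFS and rm deletes it again:
$ hdfs dfs -get /tmp/sample.dat copy-of-sample.dat
$ hdfs dfs -rm /tmp/sample.dat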
You can also run those commands directly via:
$ docker run -it --rm --network sandbox-cluster -e "CORE_CONF_fs_defaultFS=hdfs://hadoop-namenode:8020" -e "CLUSTER_NAME=hadoop-sandbox" -t uhopper/hadoop hdfs dfs -ls /
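Note that such a one-shot container only sees its own filesystem. To upload a file from your host this way, mount the host directory into the container, e.g. (assuming sample.dat resides in your current working directory):
$ docker run -it --rm --network sandbox-cluster -v $(pwd):/data -e "CORE_CONF_fs_defaultFS=hdfs://hadoop-namenode:8020" -e "CLUSTER_NAME=hadoop-sandbox" -t uhopper/hadoop hdfs dfs -put /data/sample.dat /tmp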
Outlook
Where to go from here?
We’ll look into accessing the Hadoop HDFS cluster with Spring.
Spring for Apache Hadoop simplifies Apache Hadoop by providing a unified configuration model and easy to use APIs for using HDFS, …
For the impatient: Working with the Hadoop File System
Stay tuned…