Apache Hadoop - Setting up a Local Test Environment

In this post we dive into a big data ecosystem:

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

In particular we highlight the HDFS module of Hadoop:

Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

With our strong Docker background we’ll show you how to set up and access a local HDFS cluster.

Setup

Let’s jump straight into the details: the following docker-compose.yaml utilises the Docker images uhopper/hadoop-namenode and uhopper/hadoop-datanode.

version: '3'

services:
  hadoop-namenode:
    image: uhopper/hadoop-namenode
    hostname: hadoop-namenode
    ports:
      - "8020:8020"
      - "50070:50070"
    environment:
      - CLUSTER_NAME=hadoop-sandbox
      - HDFS_CONF_dfs_replication=1

  hadoop-datanode:
    image: uhopper/hadoop-datanode
    environment:
      - CORE_CONF_fs_defaultFS=hdfs://hadoop-namenode:8020
      - CLUSTER_NAME=hadoop-sandbox
      - HDFS_CONF_dfs_replication=1
    depends_on:
      - hadoop-namenode

networks:
  default:
    external:
      name: sandbox-cluster

There is one special thing inside this YAML file: the reference to the external network sandbox-cluster. Without it, Docker Compose would create a default network just for this project. Since we will later join our Spark cluster to the same network, we have included this section right away.

To create this Docker network, issue the following command:

$ docker network create sandbox-cluster
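
To verify that the network exists, and later on to see which containers have joined it, the standard Docker commands are enough:

$ docker network ls | grep sandbox-cluster
$ docker network inspect sandbox-cluster

Once the HDFS containers (and later the Spark containers) are running, docker network inspect will list all of them as members of this network.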

Now we can fire up the HDFS cluster itself with a single docker-compose command:

$ docker-compose -p hadoop-cluster up -d
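
Before moving on, it is worth checking that both services actually came up (the exact output depends on your Docker Compose version):

$ docker-compose -p hadoop-cluster ps

Both hadoop-namenode and hadoop-datanode should be reported as Up. Since the compose file publishes port 50070, the NameNode web UI should also be reachable at http://localhost:50070.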

Examples

Getting in touch with HDFS from inside a container:

$ docker run -it --rm --name hdfs-shell --network sandbox-cluster -e "CORE_CONF_fs_defaultFS=hdfs://hadoop-namenode:8020" -e "CLUSTER_NAME=hadoop-sandbox" -t uhopper/hadoop:2.7.2 /bin/bash
...
# 
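
From inside this shell a quick health check is useful before touching any files: hdfs dfsadmin -report prints the configured capacity and the list of live DataNodes. With the compose file above it should report exactly one live DataNode.

# hdfs dfsadmin -report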

Next, we present some of the commands available in the Hadoop FileSystem Shell.

Create a directory in HDFS with mkdir - hdfs dfs -mkdir [-p] <paths>

$ hdfs dfs -mkdir /tmp
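
The optional -p flag behaves like its Unix counterpart and creates missing parent directories in one go. The nested path below is just an illustration:

$ hdfs dfs -mkdir -p /tmp/data/input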

Copy a local file into HDFS with put - hdfs dfs -put <localsrc> ... <dst>

$ hdfs dfs -put sample.dat /tmp

List directory contents with ls - hdfs dfs -ls <args>

$ hdfs dfs -ls /tmp
Found 1 items
-rw-r--r--   3 root supergroup  34212441 2017-10-01 21:04 /tmp/sample.dat
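
To convince yourself that the data really made it into HDFS, you can check its size with du or copy it back to the local file system with get (the local target name below is arbitrary):

$ hdfs dfs -du -h /tmp
$ hdfs dfs -get /tmp/sample.dat ./sample-copy.dat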

You can also run these commands directly, without an interactive shell:

$ docker run -it --rm --network sandbox-cluster -e "CORE_CONF_fs_defaultFS=hdfs://hadoop-namenode:8020" -e "CLUSTER_NAME=hadoop-sandbox" -t uhopper/hadoop hdfs dfs -ls /
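
When you are done experimenting, the sandbox can be torn down just as quickly. Note that the compose file above does not mount any volumes, so the HDFS data is discarded together with the containers:

$ docker-compose -p hadoop-cluster down
$ docker network rm sandbox-cluster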

Outlook

Where to go from here?

We’ll look into accessing the Hadoop HDFS cluster with Spring.

Spring for Apache Hadoop simplifies Apache Hadoop by providing a unified configuration model and easy-to-use APIs for using HDFS, …

For the impatient: Working with the Hadoop File System

Stay tuned…