The reason we use Hadoop in the first place is to deal with large quantities of data.  However, you'll almost certainly have to debug your code, and uploading massive JARs to your AWS / GCP machines or running on large datasets is a pain!  So why not just install Hadoop locally and run small portions of the code?  You can definitely go that route, but it's much easier said than done; Hadoop requires quite a bit of config to get going.  

Enter Docker: user-friendly containerization.  What are containers?  There's a ton of information out there because containers are super in nowadays, but basically they're little self-contained applications that bundle all the dependencies you need for a single logical task.  They're super useful because they (mostly) get rid of the "works / compiles on my machine" problem.  Docker provides a really nice interface for developers to make use of containers easily, including dealing with storage volumes, networking, etc.  From our perspective, the container will kinda feel like its own computer, but it's best not to think of it that way.  There's plenty more reading on Docker / containerization out there if you're interested.

Installation

On macOS and Windows you must have virtualization enabled, since Docker runs a minimal Alpine Linux virtual machine behind the scenes.


Refer to your installation's instructions to get going; whichever route you take, it should provide you with access to a terminal that has docker in its PATH.  Use that terminal for everything we do below.  To test that you're in the right place, simply type 

$ docker

at your prompt.  You should get a help message.
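
If you'd like a slightly more thorough sanity check, the standard hello-world image is a quick way to confirm that Docker can actually pull and run containers:

$ docker run hello-world   # pulls a tiny test image and prints a confirmation message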

Hadoop in Docker

I've got an up-to-date Docker image running Hadoop 2.7.4 with JDK 8u144 on GitHub and DockerHub.  The GitHub readme pretty much lays it all out, but we'll go through it here as well.  Docker has a concept of "images": portable, prebuilt snapshots that containers get launched from, with certain services already set up.  For example, you can run an nginx image and have an nginx server up without any work on your part.  In our case, we'll run this Hadoop image and get a local "installation" of Hadoop up in no time.  To get going, run

$ docker pull gvacaliuc/hadoop-docker #wait for everything to download
$ docker run -h superCoolHostname -it gvacaliuc/hadoop-docker /etc/bootstrap.sh -bash

which pulls down the image and then runs the container.  The -h superCoolHostname sets the hostname of the machine, and the -it tells Docker to keep STDIN open and attach a pseudo-terminal to the container.  The bits after the image name, /etc/bootstrap.sh -bash, instruct Docker to run that script with the -bash flag.  If you take a look at the script you'll see that it sets some configuration and then starts the services we need for Hadoop.  After you run the above command, you'll get dropped into a bash shell on a kinda-machine-but-not-really that's running Hadoop.  You've got all the same stuff as on AWS!
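
Once you're inside that shell, a quick smoke test is to run one of the example MapReduce jobs that ship with Hadoop.  The paths below assume the stock Hadoop 2.7.4 layout under $HADOOP_PREFIX; adjust if your image lays things out differently:

# inside the container
$ cd $HADOOP_PREFIX
$ bin/hdfs dfs -ls /    # confirm HDFS is up and answering
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar pi 2 10    # estimate pi with 2 map tasks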


OK, so we can run Hadoop in this container-thing, but how do we get our JAR onto it?  Using Docker volumes!  Just pass -v /absolute/path/to/hostDir:/absolute/path/to/guestDir:

$ docker run -h superCoolHostname \
	-v /home/username/super/organized/path/to/folder:/path/to/guestDir \
	-it gvacaliuc/hadoop-docker /etc/bootstrap.sh -bash

Note that this assumes you're running your docker commands from a terminal that can access your file system with UNIX-style paths.
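
Once the container is up, anything in the host folder shows up under the guest path, so you can run your job straight off the mount.  The JAR and class names below are just placeholders for whatever your project builds, and $HADOOP_PREFIX/bin/hadoop assumes the stock layout of this image:

# inside the container
$ $HADOOP_PREFIX/bin/hadoop jar /path/to/guestDir/my-job.jar com.example.MyJob input output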


If you need to launch another terminal inside the container, you can opt for docker exec:

$ docker ps # get list of running containers
# sample output
CONTAINER ID        IMAGE                     COMMAND                  CREATED             STATUS              PORTS                                                                                                                  NAMES
1386084eb7c9        gvacaliuc/hadoop-docker   "/etc/bootstrap.sh -d"   2 minutes ago       Up 2 minutes        2122/tcp, 8020/tcp, 9000/tcp, 10020/tcp, 19888/tcp, 49707/tcp, 50010/tcp, 50020/tcp, 50070/tcp, 50075/tcp, 50090/tcp   kind_hawking


#	note the container ID from above
$ docker exec -it 1386084eb7c9 bash
 
#	if this container was the last one you started, and you're running Unix you can do:
$ docker exec -it $(docker ps -lq) bash
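
docker exec also works for one-off commands, not just interactive shells.  For example, you could check on HDFS from the host without opening a second shell at all (the /usr/local/hadoop path is an assumption about where the image installs Hadoop; adjust as needed):

$ docker exec $(docker ps -lq) /usr/local/hadoop/bin/hdfs dfsadmin -report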


docker-compose

Do I really need to run all these commands every time I want to run Hadoop??  Nope!  docker-compose is a simple tool that reads a YAML file (kinda like JSON except Python-y instead of JavaScript-y) specifying how to run everything.  (Linux users need to install it separately; the Mac / Windows installers bundle it with Docker.)  Here's an example that mounts the current directory to /launchdir in the container:

docker-compose.yml
version: '3'
services:
  hadoop:
    image: gvacaliuc/hadoop-docker:latest
    hostname: hadoop-master
    #   attaches the directory docker-compose was launched from
    volumes:
      - .:/launchdir
    command: /etc/bootstrap.sh -bash


It also sets the hostname to hadoop-master, starts a bash shell, and uses the latest image.  To run it:

/path/to/dir/containing/docker-compose.yml$ docker-compose run hadoop
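
One thing to watch: docker-compose run leaves the stopped container around after you exit.  Passing --rm cleans it up automatically, and docker ps -a shows anything left behind:

$ docker-compose run --rm hadoop    # remove the container once you exit the shell
$ docker ps -a                      # check for stopped containers left behind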


Benefit

The huge benefit here is speed of development.  Rather than having to upload your JAR each time you fix a bug, you can simply mount a directory that contains your project's output directory and always have the updated JAR available.  Of course, your machine isn't going to have the power of a cluster of machines on AWS EMR or GCP Dataproc, but that's not the point.  The point is to speed up development until you're pretty certain your code is working right.
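
Concretely, the edit-build-test loop looks something like this; Maven and the JAR / class names are just placeholders for your own build tool and project, and /launchdir matches the docker-compose example above:

# on the host: rebuild your JAR as usual
$ mvn package

# inside the container: the fresh JAR is already sitting under the mount
$ $HADOOP_PREFIX/bin/hadoop jar /launchdir/target/my-job.jar com.example.MyJob input output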
