How To Have An Environment With Spark in Less Than 30 Seconds Thanks To Docker

Edgar Pérez Sampedro Data Processing

You have probably wanted to experiment with some technology at one point or another, only to see your eagerness to learn dampened by the difficulty of installing everything it needs on your operating system. In this article we explain how to start playing with Spark regardless of your operating system, without having to install it or get it running by hand.

Quick start

If you are curious about Spark or need to use it, all you need is Docker installed on your computer; to install it, you can follow the official documentation. If you already have it installed, just follow these five simple steps:

  1. Download our .zip project from our repository.
  2. Unzip the project, open a terminal/CMD, and move to the directory where our docker-compose.yml is located.
  3. Execute the following command: $ sudo docker-compose up
  4. Copy the URL that appears in the terminal/CMD log and paste it into your browser (preferably Chrome or Firefox).
  5. Your environment is now ready to run PySpark. We have left some sample code in the notebook directory for a test run.
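
Once the notebook is open, a quick check along the following lines can confirm that PySpark really runs inside the container. This is only a minimal sketch, not the sample shipped in the notebook directory: the function name smoke_test and the application name are made up for illustration, and it assumes the image exposes the pyspark package to the notebook kernel.

```python
import importlib.util


def smoke_test():
    """Run a tiny local Spark job and report what worked."""
    report = {"pyspark_installed": False, "job_ok": None}

    # Check that pyspark is importable before touching Spark itself.
    if importlib.util.find_spec("pyspark") is None:
        return report
    report["pyspark_installed"] = True

    try:
        from pyspark.sql import SparkSession

        # local[*] runs Spark inside this process using all available cores.
        spark = (
            SparkSession.builder
            .master("local[*]")
            .appName("smoke-test")  # hypothetical application name
            .getOrCreate()
        )
        try:
            # Sum the integers 0..99 on the Spark side; the answer is 4950.
            total = spark.sparkContext.parallelize(range(100)).sum()
            report["job_ok"] = (total == 4950)
        finally:
            spark.stop()
    except Exception:
        # Any failure (for example a missing Java runtime) is reported, not raised.
        report["job_ok"] = False
    return report


if __name__ == "__main__":
    print(smoke_test())
```

Pasted into a notebook cell, it prints a small report. The job_ok entry is True only if the local Spark job returned the expected sum, and stays None when pyspark could not be found at all.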

How to create your own custom Docker image

If you are curious about how the Docker image was built, want to modify it to include the Anaconda kernels of your choice, or already use Docker and want to reuse part of it in your own image, this section describes how it was put together.

Jupyter-debian image

These are fragments taken from the Dockerfile, which describe the different pieces that make up the environment. The image contains Jupyter Notebook configured as an IPython interface for running code, Anaconda to manage the Python libraries to be used, Spark 2.x, and all the environment variables already configured.
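
To see whether the environment variables are really in place, something like the following could be run in a notebook cell. It is a sketch under assumptions: SPARK_HOME and PYSPARK_PYTHON are variables that Spark recognises, but the article does not list which ones this image sets, so treat the list as a guess to adapt. The helper name missing_variables is hypothetical.

```python
import os


def missing_variables(names, environ=None):
    """Return the names from `names` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Assumed variable names: Spark itself recognises these, but whether this
# particular image sets them is not confirmed by the article.
EXPECTED = ["SPARK_HOME", "PYSPARK_PYTHON"]

if __name__ == "__main__":
    missing = missing_variables(EXPECTED)
    print("missing:", missing if missing else "none")
```

An empty result means every listed variable is set in the kernel's environment; any name that is reported points at something the Dockerfile would have to export.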



With this you have a Spark environment ready for launching your tests, without having to agonize over the configuration or follow a thousand tutorials to get something done with Spark. Thanks to Docker, we can skip the tedious configuration work and try out new technologies easily, without having to install all their dependencies.


If you want to know more about Docker and its usefulness, you can visit Docker's official website.