PySpark: Local instance of Jupyter as an alternative to Google Colaboratory [Docker]

Jul 27, 2022 1 minute read
Man Standing in Front of a White Board - Photo by Christina Morillo (pexels.com)

Google Colab (Colaboratory) is a product from Google Research which allows you to write and execute Python code in the browser. It is especially well suited to machine learning, data analysis, and education. It's a great tool which gives you the opportunity to play around with the data science toolkit.

But when you are a student, or you just want to play around with some data science libraries, IMHO it's much better to do it locally, because with large datasets it can be much faster than Colab's free plan. Let's see how to configure a local environment for your PySpark notebooks using Docker and docker-compose.

Why do this?

Well, Google Colab offers a free tier, which is the subject of this article. It does not guarantee high performance, and that's the main issue when you work on large datasets. Of course, if you want to spend some money on other plans like Colab Pro or Colab Pro+, you can stop reading this tutorial here.

Let's start

Create a file docker-compose.yml with content like the one below:

version: "3.9"
services:
  jupyter:
    image: "jupyter/pyspark-notebook:9e63909e0317"
    ports:
      - "8888:8888"
    volumes:
      - jupyter:/home/jovyan  # /home/jovyan is the notebook home directory in jupyter docker-stacks images
volumes:
  jupyter:

Now you can start the container with:

docker-compose up

In the logs you should see a message with an access token, something like this:

http://127.0.0.1:8888/lab?token=9d8439....

Open that link in your browser, and voilà. Now you can work with Jupyter notebooks with persistent storage (your notebooks will be saved in the Docker volume).
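To verify that everything works, you can create a new notebook and run a minimal PySpark smoke test. The jupyter/pyspark-notebook image ships with Spark preinstalled, so a sketch like the one below should run as-is (the app name and sample data are just illustrative):

from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession; the app name is arbitrary
spark = SparkSession.builder.appName("smoke-test").getOrCreate()

# Build a tiny DataFrame just to confirm Spark can execute jobs
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45)],
    ["name", "age"],
)

df.show()
print(df.count())  # should print 2

If the DataFrame prints and the count comes back, Spark is running correctly inside the container.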

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.