Google Colab (Colaboratory) is a product from Google Research that lets you write and execute Python code in the browser. It is especially well suited to machine learning, data analysis, and education. It's a great tool that gives you the opportunity to play around with the data scientist's toolkit.
But when you are a student, or you just want to play around with some data science libraries, IMHO it's much better to do it locally, because with large datasets a local setup can be much faster than Colab's free plan. Let's see how to configure a local environment for your PySpark notebooks using Docker and docker-compose.
Why do this?
Well, Google Colab offers a free tier, which is the subject of this article. It does not guarantee high performance, and that's the main issue when you work on large datasets. Of course, if you would rather spend some money on one of the paid plans, like Colab Pro or Colab Pro+, you can stop reading this tutorial here.
Let's start
Create a file docker-compose.yml with the content below:
version: "3.9"
services:
  jupyter:
    image: "jupyter/pyspark-notebook:9e63909e0317"
    ports:
      # Expose Jupyter's default port on localhost
      - "8888:8888"
    volumes:
      # Persist notebooks in a named volume; /home/jovyan/work is the
      # default working directory in the jupyter docker-stacks images
      - jupyter:/home/jovyan/work
volumes:
  jupyter:
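The named volume jupyter is what gives you persistence: /home/jovyan/work is the standard working directory in the jupyter docker-stacks images, so anything you save under work/ survives docker-compose down and container rebuilds. If you'd rather browse the files directly from your host, you can swap the named volume for a bind mount such as ./notebooks:/home/jovyan/work.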
And now you can start the container with:
docker-compose up
In the logs you should see a message with an access token, something like this:
http://127.0.0.1:8888/lab?token=9d8439....
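If the message scrolls by too quickly, or you started the stack in the background with docker-compose up -d, you can print it again with docker-compose logs jupyter.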
Open that URL in your browser, and voilà. Now you can work with Jupyter notebooks with persistent storage: notebooks you save under the work/ directory are kept in the jupyter Docker volume and survive container restarts.
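To confirm that Spark itself is wired up, you can run a minimal sanity check in a new notebook cell. This is just a sketch; the local[*] master and the sample rows are illustrative, not part of the image's configuration.

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session inside the notebook container;
# local[*] uses all CPU cores available to the container
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("sanity-check")
    .getOrCreate()
)

# Build a tiny DataFrame and run a trivial transformation
df = spark.createDataFrame([("alice", 34), ("bob", 28)], ["name", "age"])
df.filter(df.age > 30).show()

If the last line prints a one-row table with alice, your local PySpark environment is ready to go.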