Running Jupyter Notebook with PySpark Using Docker on Linux

Introduction

In today’s world of data processing and analysis, tools like Apache Spark and Jupyter Notebook have become essential. Apache Spark is a powerful distributed computing system that simplifies big data processing, while PySpark, its Python API, enables seamless integration with Python’s rich data analytics ecosystem. Jupyter Notebook provides an interactive web-based environment to write, debug, and run Python code, making it a favorite among data scientists and engineers.

Docker, on the other hand, is a platform that allows developers to package applications into containers—standardized units that include everything needed to run the application. This eliminates the “it works on my machine” problem and ensures consistent behavior across different environments.

In this blog, we will learn how to run a PySpark-enabled Jupyter Notebook using Docker on a Linux machine. This approach ensures an isolated and clean setup without manually installing Spark, Python, or Jupyter Notebook on your local machine.


Prerequisites

Before starting, ensure you have the following installed on your system:

  1. Docker
    • Install Docker using your Linux distribution’s package manager. For example, on Ubuntu:
    • sudo apt update
    • sudo apt install docker.io
    • Verify the installation by running the following command:
    • docker --version
    • If Docker commands fail with a permission error, see the note after this list.
  2. A Stable Internet Connection
    • Required to pull the Docker image for PySpark and Jupyter Notebook.
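
A note on permissions: on many Linux distributions, a fresh Docker install only lets the root user talk to the Docker daemon. If the commands in this guide fail with a permission error, the standard fix is to add your user to the docker group, then log out and back in for the change to take effect:

sudo usermod -aG docker $USER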

Steps to Run Jupyter Notebook with PySpark Using Docker

Step 1: Pull the Docker Image

The first step is to pull the jupyter/pyspark-notebook Docker image. This image includes Jupyter Notebook, Python, and Apache Spark pre-installed.

Run the following command in your terminal:

docker pull jupyter/pyspark-notebook

This may take a few minutes, depending on your internet speed.
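
To confirm the image was downloaded, you can list it (the tag and size you see will vary with the image version):

docker images jupyter/pyspark-notebook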

Step 2: Run the Docker Container

Once the image is downloaded, you can run the container using the following command:

docker run -p 8888:8888 -v /path/to/your/directory:/home/jovyan/work jupyter/pyspark-notebook

Explanation of the Command:

  • -p 8888:8888: Maps port 8888 of the container to port 8888 on your host machine, making the Jupyter Notebook accessible at http://localhost:8888.
  • -v /path/to/your/directory:/home/jovyan/work: Mounts your local directory (replace /path/to/your/directory with your desired path) to /home/jovyan/work in the container. This allows you to save your work persistently on your local system.

For example, if your local working directory is /home/user/projects, the command would be:

docker run -p 8888:8888 -v /home/user/projects:/home/jovyan/work jupyter/pyspark-notebook
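
Optionally, you can run the container in the background by adding the -d flag, and give it a name with --name so it is easier to manage later (pyspark-notebook below is just an example name):

docker run -d -p 8888:8888 --name pyspark-notebook -v /home/user/projects:/home/jovyan/work jupyter/pyspark-notebook

Note that in detached mode the token URL is not printed to your terminal; Step 3 below includes a tip for retrieving it.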

Step 3: Access Jupyter Notebook

After running the above command, the terminal will display a URL similar to:

http://127.0.0.1:8888/?token=<token>
  1. Copy this URL.
  2. Open a browser and paste the URL to access Jupyter Notebook.
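
If the URL has scrolled out of view, or you started the container in detached mode, you can reprint the startup log (which contains the token URL) with:

docker logs <container_id>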

Step 4: Test PySpark in Jupyter Notebook

To ensure everything is set up correctly, create a new notebook in Jupyter and use the following code to initialize a Spark session and run a sample task:

from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("PySpark Example").getOrCreate()

# Sample DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)
df.show()

If the code executes successfully, PySpark is working correctly.
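
As a quick follow-up check, you can run a simple transformation and aggregation on the same DataFrame to confirm that Spark executes jobs end to end (the age threshold of 28 is arbitrary):

from pyspark.sql.functions import avg

# Filter rows using a column expression
adults = df.filter(df.Age > 28)
adults.show()

# Aggregate: compute the average age across all rows
df.select(avg("Age").alias("Average Age")).show()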

Step 5: Stop the Docker Container

To stop the running container, press Ctrl+C in the terminal where the container is running. Alternatively, use the following commands:

  1. List running containers: docker ps
  2. Stop the container by its ID: docker stop <container_id>

Optional: Save Your Work

All notebooks and data saved in the mounted directory (e.g., /home/user/projects) will persist even after the container stops.
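
As a minimal sketch, assuming the notebook from Step 4 is still open, you can also write the sample DataFrame into the mounted directory (the people_csv folder name here is just an example):

# Write the DataFrame into the mounted directory so it survives container restarts
df.write.mode("overwrite").csv("/home/jovyan/work/people_csv", header=True)

On the host, the output will appear under /home/user/projects/people_csv.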


Troubleshooting Tips

  1. Port Conflict: If port 8888 is already in use, map a different host port (e.g., 8890) and access the notebook at http://localhost:8890 instead: docker run -p 8890:8888 -v /home/user/projects:/home/jovyan/work jupyter/pyspark-notebook
  2. Permission Issues: Ensure your user has permission to access the specified local directory. If not, change ownership or permissions using: sudo chown $USER:$USER /path/to/your/directory
  3. Access Issues: If the URL doesn’t work, ensure Docker is running and there are no firewall rules blocking the connection.

Conclusion

Using Docker to run Jupyter Notebook with PySpark on Linux is a hassle-free way to set up your environment. This method avoids the complexities of manual installation and ensures a clean, isolated environment for data processing tasks. With Docker, you can experiment with PySpark, run big data workloads, and develop interactive notebooks without worrying about dependency conflicts.
