
[FEATURE] Add Jupyter Notebook container #488

Closed
jeremyprime opened this issue Aug 24, 2022 · 2 comments · Fixed by #508
Labels: docker (Management of the developer environment), enhancement (New feature or request), High Priority
jeremyprime (Collaborator)

Describe the solution you'd like

With the eventual inclusion of Jupyter Notebook examples (see #436, #478), we should provide a Jupyter Notebook container as part of our dev environment in order to run the examples.

Additional context

The difficulty will be configuring Jupyter to use our Spark container as the master instead of a local master, and ensuring all of the containers can communicate.

We may want to add the Jupyter container under a Docker Compose profile instead of always deploying it (since it is a large image).
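A minimal sketch of what such a profile-gated service could look like in docker-compose.yml (the image, service, and profile names here are illustrative assumptions, not our actual configuration):

services:
  jupyter:
    image: jupyter/pyspark-notebook   # assumed image; it is large, hence the profile
    profiles: ["jupyter"]             # started only with: docker compose --profile jupyter up
    ports:
      - "8888:8888"                   # Jupyter web UI
    depends_on:
      - spark                         # assumes the Spark master service is named "spark"

With a profile in place, the Jupyter container is skipped by a plain docker compose up and only started when the profile is explicitly requested.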

jeremyprime added the enhancement (New feature or request) and docker (Management of the developer environment) labels on Aug 24, 2022
Aryex (Collaborator) commented Aug 26, 2022

Got a hardcoded PySpark Jupyter container working with our cluster.
This was done by adding a jupyter container to our docker-compose.yml.
Additionally, we will need to sync the Spark and Python versions between our cluster and the Jupyter environment. At the time of writing, the Spark versions were already in sync, since we use the latest Spark for both containers. For Python, I had to sync the cluster's Python version by adding the following to the client Dockerfile:

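# Unpack Bitnami's packaged Python 3.10.5 so the cluster's Python matches the Jupyter environment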
RUN . /opt/bitnami/scripts/libcomponent.sh && component_unpack "python" "3.10.5-156" --checksum 0756ba4f37dc82759e718c524c543e444224b367a84da33e975553e72b64b143

This Dockerfile is then used to build the image for our Spark cluster.

From here, I was able to connect a PySpark notebook to our cluster with:

SparkSession.builder.master("spark://spark:7077")

You can verify that the connection was made by looking at the cluster's web UI.
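A slightly fuller notebook-side sketch of that connection (the app name and the throwaway job are illustrative; it assumes the compose service running the Spark master is named spark and listens on the default port 7077):

from pyspark.sql import SparkSession

# Point the notebook's driver at the standalone cluster instead of a local master.
spark = (
    SparkSession.builder
    .master("spark://spark:7077")
    .appName("jupyter-smoke-test")  # illustrative application name
    .getOrCreate()
)

# A trivial job to confirm the workers accept tasks; the application should
# also show up under "Running Applications" in the master's web UI.
print(spark.range(1000).count())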

Aryex (Collaborator) commented Aug 26, 2022

A future challenge will be to make this setup less hardcoded, in particular the Spark and Python versions.

  • Syncing Spark: this is probably easy, since both the bitnami and jupyter images provide tags by Spark version (see the sketch after this list). However, the jupyter tags only go back to Spark 3.1.1.
  • Syncing Python: this is harder, since bitnami does not tag by Python version. We may have to override both bitnami's and jupyter's Python installs to ensure they are pinned to our specified version.
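For the Spark side, a sketch of pinning both images to one Spark version via tags (the exact tag names should be checked on Docker Hub; 3.3.0 is only an example version):

services:
  spark:
    image: bitnami/spark:3.3.0                   # Bitnami tags by Spark version
  jupyter:
    image: jupyter/pyspark-notebook:spark-3.3.0  # Jupyter stacks tag by Spark version (3.1.1 and up)

Python would still have to be pinned separately, e.g. with the component_unpack approach above.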
