[IA-4839] [DO NOT MERGE] Terra on Azure (ToA) base jupyter docker image #483

Open
wants to merge 19 commits into base: master
Changes from all commits (19 commits)
fabcdaf
original attempt at creating a leaner terra base docker image
LizBaldo Oct 11, 2023
82c8ec9
remove opt/conda jupyter path
LizBaldo Oct 12, 2023
d5dc18a
use nbclassic instead of notebook to maintain backward compatibility …
LizBaldo Oct 12, 2023
1efe5a5
do not install notebook extensions anymore - should not be necessary …
LizBaldo Oct 12, 2023
6c42c84
better jupyter creation best practice, sudo permissions, and conda in…
LizBaldo Oct 25, 2023
9863460
this seems to work well on my BEE so saving it
LizBaldo Oct 26, 2023
9075907
latest backup
LizBaldo Oct 27, 2023
9a63654
use nbclassic to stay backward compatible with js extensions
LizBaldo Nov 1, 2023
0d4096e
working image with jupyter in base and user virtual environment
LizBaldo Apr 10, 2024
b3b1a28
isolating jupyter env instead of user venv
LizBaldo Apr 10, 2024
3a2ff3f
fixing conda env name display in jupyter terminal
LizBaldo Apr 11, 2024
2f7f7b4
adding smoke test, updating gha versions, and adding a readme
LizBaldo Apr 11, 2024
cfd2af2
try not running as root but granting sudo priviledges to jupyter user
LizBaldo Apr 16, 2024
7c1da78
addressing dockerfile comments
LizBaldo Apr 18, 2024
a72f84e
change jupyter-user uid and make run-jupyter.sh executable
LizBaldo Apr 24, 2024
2be78da
specify notebook dir and change ownership to jupyter user
LizBaldo May 2, 2024
333085b
only focus on changes for the new image
LizBaldo May 29, 2024
4f6da7d
more cleanup
LizBaldo May 29, 2024
ae65b98
try to ignore the platform flag when looking for base images
LizBaldo May 29, 2024
89 changes: 89 additions & 0 deletions .github/workflows/test-terra-base-jupyter.yml
@@ -0,0 +1,89 @@
name: Test terra-base-jupyter
# Perform smoke tests on the terra-base-jupyter Docker image to have some amount of confidence that
# Python package versions are compatible.
#
# To configure the minimal auth needed for these tests to be able to read public data from Google Cloud Platform:
# Step 1: Create a service account per these instructions:
# https://github.com/google-github-actions/setup-gcloud/blob/master/setup-gcloud/README.md
# Step 2: Give the service account the following permissions within the project: BigQuery User
# Step 3: Store its key and project id as GitHub repository secrets TD_GCP_SA_KEY and GCP_PROJECT_ID.
# https://docs.github.com/en/free-pro-team@latest/actions/reference/encrypted-secrets#creating-encrypted-secrets-for-a-repository

on:
  pull_request:
    branches: [ master ]
    paths:
      - 'terra-base-jupyter/**'
      - '.github/workflows/test-terra-base-jupyter.yml'

  push:
    # Note: GitHub secrets are not passed to pull requests from forks. For community contributions from
    # regular contributors, it's a good idea for the contributor to configure the GitHub Actions to run correctly
    # in their fork as described above.
    #
    # For occasional contributors, the dev team will merge the PR fork branch to a branch in upstream named
    # test-community-contribution-<PR#> to run all the GitHub Action smoke tests.
    branches: [ 'test-community-contribution*' ]
    paths:
      - 'terra-base-jupyter/**'
      - '.github/workflows/test-terra-base-jupyter.yml'

  workflow_dispatch:
    # Allows manual triggering of the workflow on a selected branch via the GitHub Actions tab.
    # GitHub blog demo: https://github.blog/changelog/2020-07-06-github-actions-manual-triggers-with-workflow_dispatch/.

env:
  GOOGLE_PROJECT: ${{ secrets.GCP_PROJECT_ID }}

jobs:

  test_docker_image:
    runs-on: self-hosted

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Free up some disk space
        run: sudo rm -rf /usr/share/dotnet

      - id: auth
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.TD_GCP_SA_KEY }}
          create_credentials_file: true

      - name: Set up Cloud SDK
        uses: google-github-actions/[email protected]
        with:
          project_id: ${{ secrets.GCP_PROJECT_ID }}

      - name: Build Docker image and base images too, if needed
        run: |
          gcloud auth configure-docker
          ./build_smoke_test_image.sh terra-base-jupyter

      - name: Upload workflow artifacts
        uses: actions/upload-artifact@v2
        with:
          name: notebook-execution-results
          path: terra-base-jupyter/tests/*.html
          retention-days: 30

      - name: Test Python code with pytest
        run: |
          chmod a+r "${{ steps.auth.outputs.credentials_file_path }}"
          docker run \
            --env GOOGLE_PROJECT \
            --volume "${{ steps.auth.outputs.credentials_file_path }}":/tmp/credentials.json:ro \
            --env GOOGLE_APPLICATION_CREDENTIALS="/tmp/credentials.json" \
            --volume $GITHUB_WORKSPACE/terra-base-jupyter/tests:/tests \
            --workdir=/tests \
            --entrypoint="" \
            terra-base-jupyter:smoke-test \
            /bin/sh -c "pip3 install pytest; pytest"
Comment on lines +81 to +89
Contributor

Would it be possible to either drop this into a scripts/run-tests script (or something like that), and/or use docker compose in combination, to set this up for the next person working on this?

Collaborator (Author)

Yep, I would love to brainstorm that with you, because it might fall slightly outside of this PR's scope. But the question here is: how can we easily test these docker images? I am not convinced that the current GHAs we have are the way to go.
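As a starting point for that discussion, here is a minimal sketch of what such a wrapper could look like, assuming a hypothetical scripts/run_smoke_tests.sh and the terra-base-jupyter:smoke-test tag produced by build_smoke_test_image.sh; the script name, arguments, and default credential path are illustrative, not part of this PR:

#!/usr/bin/env bash
# Hypothetical scripts/run_smoke_tests.sh: wraps the docker run / pytest step from the
# workflow above so the same smoke test can be run locally or from CI.
set -o errexit -o nounset -o pipefail

IMAGE="${1:-terra-base-jupyter:smoke-test}"
CREDS_FILE="${2:-$HOME/.config/gcloud/application_default_credentials.json}"
TESTS_DIR="$(pwd)/terra-base-jupyter/tests"

docker run \
  --env GOOGLE_PROJECT \
  --env GOOGLE_APPLICATION_CREDENTIALS=/tmp/credentials.json \
  --volume "${CREDS_FILE}":/tmp/credentials.json:ro \
  --volume "${TESTS_DIR}":/tests \
  --workdir=/tests \
  --entrypoint="" \
  "${IMAGE}" \
  /bin/sh -c "pip3 install pytest; pytest"

The workflow step would then shrink to checking out the repo and calling the script, and a docker compose file could wrap the same volumes and environment variables if that turns out to be nicer to maintain.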

3 changes: 2 additions & 1 deletion .gitignore
@@ -17,4 +17,5 @@ package-lock.json
.python_history
.keras/
.ammonite/
.metals/
.metals/
.venv/
2 changes: 1 addition & 1 deletion build_smoke_test_image.sh
@@ -15,7 +15,7 @@ set -o xtrace
build_smoke_test_image() {
local IMAGE_TYPE=$1
pushd ${IMAGE_TYPE}
local BASE_IMAGES=$( egrep '^FROM (\S+)' Dockerfile |tr -s ' ' | cut -d ' ' -f 2 )
local BASE_IMAGES=$( egrep '^FROM (\S+)' Dockerfile | sed 's/--platform.*//' |tr -s ' ' | cut -d ' ' -f 2 )

local BASE_IMAGE
for BASE_IMAGE in ${BASE_IMAGES}; do
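To make the intent of the sed addition concrete, here is roughly how the extended pipeline treats the two shapes of FROM lines (an illustrative trace; the sample image names are made up):

# A plain FROM line still yields its base image name:
echo 'FROM some-local-base:latest' \
  | sed 's/--platform.*//' | tr -s ' ' | cut -d ' ' -f 2
# -> some-local-base:latest

# A FROM line that pins a platform is trimmed down to 'FROM ', so it contributes no
# candidate, and the external platform-pinned base image is simply not treated as a
# locally buildable image:
echo 'FROM --platform=linux/amd64 nvidia/cuda:12.2.0-base-ubuntu22.04' \
  | sed 's/--platform.*//' | tr -s ' ' | cut -d ' ' -f 2
# -> (empty)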
198 changes: 198 additions & 0 deletions terra-base-jupyter/Dockerfile
@@ -0,0 +1,198 @@
# Smallest image with Ubuntu Jammy, CUDA, and NVIDIA drivers installed - 80 MB
FROM --platform=linux/amd64 nvidia/cuda:12.2.0-base-ubuntu22.04

# Use bash as the shell, like the jupyter terminal (just nicer to work with than sh)
ENV SHELL /usr/bin/bash
SHELL ["/usr/bin/bash", "-c"]

#######################
# Environment Variables
#######################
ENV DEBIAN_FRONTEND noninteractive
ENV LC_ALL en_US.UTF-8

# We need node >18 for jupyter to work
ENV NODE_MAJOR 20

# Set the python version and corresponding conda installer
ENV PYTHON_VERSION 3.10
ENV CONDA_INSTALLER https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.1-0-Linux-x86_64.sh
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


###############
# Prerequisites
###############
RUN apt-get update && apt-get install -yq --no-install-recommends \
    # basic necessities
    sudo \
    ca-certificates \
    curl \
    jq \
    tree \
    # gnupg requirement
    gnupg \
    dirmngr \
    # useful utilities for debugging within docker itself
    nano \
    less \
    procps \
    lsb-release \
    # gcc compiler
    build-essential \
    locales \
    # for ssh-agent and ssh-add
    keychain \
    # extras
    wget \
    aria2 \
    bzip2 \
    # git
    git \
    # Uncomment en_US.UTF-8 for inclusion in generation
    && sed -i 's/^# *\(en_US.UTF-8\)/\1/' /etc/locale.gen \
    # Generate locale
    && locale-gen \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Install Node >18 (needed for jupyterlab)
RUN apt-get update && apt-get install -yq --no-install-recommends
RUN mkdir -p /etc/apt/keyrings
RUN curl -fsSL https://deb.nodesource.com/gpgkey/nodesource-repo.gpg.key | gpg --dearmor -o /etc/apt/keyrings/nodesource.gpg

RUN echo "deb [signed-by=/etc/apt/keyrings/nodesource.gpg] https://deb.nodesource.com/node_$NODE_MAJOR.x nodistro main" | tee /etc/apt/sources.list.d/nodesource.list
RUN apt-get update && apt-get install -f -yq nodejs

#############
# Users Setup
#############
# Create the welder user
# The welder uid is consistent with the Welder docker definition here:
# https://github.com/DataBiosphere/welder/blob/master/project/Settings.scala
# Adding welder-user to the Jupyter container isn't strictly required, but it makes welder-added
# files display nicer when viewed in a terminal.
ENV WELDER_USER welder-user
ENV WELDER_UID 1001
RUN useradd -m -N -u $WELDER_UID $WELDER_USER

# Create the jupyter user
ENV JUPYTER_USER jupyter-user
ENV JUPYTER_UID 1002
# Create the jupyter user home
ENV JUPYTER_USER_HOME /home/$JUPYTER_USER
RUN useradd -m -d $JUPYTER_USER_HOME -N -u $JUPYTER_UID -g users $JUPYTER_USER
# We want to grant the jupyter user passwordless sudo permissions
# so they can install whatever packages they need inside the docker container
RUN echo "$JUPYTER_USER ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers.d/$JUPYTER_USER \
&& chmod 0440 /etc/sudoers.d/$JUPYTER_USER

#####################################
# Install Python via Miniconda
#####################################
## Note: CONDA should NOT be used by terra devs to manage dependencies (see the use of poetry below instead),
## but is a widely used tool to manage python environments in a runtime and we should provide it to users
## We want to store the user conda environments in a directory
## that will be in the persistent disk
## Attention: If you change the Conda home location, please update conda_init.txt accordingly
ENV CONDA_ENV_NAME base-python${PYTHON_VERSION}
ENV CONDA_ENV_HOME $JUPYTER_USER_HOME/.envs/$CONDA_ENV_NAME
RUN curl -so $JUPYTER_USER_HOME/miniconda.sh ${CONDA_INSTALLER} \
&& chmod +x $JUPYTER_USER_HOME/miniconda.sh \
&& $JUPYTER_USER_HOME/miniconda.sh -b -p $CONDA_ENV_HOME \
&& rm $JUPYTER_USER_HOME/miniconda.sh
ENV PATH "${PATH}:${CONDA_ENV_HOME}/bin"

# Set up the path to the user python
ENV BASE_PYTHON_PATH $CONDA_ENV_HOME/bin/python
# Tell Python to NOT write bytecode files (aka the .pyc files)
ENV PYTHONDONTWRITEBYTECODE=true
LizBaldo marked this conversation as resolved.
Show resolved Hide resolved

###################################################
# Set up the user to use the conda base environment
###################################################
## The user should have full access to the conda base environment, and can use it directly, or
## create new conda environments on top of it. The important part is that jupyter IS NOT installed
## in the base environment, to provide isolation between the user environment and the jupyter server
## to avoid cross-contamination
COPY conda-environment.yml .


If I understand the goal here correctly, maybe using conda-pack can help further reduce the image size.
When we build Docker images that use conda to manage envs, we use conda-pack combined with multi-stage builds to reduce image sizes.
Example here:
https://github.com/broadinstitute/long-read-pipelines/blob/4b50b3857d33fd195461e5eb5c8a83d7fe6dda27/docker/lr-papermill-base/Dockerfile#L8


To add to this, it'd be good to have an understanding of how large each layer is before optimizing the sizes further.

Collaborator (Author)

That is a good callout. I will add a breakdown of how big each layer is to the README. Regarding conda-pack, I definitely get why you are using it, but since the plan is for the majority of Terra users to use this base image, I was thinking about setting up the base conda environment directly.
The use case of building a custom image on top of it is a bit of an edge case, so the base image might not be perfectly curated for your needs. Would this be a problem?
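For the layer-size breakdown mentioned here, docker history gives the per-layer sizes directly (the tag below is the smoke-test tag used elsewhere in this PR and is only an example):

docker history terra-base-jupyter:smoke-test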


No, this would not be a problem, just an optimization (which could be evil, as it might be premature). OTOH, our experience with conda-pack is that it reduces the image size further.
This is where we borrowed the lesson: https://pythonspeed.com/articles/conda-docker-image-size/

I can give it a try once the other components are steady, actually, since I'm the one mostly interested in it.

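For reference, a rough sketch of the conda-pack plus multi-stage approach discussed above, adapted from the linked pythonspeed article; the builder image, environment name, and paths are assumptions, not part of this PR:

FROM continuumio/miniconda3 AS conda-build
COPY conda-environment.yml .
RUN conda env create -f conda-environment.yml -n packed-env
# conda-pack bundles the environment into a relocatable archive
RUN conda install -c conda-forge conda-pack \
    && conda-pack -n packed-env -o /tmp/env.tar.gz \
    && mkdir /venv && tar -xzf /tmp/env.tar.gz -C /venv \
    && /venv/bin/conda-unpack

# The runtime stage carries only the unpacked environment, not conda or its package cache
FROM --platform=linux/amd64 nvidia/cuda:12.2.0-base-ubuntu22.04
COPY --from=conda-build /venv /venv
ENV PATH /venv/bin:$PATH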
RUN conda env update --prefix $CONDA_ENV_HOME --file conda-environment.yml --prune \
# Remove packages tarballs and python bytecode files from the image
&& conda clean -afy \
&& rm conda-environment.yml \
# Make sure the JUPYTER_USER is the owner of the folder where
# the base conda is installed
&& chown -R $JUPYTER_USER:users $JUPYTER_USER_HOME

# Add the user base conda environment as a jupyter kernel - this should be the default now
# This command activates the conda environment and then calls ipykernel from within
# to install it as a kernel under the same name
RUN conda run -p ${CONDA_ENV_HOME} python -m ipykernel install --name=$CONDA_ENV_NAME

# Prep the jupyter terminal with conda init, and make sure the base conda environment is
# activated and the name is displayed in the terminal prompt
COPY conda_init.txt .
RUN cat conda_init.txt >> $JUPYTER_USER_HOME/.bashrc && \
printf "\nconda activate ${CONDA_ENV_HOME}" >> $JUPYTER_USER_HOME/.bashrc && \
conda config --set env_prompt '({name})' && \
source $JUPYTER_USER_HOME/.bashrc && \
rm conda_init.txt

####################################################
# Install Jupyter in an isolated virtual environment
####################################################
## Virtualenv and POETRY are the preferred tools to create virtual environments and
## manage dependencies for Terra Devs - poetry docs: https://python-poetry.org/docs/
ENV POETRY_HOME /opt/poetry
# Append POETRY_HOME to PATH
ENV PATH "${PATH}:${POETRY_HOME}/bin"
COPY poetry.lock .
COPY pyproject.toml .

ENV JUPYTER_HOME /usr/jupytervenv
# Add the jupyter virtual environment to PATH,
# but make sure to add it at the end so that the
# Conda base python takes precedence
# (aka the ! operator in IPython shells should NOT access the jupyter virtual environment)
ENV PATH "${PATH}:${JUPYTER_HOME}/bin"

# Install Poetry, set up the virtual environment for jupyter to run and then cleanup / uninstall poetry
RUN curl -sSL https://install.python-poetry.org | POETRY_HOME=$POETRY_HOME $BASE_PYTHON_PATH \
Collaborator (Author)

One thing that I was thinking about is maybe not using the user python to install the jupyter server, but instead using the system one, to provide another layer of isolation (a rough sketch of that alternative follows the RUN instruction below).

# Create a virtual environment and activate it for poetry to use
&& $BASE_PYTHON_PATH -m venv $JUPYTER_HOME \
&& source $JUPYTER_HOME/bin/activate \
# Install python dependencies with poetry
&& poetry install --no-interaction --no-ansi --no-dev --no-cache \
# Cleanup
&& rm poetry.lock && rm pyproject.toml \
&& curl -sSL https://install.python-poetry.org | POETRY_HOME=$POETRY_HOME $BASE_PYTHON_PATH - --uninstall
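For reference, a rough sketch of the alternative raised in the comment above: build the Jupyter virtualenv from the distribution Python rather than the conda user Python, so the server and the user environment share nothing. The apt package list is an assumption, and poetry is assumed to have been bootstrapped from the same system interpreter; none of this is part of the PR:

RUN apt-get update && apt-get install -yq --no-install-recommends python3 python3-venv \
    && rm -rf /var/lib/apt/lists/* \
    # /usr/bin/python3 is 3.10 on jammy, matching PYTHON_VERSION above
    && /usr/bin/python3 -m venv $JUPYTER_HOME \
    && source $JUPYTER_HOME/bin/activate \
    && poetry install --no-interaction --no-ansi --no-dev --no-cache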

##################################
# Terra-specific Jupyter Utilities
##################################
# Ensure this matches c.ServerApp.port in 'jupyter_server_config.py'
ENV JUPYTER_PORT 8888
EXPOSE $JUPYTER_PORT

# Install the custom extensions to enable welder for file syncing
COPY custom $JUPYTER_HOME/etc/jupyter/custom
COPY custom/jupyter_delocalize.py $JUPYTER_HOME/lib/python${PYTHON_VERSION}/site-packages
COPY jupyter_server_config.py $JUPYTER_HOME/etc/jupyter

# Remove the jupyter environment from the list of available kernels so it is hidden from the user
# Note that this needs to be done in combination with setting the c.KernelSpecManager.ensure_native_kernel flag
# to False in 'jupyter_server_config.py'
RUN $JUPYTER_HOME/bin/jupyter kernelspec remove python3 -y

# Copy the script that the service deploying to Terra (e.g. leonardo) will use for docker exec
COPY run-jupyter.sh $JUPYTER_HOME/run-jupyter.sh
RUN chmod +x $JUPYTER_HOME/run-jupyter.sh

# Set up the user and working directory, which is where the persistent disk will be mounted
USER $JUPYTER_USER
WORKDIR $JUPYTER_USER_HOME/persistent_disk

# Note: this entrypoint is provided for running Jupyter independently of Leonardo.
# When Leonardo deploys this image onto a cluster, the entrypoint is overwritten to enable
# additional setup inside the container before execution. Jupyter execution occurs when the
# init-actions.sh script uses 'docker exec' to call run-jupyter.sh.
ENTRYPOINT ["/usr/jupytervenv/bin/jupyter", "lab"]
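For running the image on its own, outside Leonardo, something like the following works with the entrypoint above (a hedged example; the smoke-test tag and the host workspace directory are assumptions):

docker run --rm -it \
  --publish 8888:8888 \
  --volume "$(pwd)/workspace":/home/jupyter-user/persistent_disk \
  terra-base-jupyter:smoke-test \
  --ip=0.0.0.0 --port=8888

The trailing flags are appended to the ENTRYPOINT, so the container runs 'jupyter lab --ip=0.0.0.0 --port=8888'; if jupyter_server_config.py already pins the IP and port, they are redundant but harmless.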