no import change notebooks updates #830

Merged
20 changes: 17 additions & 3 deletions notebooks/README.md
@@ -34,13 +34,27 @@ To run notebooks using Spark local mode on a server with one or more NVIDIA GPUs
8. **OPTIONAL**: If you have multiple GPUs in your server, replace the `CUDA_VISIBLE_DEVICES` setting in step 4 with a comma-separated list of the corresponding indices. For example, for two GPUs use `CUDA_VISIBLE_DEVICES=0,1`.

## No import change
In these notebooks, the GPU accelerated implementations of algorithms in Spark MLlib are enabled via import statements from the `spark_rapids_ml` package. Alternatively, acceleration can also be enabled by executing the following import statement at the start of a notebook:
In the default notebooks, the GPU accelerated implementations of algorithms in Spark MLlib are enabled via import statements from the `spark_rapids_ml` package.

Alternatively, acceleration can also be enabled by executing the following import statement at the start of a notebook:
```
import spark_rapids_ml.install
```
After executing a cell with this command, all subsequent imports and accesses of supported accelerated classes from `pyspark.ml` will automatically redirect and return their counterparts in `spark_rapids_ml`. Unaccelerated classes will import from `pyspark.ml` as usual. Thus, with the above single import statement, all supported acceleration in an existing `pyspark` notebook is enabled with no additional import statement or code changes. Directly importing from `spark_rapids_ml` also still works (needed for non-MLlib algorithms like UMAP).
or by modifying the PySpark/Jupyter launch command above to use the `pyspark-rapids` CLI installed by our `pip` package to start Jupyter with PySpark, as follows:
```bash
cd spark-rapids-ml/notebooks

PYSPARK_DRIVER_PYTHON=jupyter \
PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=0.0.0.0' \
CUDA_VISIBLE_DEVICES=0 \
pyspark-rapids --master local[12] \
--driver-memory 128g \
--conf spark.sql.execution.arrow.pyspark.enabled=true
```

After executing either of the above, all subsequent imports and accesses of supported accelerated classes from `pyspark.ml` will automatically redirect and return their counterparts in `spark_rapids_ml`. Unaccelerated classes will import from `pyspark.ml` as usual. Thus, all supported acceleration in an existing `pyspark` notebook is enabled with no additional import statement or code changes. Directly importing from `spark_rapids_ml` also still works (needed for non-MLlib algorithms like UMAP).
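As a quick illustration (a sketch, not part of the notebooks themselves), a cell like the following can be used to confirm the redirect is active; it assumes a notebook with a running Spark session and that the accelerated KMeans and the `spark_rapids_ml.umap` module are available as in the current release:
```python
# Sketch: enable the redirect, then import from pyspark.ml as usual.
import spark_rapids_ml.install

from pyspark.ml.clustering import KMeans  # expected to resolve to the accelerated class

kmeans = KMeans(k=2)
print(type(kmeans).__module__)  # expected to print a spark_rapids_ml module name

# Direct imports from spark_rapids_ml still work, e.g. for non-MLlib algorithms:
from spark_rapids_ml.umap import UMAP
```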

For an example, see the notebook [kmeans-no-import-change.ipynb](kmeans-no-import-change.ipynb).
For an example notebook, see [kmeans-no-import-change.ipynb](kmeans-no-import-change.ipynb).

*Note*: As of this release, in this mode, the remaining unsupported methods and attributes on accelerated classes and objects will still raise exceptions.

35 changes: 11 additions & 24 deletions notebooks/aws-emr/README.md
@@ -15,25 +15,22 @@ If you already have an AWS EMR account, you can run the example notebooks on an E
export S3_BUCKET=<your_s3_bucket_name>
aws s3 mb s3://${S3_BUCKET}
```
- Create a zip file for the `spark-rapids-ml` package.
- Upload the initialization script to S3.
```
cd spark-rapids-ml/python/src
zip -r spark_rapids_ml.zip spark_rapids_ml
```
- Upload the zip file and the initialization script to S3.
```
aws s3 cp spark_rapids_ml.zip s3://${S3_BUCKET}/spark_rapids_ml.zip
cd ../../notebooks/aws-emr
aws s3 cp init-bootstrap-action.sh s3://${S3_BUCKET}/init-bootstrap-action.sh
aws s3 cp init-bootstrap-action.sh s3://${S3_BUCKET}/
```
- Print out the available subnets with the CLI, then pick a SubnetId (e.g. subnet-0744566f in AvailabilityZone us-east-2a).

```
aws ec2 describe-subnets
export SUBNET_ID=<your_SubnetId>
```

If this is your first time using EMR notebooks via EMR Studio and EMR Workspaces, we recommend creating a fresh VPC and subnets with internet access (the initialization script downloads artifacts) that meet the EMR requirements, per the EMR documentation, and then specifying one of the new subnets in the step above.

- Create a cluster with at least two single-GPU workers. You will obtain a ClusterId in the terminal. Note that three GPU nodes are requested here because EMR picks one node (either CORE or TASK) to run the JupyterLab service for notebooks and will not use that node for compute.

If you wish to also enable [no-import-change](../README.md#no-import-change) UX for the cluster, change the init script argument `Args=[--no-import-enabled,0]` to `Args=[--no-import-enabled,1]` below. The init script `init-bootstrap-action.sh` checks this argument and modifies the runtime accordingly.

```
export CLUSTER_NAME="spark_rapids_ml"
@@ -50,24 +47,14 @@ If you already have an AWS EMR account, you can run the example notebooks on an E
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.2xlarge \
InstanceGroupType=CORE,InstanceCount=3,InstanceType=g4dn.2xlarge \
--configurations file://${CUR_DIR}/init-configurations.json \
--bootstrap-actions Name='Spark Rapids ML Bootstrap action',Path=s3://${S3_BUCKET}/init-bootstrap-action.sh
--bootstrap-actions Name='Spark Rapids ML Bootstrap action',Path=s3://${S3_BUCKET}/init-bootstrap-action.sh,Args=[--no-import-enabled,0]
```
- In the [AWS EMR console](https://console.aws.amazon.com/emr/), click "Clusters", you can find the cluster id of the created cluster. Wait until all the instances have the Status turned to "Running".
- In the [AWS EMR console](https://console.aws.amazon.com/emr/), click "Workspace(Notebooks)", then create a workspace. Wait until the status becomes ready and a JupyterLab webpage will pop up.
- In the [AWS EMR console](https://console.aws.amazon.com/emr/), click "Clusters" to find the cluster id of the created cluster. Wait until the cluster reaches the "Waiting" status.
- To use notebooks on EMR, you will need an EMR Studio and an associated Workspace. If you don't already have these, in the [AWS EMR console](https://console.aws.amazon.com/emr/), under the "EMR Studio" section on the left, click the respective "Studio" and "Workspace (Notebooks)" links and follow the instructions to create them. When creating a Studio, select the `Custom` setup option to allow configuring a VPC and a Subnet; these should match the VPC and Subnet used for the cluster. Select "\*Default\*" for all security group prompts and drop-downs in the Studio and Workspace settings. See the EMR documentation for further instructions.

- Enter the created workspace. Click the "Cluster" button (usually the top second button of the left navigation bar). Attach the workspace to the newly created cluster through cluster id.
- In the "Workspace (Notebooks)" list of workspaces, select the created workspace, make sure it has the "Idle" status (select "Stop" otherwise in the "Actions" drop down) and click "Attach" to attach the newly created cluster through cluster id to the workspace.

- Use the default notebook or create/upload a new notebook. Set the notebook kernel to "PySpark".
- Use the default notebook or create/upload a new notebook. Set the notebook kernel to "PySpark". For the no-import-change UX, you can try the example [kmeans-no-import-change.ipynb](../kmeans-no-import-change.ipynb); a minimal first-cell sketch is also shown at the end of this section.

- Add the following to a new cell at the beginning of the notebook. Replace "s3://path/to/spark\_rapids\_ml.zip" with the actual s3 path.
```
%%configure -f
{
"conf":{
"spark.submit.pyFiles": "s3://path/to/spark_rapids_ml.zip"
}
}

```
- Run the notebook cells.
**Note**: these settings are for demonstration purposes only. Additional tuning may be required for optimal performance.
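As an illustration (a sketch, not from the EMR setup itself), a first notebook cell along the following lines can serve as a quick smoke test once the workspace is attached; it assumes the PySpark kernel's pre-created `spark` session and the accelerated KMeans from `spark_rapids_ml`:
```python
# Sketch of a smoke-test cell: fit an accelerated KMeans on a tiny DataFrame.
from spark_rapids_ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [
        (Vectors.dense([0.0, 0.0]),),
        (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 9.0]),),
        (Vectors.dense([10.0, 10.0]),),
    ],
    ["features"],
)

kmeans = KMeans(k=2).setFeaturesCol("features")
model = kmeans.fit(df)
model.transform(df).show()  # adds a prediction column with the assigned cluster ids
```
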
38 changes: 34 additions & 4 deletions notebooks/aws-emr/init-bootstrap-action.sh
@@ -1,5 +1,5 @@
#!/bin/bash
# Copyright (c) 2024, NVIDIA CORPORATION.
# Copyright (c) 2025, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -35,8 +35,38 @@ sudo /usr/local/bin/pip3.10 install --upgrade pip
sudo /usr/local/bin/pip3.10 install scikit-learn

# install cudf and cuml
sudo /usr/local/bin/pip3.10 install --no-cache-dir cudf-cu12 --extra-index-url=https://pypi.nvidia.com --verbose
sudo /usr/local/bin/pip3.10 install --no-cache-dir cuml-cu12 cuvs-cu12 --extra-index-url=https://pypi.nvidia.com --verbose

sudo /usr/local/bin/pip3.10 install --no-cache-dir cudf-cu12~=${RAPIDS_VERSION} \
cuml-cu12~=${RAPIDS_VERSION} \
cuvs-cu12~=${RAPIDS_VERSION} \
--extra-index-url=https://pypi.nvidia.com --verbose
sudo /usr/local/bin/pip3.10 install spark-rapids-ml
sudo /usr/local/bin/pip3.10 list

# set up no-import-change for cluster if enabled
if [[ $1 == "--no-import-enabled" && $2 == 1 ]]; then
echo "enabling no import change in cluster" 1>&2
cd /usr/lib/livy/repl_2.12-jars
sudo jar xf livy-repl_2.12*.jar fake_shell.py
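# on the line importing from __future__ in Livy's fake_shell.py, append a guarded 'import spark_rapids_ml.install'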
sudo sed -i fake_shell.py -e '/from __future__/ s/\(.*\)/\1\ntry:\n import spark_rapids_ml.install\nexcept:\n pass\n/g'
sudo jar uf livy-repl_2.12*.jar fake_shell.py
sudo rm fake_shell.py
fi

# ensure notebook comes up in python 3.10 by using a background script that waits for an
# application file to be installed before modifying.
cat <<EOF >/tmp/mod_start_kernel.sh
#!/bin/bash
set -ex
while [ ! -f /mnt/notebook-env/bin/start_kernel_as_emr_notebook.sh ]; do
echo "waiting for /mnt/notebook-env/bin/start_kernel_as_emr_notebook.sh"
sleep 10
done
echo "done waiting"
sleep 10
sudo sed -i /mnt/notebook-env/bin/start_kernel_as_emr_notebook.sh -e 's#"spark.pyspark.python": "python3"#"spark.pyspark.python": "/usr/local/bin/python3.10"#g'
sudo sed -i /mnt/notebook-env/bin/start_kernel_as_emr_notebook.sh -e 's#"spark.pyspark.virtualenv.enabled": "true"#"spark.pyspark.virtualenv.enabled": "false"#g'
exit 0
EOF
sudo bash /tmp/mod_start_kernel.sh &
exit 0

3 changes: 2 additions & 1 deletion notebooks/aws-emr/init-configurations.json
@@ -61,7 +61,6 @@
"spark.rapids.sql.explain":"ALL",
"spark.rapids.memory.gpu.reserve":"20",
"spark.rapids.sql.python.gpu.enabled":"true",
"spark.rapids.memory.pinnedPool.size":"2G",
"spark.rapids.sql.batchSizeBytes":"512m",
"spark.locality.wait":"0",
"spark.sql.execution.sortBeforeRepartition":"false",
@@ -70,6 +69,8 @@
"spark.sql.cache.serializer":"com.nvidia.spark.ParquetCachedBatchSerializer",
"spark.pyspark.python":"/usr/local/bin/python3.10",
"spark.pyspark.driver.python":"/usr/local/bin/python3.10",
"spark.pyspark.virtualenv.enabled":"false",
"spark.yarn.appMasterEnv.PYSPARK_PYTHON":"/usr/local/bin/python3.10",
"spark.dynamicAllocation.enabled":"false",
"spark.driver.memory":"20g",
"spark.rpc.message.maxSize":"512",
42 changes: 10 additions & 32 deletions notebooks/databricks/README.md
@@ -7,44 +7,20 @@ If you already have a Databricks account, you can run the example notebooks on a
export PROFILE=spark-rapids-ml
databricks configure --token --profile ${PROFILE}
```
- Create a zip file for the `spark-rapids-ml` package.
- Copy the init scripts to your *workspace* (not DBFS) (ex. workspace directory: /Users/< databricks-user-name >/init_scripts).
```bash
cd spark-rapids-ml/python/src
zip -r spark_rapids_ml.zip spark_rapids_ml
```
- Copy the zip file to DBFS, setting `SAVE_DIR` to the directory of your choice.
```bash
export SAVE_DIR="/path/to/save/artifacts"
databricks fs cp spark_rapids_ml.zip dbfs:${SAVE_DIR}/spark_rapids_ml.zip --profile ${PROFILE}
```
- Edit the [init-pip-cuda-11.8.sh](init-pip-cuda-11.8.sh) init script to set the `SPARK_RAPIDS_ML_ZIP` variable to the DBFS location used above.
```bash
cd spark-rapids-ml/notebooks/databricks
sed -i"" -e "s;/path/to/zip/file;${SAVE_DIR}/spark_rapids_ml.zip;" init-pip-cuda-11.8.sh
export WS_SAVE_DIR="/path/to/directory/in/workspace"
databricks workspace mkdirs ${WS_SAVE_DIR} --profile ${PROFILE}
databricks workspace import --format AUTO --file init-pip-cuda-11.8.sh ${WS_SAVE_DIR}/init-pip-cuda-11.8.sh --profile ${PROFILE}
```
**Note**: the `databricks` CLI requires the `dbfs:` prefix for all DBFS paths, but inside the spark nodes, DBFS will be mounted to a local `/dbfs` volume, so the path prefixes will be slightly different depending on the context.

**Note**: this init script does the following on each Spark node:
**Note**: the init script does the following on each Spark node:
- updates the CUDA runtime to 11.8 (required for Spark Rapids ML dependencies).
- downloads and installs the [Spark-Rapids](https://github.com/NVIDIA/spark-rapids) plugin for accelerating data loading and Spark SQL.
- installs various `cuXX` dependencies via pip.

- Copy the modified `init-pip-cuda-11.8.sh` init script to your *workspace* (not DBFS) (ex. workspace directory: /Users/< databricks-user-name >/init_scripts).
```bash
export WS_SAVE_DIR="/path/to/directory/in/workspace"
databricks workspace mkdirs ${WS_SAVE_DIR} --profile ${PROFILE}
```
For Mac
```bash
databricks workspace import --format AUTO --content $(base64 -i init-pip-cuda-11.8.sh) ${WS_SAVE_DIR}/init-pip-cuda-11.8.sh --profile ${PROFILE}
```
For Linux
```bash
databricks workspace import --format AUTO --content $(base64 -w 0 init-pip-cuda-11.8.sh) ${WS_SAVE_DIR}/init-pip-cuda-11.8.sh --profile ${PROFILE}
```
- if the cluster environment variable `SPARK_RAPIDS_ML_NO_IMPORT_ENABLED=1` is defined (see below), the init script also modifies a Databricks notebook kernel startup script to enable the no-import-change UX for the cluster. See [no-import-change](../README.md#no-import-change).
- Create a cluster using **Databricks 13.3 LTS ML GPU Runtime** using at least two single-gpu workers and add the following configurations to the **Advanced options**.
- **Init Scripts**
- add the workspace path to the uploaded init script, e.g. `${WS_SAVE_DIR}/init-pip-cuda-11.8.sh`.
- add the workspace path to the uploaded init script, i.e. `${WS_SAVE_DIR}/init-pip-cuda-11.8.sh` as set above (substituting the variable value manually in the form).
- **Spark**
- **Spark config**
```
@@ -74,6 +50,8 @@ If you already have a Databricks account, you can run the example notebooks on a
```
LIBCUDF_CUFILE_POLICY=OFF
NCCL_DEBUG=INFO
SPARK_RAPIDS_ML_NO_IMPORT_ENABLED=0
```
If you wish to enable [no-import-change](../README.md#no-import-change) UX for the cluster, set `SPARK_RAPIDS_ML_NO_IMPORT_ENABLED=1` instead. The init script checks this cluster environment variable and modifies the runtime accordingly.
- Start the configured cluster.
- Select your workspace and upload the desired [notebook](../) via `Import` in the drop down menu for your workspace.
- Select your workspace and upload the desired [notebook](../) via `Import` in the drop down menu for your workspace. For the no-import-change UX, you can try the example [kmeans-no-import-change.ipynb](../kmeans-no-import-change.ipynb).
18 changes: 11 additions & 7 deletions notebooks/databricks/init-pip-cuda-11.8.sh
@@ -1,5 +1,5 @@
#!/bin/bash
# Copyright (c) 2024, NVIDIA CORPORATION.
# Copyright (c) 2025, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -13,8 +13,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.

# set portion of path below after /dbfs/ to dbfs zip file location
SPARK_RAPIDS_ML_ZIP=/dbfs/path/to/zip/file
set -ex

# IMPORTANT: specify RAPIDS_VERSION fully 23.10.0 and not 23.10
# also in general, RAPIDS_VERSION (python) fields should omit any leading 0 in month/minor field (i.e. 23.8.0 and not 23.08.0)
# while SPARK_RAPIDS_VERSION (jar) should have leading 0 in month/minor (e.g. 23.08.2 and not 23.8.2)
@@ -39,12 +39,16 @@ ln -s /usr/local/cuda-11.8 /usr/local/cuda
/databricks/python/bin/pip install cudf-cu11~=${RAPIDS_VERSION} \
cuml-cu11~=${RAPIDS_VERSION} \
cuvs-cu11~=${RAPIDS_VERSION} \
pylibraft-cu11~=${RAPIDS_VERSION} \
rmm-cu11~=${RAPIDS_VERSION} \
--extra-index-url=https://pypi.nvidia.com

# install spark-rapids-ml
python_ver=`python --version | grep -oP '3\.[0-9]+'`
unzip ${SPARK_RAPIDS_ML_ZIP} -d /databricks/python3/lib/python${python_ver}/site-packages
/databricks/python/bin/pip install spark-rapids-ml

# set up no-import-change for cluster if enabled
if [[ $SPARK_RAPIDS_ML_NO_IMPORT_ENABLED == 1 ]]; then
echo "enabling no import change in cluster" 1>&2
sed -i /databricks/python_shell/dbruntime/monkey_patches.py -e '1 s/\(.*\)/import spark_rapids_ml.install\n\1/g'
fi


