no import change notebooks updates #830

Merged
20 changes: 17 additions & 3 deletions notebooks/README.md
@@ -34,13 +34,27 @@ To run notebooks using Spark local mode on a server with one or more NVIDIA GPUs
8. **OPTIONAL**: If you have multiple GPUs in your server, replace the `CUDA_VISIBLE_DEVICES` setting in step 4 with a comma-separated list of the corresponding indices. For example, for two GPUs use `CUDA_VISIBLE_DEVICES=0,1`.

## No import change
In these notebooks, the GPU accelerated implementations of algorithms in Spark MLlib are enabled via import statements from the `spark_rapids_ml` package. Alternatively, acceleration can also be enabled by executing the following import statement at the start of a notebook:
In the default notebooks, the GPU accelerated implementations of algorithms in Spark MLlib are enabled via import statements from the `spark_rapids_ml` package.

Alternatively, acceleration can also be enabled by executing the following import statement at the start of a notebook:
```
import spark_rapids_ml.install
```
After executing a cell with this command, all subsequent imports and accesses of supported accelerated classes from `pyspark.ml` will automatically redirect and return their counterparts in `spark_rapids_ml`. Unaccelerated classes will import from `pyspark.ml` as usual. Thus, with the above single import statement, all supported acceleration in an existing `pyspark` notebook is enabled with no additional import statement or code changes. Directly importing from `spark_rapids_ml` also still works (needed for non-MLlib algorithms like UMAP).
or by modifying the PySpark/Jupyter launch command above to use the `pyspark-rapids` CLI installed by our `pip` package to start Jupyter with PySpark, as follows:
```bash
cd spark-rapids-ml/notebooks

PYSPARK_DRIVER_PYTHON=jupyter \
PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=0.0.0.0' \
CUDA_VISIBLE_DEVICES=0 \
pyspark-rapids --master local[12] \
--driver-memory 128g \
--conf spark.sql.execution.arrow.pyspark.enabled=true
```

After executing either of the above, all subsequent imports and accesses of supported accelerated classes from `pyspark.ml` will automatically redirect and return their counterparts in `spark_rapids_ml`. Unaccelerated classes will import from `pyspark.ml` as usual. Thus, all supported acceleration in an existing `pyspark` notebook is enabled with no additional import statement or code changes. Directly importing from `spark_rapids_ml` also still works (needed for non-MLlib algorithms like UMAP).
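As a quick illustration (a sketch, not part of the notebooks themselves), a cell like the following can be used to confirm the redirect is active; it assumes a notebook with a running Spark session and that the accelerated KMeans and the `spark_rapids_ml.umap` module are available as in the current release:
```python
# Sketch: enable the redirect, then import from pyspark.ml as usual.
import spark_rapids_ml.install

from pyspark.ml.clustering import KMeans  # expected to resolve to the accelerated class

kmeans = KMeans(k=2)
print(type(kmeans).__module__)  # expected to print a spark_rapids_ml module name

# Direct imports from spark_rapids_ml still work, e.g. for non-MLlib algorithms:
from spark_rapids_ml.umap import UMAP
```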

For an example, see the notebook [kmeans-no-import-change.ipynb](kmeans-no-import-change.ipynb).
For an example notebook, see [kmeans-no-import-change.ipynb](kmeans-no-import-change.ipynb).

*Note*: As of this release, in this mode, the remaining unsupported methods and attributes on accelerated classes and objects will still raise exceptions.

35 changes: 11 additions & 24 deletions notebooks/aws-emr/README.md
@@ -15,25 +15,22 @@ If you already have an AWS EMR account, you can run the example notebooks on an E
export S3_BUCKET=<your_s3_bucket_name>
aws s3 mb s3://${S3_BUCKET}
```
- Create a zip file for the `spark-rapids-ml` package.
- Upload the initialization script to S3.
```
cd spark-rapids-ml/python/src
zip -r spark_rapids_ml.zip spark_rapids_ml
```
- Upload the zip file and the initialization script to S3.
```
aws s3 cp spark_rapids_ml.zip s3://${S3_BUCKET}/spark_rapids_ml.zip
cd ../../notebooks/aws-emr
aws s3 cp init-bootstrap-action.sh s3://${S3_BUCKET}/init-bootstrap-action.sh
aws s3 cp init-bootstrap-action.sh s3://${S3_BUCKET}/
```
- Print out the available subnets with the CLI, then pick a SubnetId (e.g. subnet-0744566f in AvailabilityZone us-east-2a).

```
aws ec2 describe-subnets
export SUBNET_ID=<your_SubnetId>
```

If this is your first time using EMR notebooks via EMR Studio and EMR Workspaces, we recommend creating a fresh VPC and subnets with internet access (the initialization script downloads artifacts) that meet the EMR requirements, per the EMR documentation, and then specifying one of the new subnets in the step above.

- Create a cluster with at least two single-GPU workers. You will obtain a ClusterId in the terminal. Note that three GPU nodes are requested here because EMR picks one node (either CORE or TASK) to run the JupyterLab service for notebooks and will not use that node for compute.

If you wish to also enable [no-import-change](../README.md#no-import-change) UX for the cluster, change the init script argument `Args=[--no-import-enabled,0]` to `Args=[--no-import-enabled,1]` below. The init script `init-bootstrap-action.sh` checks this argument and modifies the runtime accordingly.

```
export CLUSTER_NAME="spark_rapids_ml"
@@ -50,24 +47,14 @@ If you already have an AWS EMR account, you can run the example notebooks on an E
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.2xlarge \
InstanceGroupType=CORE,InstanceCount=3,InstanceType=g4dn.2xlarge \
--configurations file://${CUR_DIR}/init-configurations.json \
--bootstrap-actions Name='Spark Rapids ML Bootstrap action',Path=s3://${S3_BUCKET}/init-bootstrap-action.sh
--bootstrap-actions Name='Spark Rapids ML Bootstrap action',Path=s3://${S3_BUCKET}/init-bootstrap-action.sh,Args=[--no-import-enabled,0]
```
- In the [AWS EMR console](https://console.aws.amazon.com/emr/), click "Clusters", you can find the cluster id of the created cluster. Wait until all the instances have the Status turned to "Running".
- In the [AWS EMR console](https://console.aws.amazon.com/emr/), click "Workspace(Notebooks)", then create a workspace. Wait until the status becomes ready and a JupyterLab webpage will pop up.
- In the [AWS EMR console](https://console.aws.amazon.com/emr/), click "Clusters" to find the cluster id of the created cluster. Wait until the cluster reaches the "Waiting" status.
- To use notebooks on EMR, you will need an EMR Studio and an associated Workspace. If you don't already have these, in the [AWS EMR console](https://console.aws.amazon.com/emr/), under the "EMR Studio" section on the left, click the respective "Studio" and "Workspace (Notebooks)" links and follow the instructions to create them. When creating a Studio, select the `Custom` setup option to allow configuring a VPC and a Subnet; these should match the VPC and Subnet used for the cluster. Select "\*Default\*" for all security group prompts and drop-downs in the Studio and Workspace settings. See the EMR documentation for further instructions.

- Enter the created workspace. Click the "Cluster" button (usually the top second button of the left navigation bar). Attach the workspace to the newly created cluster through cluster id.
- In the "Workspace (Notebooks)" list of workspaces, select the created workspace, make sure it has the "Idle" status (select "Stop" otherwise in the "Actions" drop down) and click "Attach" to attach the newly created cluster through cluster id to the workspace.

- Use the default notebook or create/upload a new notebook. Set the notebook kernel to "PySpark".
- Use the default notebook or create/upload a new notebook. Set the notebook kernel to "PySpark". For the no-import-change UX, you can try the example [kmeans-no-import-change.ipynb](../kmeans-no-import-change.ipynb); a minimal first-cell sketch is also shown at the end of this section.

- Add the following to a new cell at the beginning of the notebook. Replace "s3://path/to/spark\_rapids\_ml.zip" with the actual s3 path.
```
%%configure -f
{
"conf":{
"spark.submit.pyFiles": "s3://path/to/spark_rapids_ml.zip"
}
}

```
- Run the notebook cells.
**Note**: these settings are for demonstration purposes only. Additional tuning may be required for optimal performance.
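As an illustration (a sketch, not from the EMR setup itself), a first notebook cell along the following lines can serve as a quick smoke test once the workspace is attached; it assumes the PySpark kernel's pre-created `spark` session and the accelerated KMeans from `spark_rapids_ml`:
```python
# Sketch of a smoke-test cell: fit an accelerated KMeans on a tiny DataFrame.
from spark_rapids_ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [
        (Vectors.dense([0.0, 0.0]),),
        (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 9.0]),),
        (Vectors.dense([10.0, 10.0]),),
    ],
    ["features"],
)

kmeans = KMeans(k=2).setFeaturesCol("features")
model = kmeans.fit(df)
model.transform(df).show()  # adds a prediction column with the assigned cluster ids
```
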
38 changes: 34 additions & 4 deletions notebooks/aws-emr/init-bootstrap-action.sh
@@ -1,5 +1,5 @@
#!/bin/bash
# Copyright (c) 2024, NVIDIA CORPORATION.
# Copyright (c) 2025, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -35,8 +35,38 @@ sudo /usr/local/bin/pip3.10 install --upgrade pip
sudo /usr/local/bin/pip3.10 install scikit-learn

# install cudf and cuml
sudo /usr/local/bin/pip3.10 install --no-cache-dir cudf-cu12 --extra-index-url=https://pypi.nvidia.com --verbose
sudo /usr/local/bin/pip3.10 install --no-cache-dir cuml-cu12 cuvs-cu12 --extra-index-url=https://pypi.nvidia.com --verbose

sudo /usr/local/bin/pip3.10 install --no-cache-dir cudf-cu12~=${RAPIDS_VERSION} \
cuml-cu12~=${RAPIDS_VERSION} \
cuvs-cu12~=${RAPIDS_VERSION} \
--extra-index-url=https://pypi.nvidia.com --verbose
sudo /usr/local/bin/pip3.10 install spark-rapids-ml
sudo /usr/local/bin/pip3.10 list

# set up no-import-change for cluster if enabled
if [[ $1 == "--no-import-enabled" && $2 == 1 ]]; then
echo "enabling no import change in cluster" 1>&2
cd /usr/lib/livy/repl_2.12-jars
sudo jar xf livy-repl_2.12*.jar fake_shell.py
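# on the line importing from __future__ in Livy's fake_shell.py, append a guarded 'import spark_rapids_ml.install'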
sudo sed -i fake_shell.py -e '/from __future__/ s/\(.*\)/\1\ntry:\n import spark_rapids_ml.install\nexcept:\n pass\n/g'
sudo jar uf livy-repl_2.12*.jar fake_shell.py
sudo rm fake_shell.py
fi

# ensure notebook comes up in python 3.10 by using a background script that waits for an
# application file to be installed before modifying.
cat <<EOF >/tmp/mod_start_kernel.sh
#!/bin/bash
set -ex
while [ ! -f /mnt/notebook-env/bin/start_kernel_as_emr_notebook.sh ]; do
echo "waiting for /mnt/notebook-env/bin/start_kernel_as_emr_notebook.sh"
sleep 10
done
echo "done waiting"
sleep 10
sudo sed -i /mnt/notebook-env/bin/start_kernel_as_emr_notebook.sh -e 's#"spark.pyspark.python": "python3"#"spark.pyspark.python": "/usr/local/bin/python3.10"#g'
sudo sed -i /mnt/notebook-env/bin/start_kernel_as_emr_notebook.sh -e 's#"spark.pyspark.virtualenv.enabled": "true"#"spark.pyspark.virtualenv.enabled": "false"#g'
exit 0
EOF
sudo bash /tmp/mod_start_kernel.sh &
exit 0

3 changes: 2 additions & 1 deletion notebooks/aws-emr/init-configurations.json
@@ -61,7 +61,6 @@
"spark.rapids.sql.explain":"ALL",
"spark.rapids.memory.gpu.reserve":"20",
"spark.rapids.sql.python.gpu.enabled":"true",
"spark.rapids.memory.pinnedPool.size":"2G",
"spark.rapids.sql.batchSizeBytes":"512m",
"spark.locality.wait":"0",
"spark.sql.execution.sortBeforeRepartition":"false",
@@ -70,6 +69,8 @@
"spark.sql.cache.serializer":"com.nvidia.spark.ParquetCachedBatchSerializer",
"spark.pyspark.python":"/usr/local/bin/python3.10",
"spark.pyspark.driver.python":"/usr/local/bin/python3.10",
"spark.pyspark.virtualenv.enabled":"false",
"spark.yarn.appMasterEnv.PYSPARK_PYTHON":"/usr/local/bin/python3.10",
"spark.dynamicAllocation.enabled":"false",
"spark.driver.memory":"20g",
"spark.rpc.message.maxSize":"512",
42 changes: 10 additions & 32 deletions notebooks/databricks/README.md
@@ -7,44 +7,20 @@ If you already have a Databricks account, you can run the example notebooks on a
export PROFILE=spark-rapids-ml
databricks configure --token --profile ${PROFILE}
```
- Create a zip file for the `spark-rapids-ml` package.
- Copy the init scripts to your *workspace* (not DBFS) (ex. workspace directory: /Users/< databricks-user-name >/init_scripts).
```bash
cd spark-rapids-ml/python/src
zip -r spark_rapids_ml.zip spark_rapids_ml
```
- Copy the zip file to DBFS, setting `SAVE_DIR` to the directory of your choice.
```bash
export SAVE_DIR="/path/to/save/artifacts"
databricks fs cp spark_rapids_ml.zip dbfs:${SAVE_DIR}/spark_rapids_ml.zip --profile ${PROFILE}
```
- Edit the [init-pip-cuda-11.8.sh](init-pip-cuda-11.8.sh) init script to set the `SPARK_RAPIDS_ML_ZIP` variable to the DBFS location used above.
```bash
cd spark-rapids-ml/notebooks/databricks
sed -i"" -e "s;/path/to/zip/file;${SAVE_DIR}/spark_rapids_ml.zip;" init-pip-cuda-11.8.sh
export WS_SAVE_DIR="/path/to/directory/in/workspace"
databricks workspace mkdirs ${WS_SAVE_DIR} --profile ${PROFILE}
databricks workspace import --format AUTO --file init-pip-cuda-11.8.sh ${WS_SAVE_DIR}/init-pip-cuda-11.8.sh --profile ${PROFILE}
```
**Note**: the `databricks` CLI requires the `dbfs:` prefix for all DBFS paths, but inside the spark nodes, DBFS will be mounted to a local `/dbfs` volume, so the path prefixes will be slightly different depending on the context.

**Note**: this init script does the following on each Spark node:
**Note**: the init script does the following on each Spark node:
- updates the CUDA runtime to 11.8 (required for Spark Rapids ML dependencies).
- downloads and installs the [Spark-Rapids](https://github.com/NVIDIA/spark-rapids) plugin for accelerating data loading and Spark SQL.
- installs various `cuXX` dependencies via pip.

- Copy the modified `init-pip-cuda-11.8.sh` init script to your *workspace* (not DBFS) (ex. workspace directory: /Users/< databricks-user-name >/init_scripts).
```bash
export WS_SAVE_DIR="/path/to/directory/in/workspace"
databricks workspace mkdirs ${WS_SAVE_DIR} --profile ${PROFILE}
```
For Mac
```bash
databricks workspace import --format AUTO --content $(base64 -i init-pip-cuda-11.8.sh) ${WS_SAVE_DIR}/init-pip-cuda-11.8.sh --profile ${PROFILE}
```
For Linux
```bash
databricks workspace import --format AUTO --content $(base64 -w 0 init-pip-cuda-11.8.sh) ${WS_SAVE_DIR}/init-pip-cuda-11.8.sh --profile ${PROFILE}
```
- if the cluster environment variable `SPARK_RAPIDS_ML_NO_IMPORT_ENABLED=1` is defined (see below), the init script also modifies a Databricks notebook kernel startup script to enable the no-import-change UX for the cluster. See [no-import-change](../README.md#no-import-change).
- Create a cluster using **Databricks 13.3 LTS ML GPU Runtime** using at least two single-gpu workers and add the following configurations to the **Advanced options**.
- **Init Scripts**
- add the workspace path to the uploaded init script, e.g. `${WS_SAVE_DIR}/init-pip-cuda-11.8.sh`.
- add the workspace path to the uploaded init script, i.e. `${WS_SAVE_DIR}/init-pip-cuda-11.8.sh` as set above (substituting the variable value manually in the form).
- **Spark**
- **Spark config**
```
@@ -74,6 +50,8 @@ If you already have a Databricks account, you can run the example notebooks on a
```
LIBCUDF_CUFILE_POLICY=OFF
NCCL_DEBUG=INFO
SPARK_RAPIDS_ML_NO_IMPORT_ENABLED=0
```
If you wish to enable [no-import-change](../README.md#no-import-change) UX for the cluster, set `SPARK_RAPIDS_ML_NO_IMPORT_ENABLED=1` instead. The init script checks this cluster environment variable and modifies the runtime accordingly.
- Start the configured cluster.
- Select your workspace and upload the desired [notebook](../) via `Import` in the drop down menu for your workspace.
- Select your workspace and upload the desired [notebook](../) via `Import` in the drop down menu for your workspace. For the no-import-change UX, you can try the example [kmeans-no-import-change.ipynb](../kmeans-no-import-change.ipynb).
18 changes: 11 additions & 7 deletions notebooks/databricks/init-pip-cuda-11.8.sh
@@ -1,5 +1,5 @@
#!/bin/bash
# Copyright (c) 2024, NVIDIA CORPORATION.
# Copyright (c) 2025, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -13,8 +13,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.

# set portion of path below after /dbfs/ to dbfs zip file location
SPARK_RAPIDS_ML_ZIP=/dbfs/path/to/zip/file
set -ex

# IMPORTANT: specify RAPIDS_VERSION fully 23.10.0 and not 23.10
# also in general, RAPIDS_VERSION (python) fields should omit any leading 0 in month/minor field (i.e. 23.8.0 and not 23.08.0)
# while SPARK_RAPIDS_VERSION (jar) should have leading 0 in month/minor (e.g. 23.08.2 and not 23.8.2)
@@ -39,12 +39,16 @@ ln -s /usr/local/cuda-11.8 /usr/local/cuda
/databricks/python/bin/pip install cudf-cu11~=${RAPIDS_VERSION} \
cuml-cu11~=${RAPIDS_VERSION} \
cuvs-cu11~=${RAPIDS_VERSION} \
pylibraft-cu11~=${RAPIDS_VERSION} \
rmm-cu11~=${RAPIDS_VERSION} \
--extra-index-url=https://pypi.nvidia.com

# install spark-rapids-ml
python_ver=`python --version | grep -oP '3\.[0-9]+'`
unzip ${SPARK_RAPIDS_ML_ZIP} -d /databricks/python3/lib/python${python_ver}/site-packages
/databricks/python/bin/pip install spark-rapids-ml

# set up no-import-change for cluster if enabled
if [[ $SPARK_RAPIDS_ML_NO_IMPORT_ENABLED == 1 ]]; then
echo "enabling no import change in cluster" 1>&2
sed -i /databricks/python_shell/dbruntime/monkey_patches.py -e '1 s/\(.*\)/import spark_rapids_ml.install\n\1/g'
fi


