
Helpful SRE Information on CodeFlare Stack


Replacing Images in the MCAD or InstaScale Operators

How to replace the running MCAD or InstaScale images. (NOTE: Replacing an image does not guarantee that the newer or older image works with, or has been tested against, the installed CodeFlare stack...)

kubectl edit mcads mcad

or

kubectl edit instascales instascale

Then, under spec, add something like this for MCAD:

spec:
  controllerImage: quay.io/project-codeflare/mcad-controller:main-v1.30.0

or for InstaScale:

spec:
  controllerImage: quay.io/project-codeflare/instascale-controller:v0.0.4
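
To confirm that an override took effect, you can list the images the controller pods are actually running (a quick sketch; it assumes the controllers run in the opendatahub namespace):

kubectl get pods -n opendatahub -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}' |grep -E 'mcad|instascale'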

Changing resources for the MCAD or InstaScale operator (NOTE: ODH 2.1.0+ only!)

Edit the CR for either mcads or instascales like this:

kubectl edit mcads mcad

or

kubectl edit instascales instascale

And then add this under the spec section:

  controllerResources:
    limits:
      cpu: "1"
      memory: 1G
    requests:
      cpu: "1"
      memory: 1G
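
To verify the change landed, you can read the resources back out of the CR:

kubectl get mcads mcad -o yaml |grep -A 6 controllerResources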

CodeFlare Cleanup steps

To completely clean up all the CodeFlare components after an install, follow these steps:

  1. Make sure no appwrappers are left running:

    kubectl get appwrappers -A

    If any are left, delete them before proceeding.
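
    For example (the name and namespace are placeholders for whatever the previous command returned):

    kubectl delete appwrapper <name> -n <namespace>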

  2. Remove the notebook and notebook pvc:

    kubectl delete notebook jupyter-nb-kube-3aadmin -n opendatahub
    kubectl delete pvc jupyterhub-nb-kube-3aadmin-pvc -n opendatahub
  3. Remove the codeflare-stack kfdef

    kubectl delete kfdef codeflare-stack -n opendatahub
  4. Remove the CodeFlare Operator csv and subscription:

    kubectl delete sub codeflare-operator -n openshift-operators
    kubectl delete csv `kubectl get csv -n opendatahub |grep codeflare-operator |awk '{print $1}'` -n openshift-operators
  5. Remove the CodeFlare CRDs

    kubectl delete crd instascales.codeflare.codeflare.dev mcads.codeflare.codeflare.dev schedulingspecs.mcad.ibm.com queuejobs.mcad.ibm.com
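
As a final check, confirm that no CodeFlare-related CRDs remain (the grep should return nothing):

kubectl get crd |grep -E 'codeflare|mcad'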

Installation of CodeFlare with ODH 2.1.0

  1. Install the "Fast" channel of the ODH operator (this currently installs 2.1.0)

  2. Install the GA CodeFlare operator (currently v0.1.0)

  3. Apply the following DataScienceCluster (DSC) for the latest 2.1.0 release:

kubectl apply -f - <<EOF
apiVersion: datasciencecluster.opendatahub.io/v1alpha1
kind: DataScienceCluster
metadata:
  labels:
    app.kubernetes.io/created-by: opendatahub-operator
    app.kubernetes.io/instance: default
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: datasciencecluster
    app.kubernetes.io/part-of: opendatahub-operator
  name: default
spec:
  components:
    codeflare:
      enabled: true
    dashboard:
      enabled: true
    datasciencepipelines:
      enabled: false
    kserve:
      enabled: false
    modelmeshserving:
      enabled: false
    ray:
      enabled: true
    workbenches:
      enabled: true
EOF
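
After applying the DSC, you can watch the components come up (an optional sanity check):

kubectl get pods -n opendatahub -w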

  4. Find the route for the dashboard:

     oc get route -n opendatahub

  5. Open up the dashboard and click: Data Science Projects --> Launch Jupyter --> CodeFlare Notebook --> Start Server

  6. In a Terminal, clone the codeflare-sdk repo:

     git clone https://github.com/project-codeflare/codeflare-sdk.git

All the remaining steps are the same from this point on...

Testing the CodeFlare components from ODH

This section is for running the ODH automated tests against CodeFlare.
Note: you need the following in place before you run the tests:

  • You are logged into your OpenShift cluster (via oc login) so you can run commands
  • ODH operator (Right now, ODH 1.8.0)
  • CodeFlare Operator (Right now, CodeFlare 0.1.0)
  • ODH kfdef applied:
oc apply -f https://raw.githubusercontent.com/opendatahub-io/odh-manifests/master/kfdef/odh-core.yaml -n opendatahub
  • CodeFlare kfdef applied (one of the following two, depending on what you intend to test):
oc apply -f https://raw.githubusercontent.com/opendatahub-io/distributed-workloads/main/codeflare-stack-kfdef.yaml -n opendatahub
or
oc apply -f https://raw.githubusercontent.com/opendatahub-io/odh-manifests/master/kfdef/codeflare-stack-kfdef.yaml -n opendatahub
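
Before running the tests, it's worth confirming that the stack came up (this assumes everything was installed into the opendatahub namespace):

oc get pods -n opendatahub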

Step 1. Download the peak testing suite:

git clone https://github.com/opendatahub-io/peak

Step 2. Change to that directory:

cd peak

Step 3. Initialize peak:

git submodule update --init

Step 4. Create a file with the branch you want to test. For example, for master you'd do this:

echo opendatahub-kubeflow nil https://github.com/opendatahub-io/odh-manifests.git master > master-list

(The format of this line is repo name, nil, repo URL, and the branch to test, so if you wanted to test a branch other than master you'd do something like this:)

echo opendatahub-kubeflow nil https://github.com/anishasthana/odh-manifests.git dw_0.1.1 > anish-011-list

Step 5. Set up the tests so the code is downloaded for you, by running setup against the list you created in Step 4:

./setup.sh -t master-list

Step 6. Run it, substituting your own kubeadmin password, like this:

OPENSHIFT_TESTUSER_NAME=kubeadmin OPENSHIFT_TESTUSER_PASS=sfsdfd-u2WMS-8L9VS-SrVCR ./run.sh codeflare-stack.sh

Step 7. It fires off a notebook, and you should see the mnist pod start. You can watch its progress here:

oc get pods -n opendatahub

And you can follow the pod logs here:

oc logs -f mnistjob-cdjnbmll99swpc-0 -n opendatahub

Example output looks like:

[0]:Validating:  76%|███████▌  | 60/79 [00:02<00:00, 25.84it/s]
[0]:Epoch 3: 100%|██████████| 939/939 [01:02<00:00, 15.07it/s, loss=0.179, v_num=0, val_loss=0.145, val_acc=0.957]
[0]:
[0]:Epoch 3: 100%|██████████| 939/939 [01:02<00:00, 15.07it/s, loss=0.179, v_num=0, val_loss=0.145, val_acc=0.957][0]:
[0]:Epoch 3:   0%|          | 0/939 [00:00<?, ?it/s, loss=0.179, v_num=0, val_loss=0.145, val_acc=0.957]          
[0]:Epoch 4:   0%|          | 0/939 [00:00<?, ?it/s, loss=0.179, v_num=0, val_loss=0.145, val_acc=0.957][0]:

Note 1: Currently the tests fail because the ClusterConfiguration parameters min_worker, max_worker, and gpu have been renamed. You can fix this with vi operator-tests/opendatahub-kubeflow/tests/resources/codeflare-stack/mnist_ray_mini.ipynb by changing the line:

    "cluster = Cluster(ClusterConfiguration(namespace='opendatahub', name='mnisttest', min_worker=2, max_worker=2, min_cpus=2, max_cpus=2, min_memory=4, max_memory=4, gpu=0, instascale=False))"

to

    "cluster = Cluster(ClusterConfiguration(namespace='opendatahub', name='mnisttest', num_workers=2, min_cpus=2, max_cpus=2, min_memory=4, max_memory=4, num_gpus=0, instascale=False))"

Note 2: If your cluster is slow, you can speed up the tests by reducing the epochs from 5 to 2 (or 1). Edit operator-tests/opendatahub-kubeflow/tests/resources/codeflare-stack/mnist.py and change:

    max_epochs=5,
to
    max_epochs=2,
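
Again, a GNU sed alternative to editing by hand:

    sed -i "s/max_epochs=5,/max_epochs=2,/" operator-tests/opendatahub-kubeflow/tests/resources/codeflare-stack/mnist.py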

Note 3: You can find the Ray dashboard with this command:

oc get route -n opendatahub |grep ray-dashboard

and then put that URL in your browser.
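
Or print just the hostname in one step:

oc get route -n opendatahub |grep ray-dashboard |awk '{print $2}'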

Step 8. When it completes successfully, the output will look something like this:

Cleaning extra admin roles

clusterrole.rbac.authorization.k8s.io/admin removed: "kubeadmin"
clusterrole.rbac.authorization.k8s.io/kuberay-operator removed: "kubeadmin"
[INFO] No regex selected tests were run in /root/peak/operator-tests/opendatahub-kubeflow/tests/scripts
Now using project "opendatahub" on server "https://api.jimmed413.cp.fyre.ibm.com:6443".
./run.sh took 797 seconds
[INFO] Exiting with 0

RHODS Installation Instructions

https://github.com/red-hat-data-services/distributed-workloads/blob/main/rhods-installation.md

Once it's up, find the dashboard URL by running:

oc get route -n redhat-ods-applications |grep dash |awk '{print $2}'

Put the URL into a browser and, if prompted, log in with your OpenShift userid and password.

Once you are on your dashboard, you can select "Launch application" on the Jupyter application. This will take you to your notebook spawner page.

RHODS Cleanup steps

To completely clean up all the CodeFlare components (and, optionally, RHODS itself) after an install, follow these steps:

  1. Make sure no appwrappers are left running:

    oc get appwrappers -A

    If any are left, delete them before proceeding (see the example in the CodeFlare cleanup steps above).

  2. Remove the notebook and notebook pvc:

    oc delete notebook jupyter-nb-kube-3aadmin -n rhods-notebooks
    oc delete pvc jupyterhub-nb-kube-3aadmin-pvc -n rhods-notebooks
  3. Remove the clusterrole and clusterrolebindings that were added:

    oc delete ClusterRoleBinding rhods-operator-scc
    oc delete ClusterRole rhods-operator-scc
    
  4. Remove the codeflare-stack kfdef

    oc delete kfdef codeflare-stack -n redhat-ods-applications
  5. Remove the CodeFlare Operator csv and subscription:

    oc delete sub codeflare-operator -n openshift-operators
    oc delete csv `oc get csv -n opendatahub |grep codeflare-operator |awk '{print $1}'` -n openshift-operators
  6. Remove the CodeFlare CRDs

    oc delete crd instascales.codeflare.codeflare.dev mcads.codeflare.codeflare.dev schedulingspecs.mcad.ibm.com queuejobs.mcad.ibm.com
  7. If you're also removing the RHODS kfdefs and operator, do the following:

    7.1 Delete all the kfdefs: (Note: this can take a while, as it needs to stop all the running pods in redhat-ods-applications, redhat-ods-monitoring, and rhods-notebooks)

    oc delete kfdef rhods-anaconda rhods-dashboard rhods-data-science-pipelines-operator rhods-model-mesh rhods-nbc -n redhat-ods-applications
    oc delete kfdef modelmesh-monitoring monitoring -n redhat-ods-monitoring
    oc delete kfdef rhods-notebooks -n rhods-notebooks
    

    7.2 And then delete the subscription and the csv:

    oc delete sub rhods-operator -n redhat-ods-operator
    oc delete csv `oc get csv -n redhat-ods-operator |grep rhods-operator |awk '{print $1}'` -n redhat-ods-operator
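
Finally, an optional sanity pass to confirm nothing was left behind (both commands should come back empty, or report that the resource type no longer exists):

oc get csv -A |grep -E 'rhods|codeflare'
oc get kfdef -A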