Helpful SRE Information on CodeFlare Stack
- Replacing Images in MCAD operator or InstaScale operators
- Changing resources for MCAD operator or InstaScale operator - NOTE, ODH 2.1.0+ only!
- CodeFlare Cleanup steps
- Installation of CodeFlare with ODH 2.1.0
- Testing the CodeFlare components from the ODH
- RHODS Installation instructions
- RHODS Cleanup steps
Replacing Images in MCAD operator or InstaScale operators
To replace the existing MCAD or InstaScale image, edit the corresponding CR. (NOTE: Replacing an image does not guarantee that the newer or older image works with, or has been tested against, the installed CodeFlare stack.)
kubectl edit mcads mcad
kubectl edit instascales instascale
Then, under spec:, add something like this for MCAD:
controllerImage: quay.io/project-codeflare/mcad-controller:main-v1.30.0
or for InstaScale:
controllerImage: quay.io/project-codeflare/instascale-controller:v0.0.4
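If you would rather not open an editor, the same change can probably be made with a merge patch. The sketch below only builds the patch body. `build_image_patch` is a hypothetical helper name, the `spec.controllerImage` field is taken from the examples above, and the `kubectl patch` call in the comment is an untested assumption about how the CR accepts the patch.

```shell
# Hypothetical helper: print a JSON merge patch that sets spec.controllerImage.
# The image reference is passed through as-is; no validation is done.
build_image_patch() {
  printf '{"spec":{"controllerImage":"%s"}}\n' "$1"
}

# Possible usage against a live cluster (assumption, not run here):
#   kubectl patch mcads mcad --type merge \
#     -p "$(build_image_patch quay.io/project-codeflare/mcad-controller:main-v1.30.0)"
```

As with the manual edit, patching in an image does not mean it has been tested with the installed stack.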
Changing resources for MCAD operator or InstaScale operator (NOTE: ODH 2.1.0+ only)
Edit the CR for either MCAD or InstaScale like this:
kubectl edit mcads mcad
kubectl edit instascales instascale
And then add this under the spec section:
cpu: "1"
memory: 1G
cpu: "1"
memory: 1G
CodeFlare Cleanup steps
To completely clean up all the CodeFlare components after an install, follow these steps:
No appwrappers should be left running:
kubectl get appwrappers -A
If any are left, delete them.
Remove the notebook and notebook pvc:
kubectl delete notebook jupyter-nb-kube-3aadmin -n opendatahub
kubectl delete pvc jupyterhub-nb-kube-3aadmin-pvc -n opendatahub
Remove the codeflare-stack kfdef
kubectl delete kfdef codeflare-stack -n opendatahub
Remove the CodeFlare Operator csv and subscription:
kubectl delete sub codeflare-operator -n openshift-operators
kubectl delete csv `kubectl get csv -n openshift-operators |grep codeflare-operator |awk '{print $1}'` -n openshift-operators
Remove the CodeFlare CRDs
kubectl delete crd instascales.codeflare.codeflare.dev mcads.codeflare.codeflare.dev schedulingspecs.mcad.ibm.com queuejobs.mcad.ibm.com
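The backtick pipeline in the CSV step picks the CSV name out of the `kubectl get csv` listing. As a rough illustration of what it selects, the function below applies the same grep and awk to text on standard input. `csv_name` is a hypothetical name, and the function is only a sketch for checking the name before deleting anything.

```shell
# Hypothetical helper: read a `kubectl get csv` listing on stdin and print
# the first column (the CSV name) of every line mentioning codeflare-operator.
csv_name() {
  grep codeflare-operator | awk '{print $1}'
}

# Possible usage (assumption, not run here):
#   kubectl get csv -n openshift-operators | csv_name
```

Printing the name first makes it easier to confirm that exactly one CSV matches.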
Installation of CodeFlare with ODH 2.1.0
Install the "Fast" channel of the ODH operator (gets 2.1.0)
Install GA CodeFlare operator (currently gets v0.1.0)
Apply the following DSC for the latest 2.1.0 release:
kubectl apply -f - <<EOF
apiVersion: datasciencecluster.opendatahub.io/v1alpha1
kind: DataScienceCluster
app.kubernetes.io/created-by: opendatahub-operator
app.kubernetes.io/instance: default
app.kubernetes.io/managed-by: kustomize
app.kubernetes.io/name: datasciencecluster
app.kubernetes.io/part-of: opendatahub-operator
name: default
enabled: true
enabled: true
enabled: false
enabled: false
enabled: false
enabled: true
enabled: true
Find the route for the dashboard:
oc get route -n opendatahub
Open the dashboard and click on: Data Science Projects --> Launch Jupyter --> Codeflare Notebook --> Start Server
In a terminal, clone the codeflare-sdk:
git clone https://github.com/project-codeflare/codeflare-sdk.git
Everything is the same from this point on.
Testing the CodeFlare components from the ODH
Use this section if you want to run the ODH automated tests against CodeFlare.
Note: You need to have the following in place before you run the tests:
- Logged in to your OpenShift cluster with oc login, so you can run commands
- ODH operator (right now, ODH 1.8.0)
- CodeFlare Operator (Right now, CodeFlare 0.1.0)
- ODH kfdef applied:
oc apply -f https://raw.githubusercontent.com/opendatahub-io/odh-manifests/master/kfdef/odh-core.yaml -n opendatahub
- CodeFlare kfdef applied (either one of the two, depending on what you intend to test):
oc apply -f https://raw.githubusercontent.com/opendatahub-io/distributed-workloads/main/codeflare-stack-kfdef.yaml -n opendatahub
oc apply -f https://raw.githubusercontent.com/opendatahub-io/odh-manifests/master/kfdef/codeflare-stack-kfdef.yaml -n opendatahub
Step 1. You need to download the peak testing suite:
git clone https://github.com/opendatahub-io/peak
Step 2. Change to that directory:
cd peak
Step 3. Initialize peak:
git submodule update --init
Step 4. Create a file with the branch you want to test. For example, for master you'd do this:
echo opendatahub-kubeflow nil https://github.com/opendatahub-io/odh-manifests.git master > master-list
(The format of this command is to list the repo and then the branch you want to test, so if you wanted to test against a branch different than master you'd do something like this:)
echo opendatahub-kubeflow nil https://github.com/anishasthana/odh-manifests.git dw_0.1.1 > anish-011-list
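Judging from the two examples, each list line has four space-separated fields: a name, the literal `nil`, the repository URL, and the branch. A small helper like the one below can generate the line so that only the repository and branch vary. The name `peak_list_line` is hypothetical, and the meaning of the `nil` field is not covered here, so it is copied verbatim.

```shell
# Hypothetical helper: print one peak list line for a given repo URL and branch.
# The first two fields are copied verbatim from the examples above.
peak_list_line() {
  printf 'opendatahub-kubeflow nil %s %s\n' "$1" "$2"
}

# Possible usage:
#   peak_list_line https://github.com/opendatahub-io/odh-manifests.git master > master-list
```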
Step 5. Set up your test so the code is downloaded for you by running setup against the list you created in step 4:
./setup.sh -t master-list
Step 6. Run it, substituting the kubeadmin password below, like this:
OPENSHIFT_TESTUSER_NAME=kubeadmin OPENSHIFT_TESTUSER_PASS=sfsdfd-u2WMS-8L9VS-SrVCR ./run.sh codeflare-stack.sh
Step 7. It'll fire off a notebook and you should see the mnist pod start. List the pods to see where it's at:
oc get pods -n opendatahub
And you can follow the pod's log like this:
oc logs -f mnistjob-cdjnbmll99swpc-0 -n opendatahub
Example output looks like:
[0]:Validating: 76%|███████▌ | 60/79 [00:02<00:00, 25.84it/s]
[0]:Epoch 3: 100%|██████████| 939/939 [01:02<00:00, 15.07it/s, loss=0.179, v_num=0, val_loss=0.145, val_acc=0.957]
[0]:Epoch 3: 100%|██████████| 939/939 [01:02<00:00, 15.07it/s, loss=0.179, v_num=0, val_loss=0.145, val_acc=0.957][0]:
[0]:Epoch 3: 0%| | 0/939 [00:00<?, ?it/s, loss=0.179, v_num=0, val_loss=0.145, val_acc=0.957]
[0]:Epoch 4: 0%| | 0/939 [00:00<?, ?it/s, loss=0.179, v_num=0, val_loss=0.145, val_acc=0.957][0]:
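The pod name in the `oc logs` command has a generated suffix, so it will differ on your cluster. One way to pick it out of the pod listing is sketched below. `first_mnist_pod` is a hypothetical helper, and the `mnistjob-` prefix is assumed from the example name above.

```shell
# Hypothetical helper: read an `oc get pods` listing on stdin and print the
# name of the first pod whose name starts with "mnistjob-".
first_mnist_pod() {
  awk '$1 ~ /^mnistjob-/ {print $1; exit}'
}

# Possible usage (assumption, not run here):
#   oc logs -f "$(oc get pods -n opendatahub --no-headers | first_mnist_pod)" -n opendatahub
```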
Note 1: Currently the tests aren't working due to the min_worker, max_worker, and gpu parameters. You can fix this by editing the notebook:
vi operator-tests/opendatahub-kubeflow/tests/resources/codeflare-stack/mnist_ray_mini.ipynb
and changing the line:
"cluster = Cluster(ClusterConfiguration(namespace='opendatahub', name='mnisttest', min_worker=2, max_worker=2, min_cpus=2, max_cpus=2, min_memory=4, max_memory=4, gpu=0, instascale=False))"
to:
"cluster = Cluster(ClusterConfiguration(namespace='opendatahub', name='mnisttest', num_workers=2, min_cpus=2, max_cpus=2, min_memory=4, max_memory=4, num_gpus=0, instascale=False))"
Note 2: If your cluster is too slow, you can speed up the tests by reducing the epochs from 5 to 2 or 1 in this file:
vi operator-tests/opendatahub-kubeflow/tests/resources/codeflare-stack/mnist.py
Note 3: You can find the Ray dashboard with this command:
oc get route -n opendatahub |grep ray-dashboard
and then put that in your browser
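The substitution described in Note 1 can also be scripted instead of edited by hand. This is a sketch using `sed -E`. `fix_cluster_args` is a hypothetical name, and it assumes the arguments appear with exactly the order and spacing shown in Note 1.

```shell
# Hypothetical helper: rewrite the old ClusterConfiguration arguments on stdin.
#   min_worker=N, max_worker=M  ->  num_workers=N   (keeps the min_worker value)
#   gpu=N                       ->  num_gpus=N
fix_cluster_args() {
  sed -E \
    -e 's/min_worker=([0-9]+), max_worker=[0-9]+/num_workers=\1/' \
    -e 's/, gpu=/, num_gpus=/'
}

# Possible usage (writes to a new file rather than editing in place):
#   fix_cluster_args < mnist_ray_mini.ipynb > mnist_ray_mini.fixed.ipynb
```

Writing to a separate file keeps the original notebook around in case the pattern does not match as expected.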
Step 8. When it's complete and successful, it'll look something like this:
Cleaning extra admin roles
clusterrole.rbac.authorization.k8s.io/admin removed: "kubeadmin"
clusterrole.rbac.authorization.k8s.io/kuberay-operator removed: "kubeadmin"
[INFO] No tests were selected by regex in /root/peak/operator-tests/opendatahub-kubeflow/tests/scripts
Now using project "opendatahub" on server "https://api.jimmed413.cp.fyre.ibm.com:6443".
./run.sh took 797 seconds
[INFO] Exiting with 0
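If you capture the run output to a file, the final `[INFO] Exiting with N` line can be checked automatically. The sketch below assumes that line has exactly the form shown above. `peak_exit_code` and the `run.log` file name are hypothetical.

```shell
# Hypothetical helper: read run.sh output on stdin and print the number from
# the last "[INFO] Exiting with N" line (prints nothing if there is none).
peak_exit_code() {
  sed -n 's/^\[INFO\] Exiting with \([0-9][0-9]*\)$/\1/p' | tail -n 1
}

# Possible usage, assuming the output was saved to a file named run.log:
#   [ "$(peak_exit_code < run.log)" = "0" ] && echo "tests passed"
```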
RHODS Installation instructions
Once it's up, find the dashboard UI with:
oc get route -n redhat-ods-applications |grep dash |awk '{print $2}'
Put the URL into a browser and, if prompted, log in with your OpenShift user ID and password.
Once you are on your dashboard, you can select "Launch application" on the Jupyter application. This will take you to your notebook spawner page.
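The `awk '{print $2}'` in the route command prints the second column of the route listing, which is the route's host. As a sketch, the helper below does the same on standard input and prefixes `https://` so the result can be pasted straight into a browser. The helper name `dashboard_url` and the assumption that the route is served over HTTPS are not from the steps above.

```shell
# Hypothetical helper: read an `oc get route` listing on stdin and print an
# https:// URL built from the host column of the first line matching "dash".
# Assumes the route is exposed over HTTPS.
dashboard_url() {
  grep dash | awk '{print "https://" $2; exit}'
}

# Possible usage (assumption, not run here):
#   oc get route -n redhat-ods-applications | dashboard_url
```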
RHODS Cleanup steps
To completely clean up all the CodeFlare components after an install, follow these steps:
No appwrappers should be left running:
oc get appwrappers -A
If any are left, delete them.
Remove the notebook and notebook pvc:
oc delete notebook jupyter-nb-kube-3aadmin -n rhods-notebooks
oc delete pvc jupyterhub-nb-kube-3aadmin-pvc -n rhods-notebooks
Remove the clusterrole and clusterrolebindings that were added:
oc delete ClusterRoleBinding rhods-operator-scc
oc delete ClusterRole rhods-operator-scc
Remove the codeflare-stack kfdef
oc delete kfdef codeflare-stack -n redhat-ods-applications
Remove the CodeFlare Operator csv and subscription:
oc delete sub codeflare-operator -n openshift-operators
oc delete csv `oc get csv -n openshift-operators |grep codeflare-operator |awk '{print $1}'` -n openshift-operators
Remove the CodeFlare CRDs
oc delete crd instascales.codeflare.codeflare.dev mcads.codeflare.codeflare.dev schedulingspecs.mcad.ibm.com queuejobs.mcad.ibm.com
If you're removing the RHODS kfdefs and operator, you'd want to do this:
7.1 Delete all the kfdefs: (Note, this can take a while as it needs to stop all the running pods in redhat-ods-applications, redhat-ods-monitoring and rhods-notebooks)
oc delete kfdef rhods-anaconda rhods-dashboard rhods-data-science-pipelines-operator rhods-model-mesh rhods-nbc
oc delete kfdef modelmesh-monitoring monitoring -n redhat-ods-monitoring
oc delete kfdef rhods-notebooks -n rhods-notebooks
7.2 And then delete the subscription and the csv:
oc delete sub rhods-operator -n redhat-ods-operator
oc delete csv `oc get csv -n redhat-ods-operator |grep rhods-operator |awk '{print $1}'` -n redhat-ods-operator
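Because these deletes are destructive, it can help to preview them first. The wrapper below is a generic shell sketch, not a feature of `oc`. `run` and the `DRY_RUN` variable are hypothetical names. With `DRY_RUN=1` it prints each command instead of executing it.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'would run: %s\n' "$*"
  else
    "$@"
  fi
}

# Possible usage:
#   DRY_RUN=1
#   run oc delete sub rhods-operator -n redhat-ods-operator
```

Once the printed commands look right, unset `DRY_RUN` and run the same lines again to apply them.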