This example shows how to use Arena to run a machine learning training job whose source code is downloaded from a git URL.

1. First, check the available resources

# arena top node
NAME                    IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
i-j6c68vrtpvj708d9x6j0  192.168.1.116  master  0           0
i-j6c8ef8d9sqhsy950x7x  192.168.1.119  worker  1           0
i-j6c8ef8d9sqhsy950x7y  192.168.1.120  worker  1           0
i-j6c8ef8d9sqhsy950x7z  192.168.1.118  worker  1           0
i-j6ccue91mx9n2qav7qsm  192.168.1.115  master  0           0
i-j6ce09gzdig6cfcy1lwr  192.168.1.117  master  0           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0%)

Three worker nodes each have one unallocated GPU available for running training jobs.
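If you want to check free capacity from a script before submitting, the table can be summed with awk. This is a minimal sketch that assumes the five-column layout printed above; `free_gpus` is a hypothetical helper name, not part of arena.

```shell
# Read `arena top node` output on stdin and print the number of
# unallocated GPUs (sum of GPU(Total) minus sum of GPU(Allocated)).
# Assumption: columns 4 and 5 hold the two GPU counts, as in the table above.
# Header, separator, and summary lines are skipped because their 4th and
# 5th fields are not plain integers.
free_gpus() {
  awk '$4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ { total += $4; used += $5 }
       END { print total - used }'
}
```

Usage: `arena top node | free_gpus`.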

2. Submit a training job with arena. It will download the source code from the git repository given by --sync-source

# arena submit tf \
             --name=tf-git \
             --gpus=1 \
             --image=tensorflow/tensorflow:1.5.0-devel-gpu \
             --sync-mode=git \
             --sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
             "python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 10000 --data_dir=code/tensorflow-sample-code/data"
configmap/tf-git-tfjob created
configmap/tf-git-tfjob labeled
tfjob.kubeflow.org/tf-git created
INFO[0000] The Job tf-git has been submitted successfully
INFO[0000] You can run `arena get tf-git --type tfjob` to check the job status

The source code is downloaded and extracted to the code/ directory under the working directory. The default working directory is /root; you can change it with --workingDir. You can also choose the branch to pull from by adding --env GIT_SYNC_BRANCH=main to the parameters when submitting the job. If you are using a private git repository, use the following command:

# arena submit tf \
             --name=tf-git \
             --gpus=1 \
             --image=tensorflow/tensorflow:1.5.0-devel-gpu \
             --sync-mode=git \
             --sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
             --env=GIT_SYNC_USERNAME=yourname \
             --env=GIT_SYNC_PASSWORD=yourpwd \
             "python code/tensorflow-sample-code/tfjob/docker/mnist/main.py"

Note: arena uses git-sync to synchronize the source code, so you can set any of the environment variables defined by the git-sync project.
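To keep credentials out of scripts and shell history, the private-repository submit can be wrapped in a small function that reads them from the environment. This is only a sketch: `GIT_USER`, `GIT_PASS`, `GIT_BRANCH`, and `ARENA` are hypothetical variable names chosen here, and the flags follow the `--sync-mode`/`--sync-source` spelling of the first submit command. The values are still passed to arena as `--env` arguments, so they remain visible in the job specification.

```shell
# Submit the sample job from a private repository.
# GIT_USER / GIT_PASS must be set by the caller; GIT_BRANCH defaults to main.
# ARENA may be set to another command (for example `echo`) for a dry run.
submit_private_repo() {
  # Abort with a message if either credential is missing.
  : "${GIT_USER:?set GIT_USER first}" "${GIT_PASS:?set GIT_PASS first}"
  "${ARENA:-arena}" submit tf \
    --name=tf-git \
    --gpus=1 \
    --image=tensorflow/tensorflow:1.5.0-devel-gpu \
    --sync-mode=git \
    --sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
    --env=GIT_SYNC_USERNAME="$GIT_USER" \
    --env=GIT_SYNC_PASSWORD="$GIT_PASS" \
    --env=GIT_SYNC_BRANCH="${GIT_BRANCH:-main}" \
    "python code/tensorflow-sample-code/tfjob/docker/mnist/main.py"
}
```

Running it with `ARENA=echo` prints the arguments that would be passed instead of submitting the job.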

3. List all the jobs

# arena list
NAME    STATUS   TRAINER  AGE  NODE
tf-git  RUNNING  tfjob    0s   192.168.1.120
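In automation you may want to wait until the job shows up as RUNNING before moving on. The following sketch parses the table above; `job_status` and `wait_until_running` are hypothetical helper names, and the only status value tested is the RUNNING value shown in this guide.

```shell
# Print the STATUS column for one job from `arena list` output on stdin.
# Assumes the NAME/STATUS/TRAINER/AGE/NODE layout shown above.
job_status() {
  awk -v job="$1" 'NR > 1 && $1 == job { print $2; exit }'
}

# Poll `arena list` every 5 seconds until the job reports RUNNING.
# $1 = job name, $2 = maximum number of checks (default 60).
# Returns 0 once RUNNING is seen, 1 if the checks run out.
wait_until_running() {
  job=$1 tries=${2:-60}
  while [ "$tries" -gt 0 ]; do
    [ "$(arena list | job_status "$job")" = "RUNNING" ] && return 0
    tries=$((tries - 1))
    sleep 5
  done
  return 1
}
```

Usage: `wait_until_running tf-git && arena logs tf-git`.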

4. Check the resource usage of the job

# arena top job
NAME    STATUS   TRAINER  AGE  NODE           GPU(Requests)  GPU(Allocated)
tf-git  RUNNING  TFJOB    17s  192.168.1.120  1              1


Total Allocated GPUs of Training Job:
1

Total Requested GPUs of Training Job:
1

5. Check the resource usage of the cluster

# arena top node
NAME                    IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
i-j6c68vrtpvj708d9x6j0  192.168.1.116  master  0           0
i-j6c8ef8d9sqhsy950x7x  192.168.1.119  worker  1           0
i-j6c8ef8d9sqhsy950x7y  192.168.1.120  worker  1           1
i-j6c8ef8d9sqhsy950x7z  192.168.1.118  worker  1           0
i-j6ccue91mx9n2qav7qsm  192.168.1.115  master  0           0
i-j6ce09gzdig6cfcy1lwr  192.168.1.117  master  0           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
1/3 (33%)
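Once a job is running, the same table tells you which nodes can still accept GPU work. A small sketch, again assuming the column layout above (`nodes_with_free_gpu` is a hypothetical name):

```shell
# Read `arena top node` output on stdin and print the IP address of every
# node whose GPU(Total) exceeds its GPU(Allocated).
# Non-data lines are skipped because columns 4 and 5 are not integers.
nodes_with_free_gpu() {
  awk '$4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ && ($4 + 0) > ($5 + 0) { print $2 }'
}
```

Usage: `arena top node | nodes_with_free_gpu`. For the output above this would list 192.168.1.119 and 192.168.1.118.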

6. Get the details of the specific job

# arena get tf-git
NAME    STATUS   TRAINER  AGE  INSTANCE               NODE
tf-git  RUNNING  TFJOB    5s   tf-git-tfjob-worker-0  192.168.1.120

7. Check logs

# arena logs tf-git
2018-07-22T23:56:20.841129509Z WARNING:tensorflow:From code/tensorflow-sample-code/tfjob/docker/mnist/main.py:119: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
2018-07-22T23:56:20.841211064Z Instructions for updating:
2018-07-22T23:56:20.841217002Z
2018-07-22T23:56:20.841221287Z Future major versions of TensorFlow will allow gradients to flow
2018-07-22T23:56:20.841225581Z into the labels input on backprop by default.
2018-07-22T23:56:20.841229492Z
...
2018-07-22T23:57:11.842929868Z Accuracy at step 920: 0.967
2018-07-22T23:57:11.842933859Z Accuracy at step 930: 0.9646
2018-07-22T23:57:11.842937832Z Accuracy at step 940: 0.967
2018-07-22T23:57:11.842941362Z Accuracy at step 950: 0.9674
2018-07-22T23:57:11.842945487Z Accuracy at step 960: 0.9693
2018-07-22T23:57:11.842949067Z Accuracy at step 970: 0.9687
2018-07-22T23:57:11.842952818Z Accuracy at step 980: 0.9688
2018-07-22T23:57:11.842956775Z Accuracy at step 990: 0.9649
2018-07-22T23:57:11.842961076Z Adding run metadata for 999
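To pull the most recent accuracy out of a long log, you can filter for the "Accuracy at step" lines. This sketch assumes the log format shown above, which comes from the sample training script; `last_accuracy` is a hypothetical helper name.

```shell
# Read job logs on stdin and print "<step> <accuracy>" for the last
# line of the form "... Accuracy at step N: X". Prints nothing if no
# such line is present.
last_accuracy() {
  awk '/Accuracy at step [0-9]+:/ { step = $(NF-1); acc = $NF }
       END { if (step != "") { sub(/:$/, "", step); print step, acc } }'
}
```

Usage: `arena logs tf-git | last_accuracy`.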

8. View more information about the training job in the logviewer

# arena logviewer tf-git
Your LogViewer will be available on:
192.168.1.120:8080/tfjobs/ui/#/default/tf-git-tfjob

Congratulations! You've successfully run your first training job with arena.