Submit a horovodjob as a training job.
arena submit horovodjob [flags]
Options:
-a, --annotation stringArray the annotations to add to the job
--cpu string the CPU resource to use for the training, like 1 for 1 core.
-d, --data stringArray specify the datasource to mount to the job, like <name_of_datasource>:<mount_point_on_job>
--data-dir stringArray the data directory to mount. If you specify /data, the host path /data is mounted into the container at /data
-e, --env stringArray the environment variables
--gpus int the GPU count of each worker to run the training.
-h, --help help for horovodjob
--image string the Docker image name for the training job
--memory string the memory resource to use for the training, like 1Gi.
--name string override name
--rdma enable RDMA
--retry int the number of retries.
--sshPort int the SSH port.
--sync-image string the Docker image used to sync the source code
--sync-mode string the sync mode; supported values: rsync, hdfs, git
--sync-source string sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git
--workers int the worker number to run the distributed training. (default 1)
--working-dir string the working directory where the code is extracted. If a sync mode is used, the code is placed under $workingDir/code (default "/root")
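For illustration, here is a minimal sketch of a submission that pulls code from git, using only the options listed above. The job name, image, data source name, and training script path are placeholders, and the trailing quoted training command follows the convention used in other arena submit examples; with the default --working-dir of /root, the cloned repository would sit under /root/code/tf-operator.

    arena submit horovodjob \
        --name=horovod-git-demo \
        --workers=2 \
        --gpus=1 \
        --memory=8Gi \
        --image=registry.example.com/horovod:latest \
        --sync-mode=git \
        --sync-source=https://github.com/kubeflow/tf-operator.git \
        --data=training-data:/data \
        "mpirun python /root/code/tf-operator/train.py"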
Global options (inherited from parent commands):
--arena-namespace string The namespace of the arena system service, like tf-operator (default "arena-system")
--config string Path to a kubeconfig file; only required when running out of cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable CPU profiling
--trace enable tracing
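These options are shared with other arena commands. A short sketch of combining them with a submission follows; the namespace, kubeconfig path, job name, and image are placeholders, and the training script is assumed to be baked into the image:

    arena submit horovodjob \
        --namespace=team-a \
        --config=$HOME/.kube/config \
        --loglevel=debug \
        --name=horovod-ns-demo \
        --workers=2 \
        --gpus=1 \
        --image=registry.example.com/horovod:latest \
        "mpirun python train.py"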
See also:
- arena submit - Submit a job.