Vega is highly modularized: the search space, search algorithm, and pipeline are all built through configuration. Running a Vega application means loading a configuration file and completing the AutoML process according to that configuration, as shown in the following figure.
import vega

if __name__ == "__main__":
    vega.run("./main.yml")
The following describes the configuration items in the main.yml file in detail.
Vega configuration is divided into two parts:
- General configuration, which sets common items such as the output path and log level.
- Pipeline configuration, which consists of two parts:
    - Pipeline definition. The configuration item is named pipeline and is a list of all steps in the pipeline.
    - Step definitions. Each step is configured under the step name listed in pipeline.
general:
    # general configuration

# Define the pipeline.
pipeline: [my_nas, my_hpo, my_data_augmentation, my_fully_train]

# Define each step. See the following sections for details.
my_nas:
    # NAS configuration
my_hpo:
    # HPO configuration
my_data_augmentation:
    # Data augmentation configuration
my_fully_train:
    # fully train configuration
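The relationship between the pipeline list and the step sections can be checked mechanically. The sketch below is illustrative only: it works on a plain Python dictionary that mirrors the pipeline definition above, and the helper name `check_pipeline` is hypothetical, not part of Vega.

```python
def check_pipeline(config):
    """Return the step names listed in `pipeline` that have no section of their own."""
    steps = config.get("pipeline", [])
    return [name for name in steps if name not in config]


# A dictionary mirroring the skeleton above (step bodies left empty).
config = {
    "general": {},
    "pipeline": ["my_nas", "my_hpo", "my_data_augmentation", "my_fully_train"],
    "my_nas": {},
    "my_hpo": {},
    "my_data_augmentation": {},
    "my_fully_train": {},
}

missing = check_pipeline(config)
print("missing step definitions:", missing)
```

An empty list means every step named in pipeline has a matching section.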
The following describes each configuration item in detail.
The following general configuration items are available:
Configuration Item | Description |
local_base_path | Working path. Each run creates a subfolder named with a time-based task id under this path, so the output of one run does not overwrite another. The task id subfolder contains two subfolders: output, which stores the output data of each pipeline step, and worker, which stores temporary information. In a cluster, set this path to an EFS path that every computing node can access, so that nodes can share data. |
backup_base_path | Backup path, used in the cloud channel environment or a cluster environment. The output and task files in the local path are backed up to this path. |
timeout | Worker timeout, in hours. A worker whose task does not finish within this time is forcibly terminated. The default value is 10. |
devices_per_job | Number of GPUs used by each worker in the search phase. -1 means that one worker uses all GPUs of the node, 1 means that one worker uses one GPU, 2 means that one worker uses two GPUs, and so on. |
logger.level | Log level: debug, info, warn, error, or critical. The default level is info. |
cluster.master_ip | IP address of the master node. Must be set in the cluster scenario. |
cluster.listen_port | Listening port used in the cluster scenario. If port 8000 is occupied, set a different port. |
cluster.slaves | IP addresses of all nodes other than the master node. Must be set in the cluster scenario. |
general:
    local_base_path: "./tasks"
    backup_base_path: ~
    devices_per_job: -1
    logger:
        level: info
    cluster:
        master_ip: ~
        listen_port: 8000
        slaves: []
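To make the directory layout described for local_base_path concrete, here is a minimal sketch. It is not Vega's implementation: the function `make_task_dirs` and the timestamp format used for the task id are assumptions chosen for illustration.

```python
import os
import tempfile
from datetime import datetime


def make_task_dirs(local_base_path, task_id=None):
    """Create <local_base_path>/<task_id>/{output,worker} and return the task folder."""
    if task_id is None:
        # Assumed format: a time-based id built from month, day, time and milliseconds.
        task_id = datetime.now().strftime("%m%d.%H%M%S.%f")[:-3]
    task_dir = os.path.join(local_base_path, task_id)
    for sub in ("output", "worker"):
        os.makedirs(os.path.join(task_dir, sub), exist_ok=True)
    return task_dir


# Use a temporary directory as the base path for the demonstration.
base = tempfile.mkdtemp()
task_dir = make_task_dirs(base)
print(sorted(os.listdir(task_dir)))
```

Because every run gets its own task folder, repeated runs under the same base path do not overwrite each other.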
NAS configuration items include:
Configuration Item | Description |
pipe_step | Step type. The value is fixed to NasPipeStep. |
search_algorithm | Search algorithm configuration item. For details, see the configuration of each NAS algorithm. |
search_space | For details about the definition of the search space, see each NAS algorithm. |
trainer | Trainer configuration information. For details, see the Trainer Configuration. |
dataset | Dataset configuration. For details, see the Dataset Configuration. |
The NAS algorithms mentioned above include Prune-EA, Quant-EA, SM-NAS (coming soon), CARS, Segmentation-Adelaide-EA, SR-EA, and ESR-EA.
The following is the configuration of the BackboneNas algorithm:
my_nas:
    pipe_step:
        type: NasPipeStep
    search_algorithm:                   # Search algorithm configuration. Required in NAS steps.
        type: BackboneNas               # Search algorithm type.
        codec: BackboneNasCodec         # Codec used by the BackboneNas search algorithm.
        policy:                         # For details about the supported parameters, see the related algorithm documents.
            num_mutate: 10
            random_ratio: 0.2
        range:
            max_sample: 100
            min_sample: 10
    search_space:                       # Search space, as defined in the related algorithm document. Required in NAS steps.
        type: SearchSpace
        modules: ['backbone', 'head']   # Modules describe how the network is assembled.
        backbone:                       # Each module has its own configuration item.
            ResNetVariant:              # For details, see the description of each algorithm.
                base_depth: [18, 34, 50, 101]
                base_channel: [32, 48, 56, 64]
                doublechannel: [3, 4]
                downsample: [3, 4]
        head:
            LinearClassificationHead:
                num_classes: [10]
    trainer:
        type: Trainer
    dataset:
        type: Cifar10
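As a rough illustration of what the search_space above describes, the sketch below enumerates every combination of the ResNetVariant value lists and draws one at random. It only mirrors the value lists from the example; how BackboneNas actually encodes and samples candidates is defined by its codec, which is not reproduced here.

```python
import itertools
import random

# Value lists copied from the ResNetVariant section of the example.
backbone_space = {
    "base_depth": [18, 34, 50, 101],
    "base_channel": [32, 48, 56, 64],
    "doublechannel": [3, 4],
    "downsample": [3, 4],
}

keys = list(backbone_space)
# Cartesian product of all value lists: one dict per candidate backbone.
candidates = [
    dict(zip(keys, values))
    for values in itertools.product(*(backbone_space[k] for k in keys))
]

print(len(candidates))  # 4 * 4 * 2 * 2 = 64 combinations
sample = random.choice(candidates)
print(sample)
```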
The optional modules of the search_space configuration item are as follows:
Module | Option | Description | Algorithm Reference |
backbone | PruneResNet | ResNet variant network, which is used to support the prune operation. | ref |
backbone | QuantResNet | ResNet variant network, which is used to support quantization operations. | ref |
backbone | ResNetVariant | The ResNet variant network is used to support architecture adjustment operations such as down-sampling point adjustment. | ref |
head | LinearClassificationHead | Network classification layer used to classify tasks, which can be concatenated with ResNetVariant. | |
head | CurveLaneHead | Detection head used for lane detection. | |
neck | FeatureFusionModule | Feature fusion network in the lane detection task. | |
detector | AutoLaneDetector | AutoLaneDetector detection network in the lane detection task. | |
super_network | DartsNetwork | Super network structure in the Darts algorithm. | ref |
super_network | CARSDartsNetwork | Super network structure in the CARS algorithm. | ref |
custom | AdelaideFastNAS | Indicates the user-defined network structure in the AdelaideFastNAS algorithm. | ref |
custom | MtMSR | Indicates the user-defined network structure in the MtMSR algorithm. | ref |
HPO optimizes the hyperparameters used to train a model. It does not involve network architecture parameters. The searchable items are as follows:
- Batch size of the dataset.
- Optimization method and related parameters.
- Learning rate.
- Momentum.
The HPO configuration items are as follows:
Configuration Item | Description |
pipe_step | The value is fixed at NasPipeStep. |
hpo | Configure the type and domain_space parameters. The former defines the HPO algorithm to be used. For details, see the HPO. The latter defines the hyperparameter information to be searched for. |
trainer | Trainer configuration information. For details, see the Trainer Configuration. |
dataset | Dataset configuration. For details, see the Dataset Configuration. |
evaluator | Evaluator information. For details, see each HPO algorithm example or the Benchmark configuration. |
The HPO configuration of the ASHA algorithm is shown below for reference:
my_hpo:
    pipe_step:
        type: NasPipeStep
    hpo:
        type: AshaHpo
        policy:
            total_epochs: 81
            config_count: 40
        hyperparameter_space:
            hyperparameters:
                - key: dataset.batch_size
                  type: INT_CAT
                  range: [8, 16, 32, 64, 128, 256]
                - key:
                  range: [0.00001, 0.1]
                - key: trainer.optim.type
                  type: STRING
                  range: ['Adam', 'SGD']
                - key: trainer.optim.momentum
                  type: FLOAT
                  range: [0.0, 0.99]
            condition:
                - key: condition_for_sgd_momentum
                  child: trainer.optim.momentum
                  parent: trainer.optim.type
                  type: EQUAL
                  range: ["SGD"]
    model:
        model_desc:
            modules: ["backbone", "head"]
            backbone:
                base_channel: 64
                downsample: [0, 0, 1, 0, 1, 0, 1, 0]
                base_depth: 18
                doublechannel: [0, 0, 1, 0, 1, 0, 1, 0]
                name: ResNetVariant
            head:
                num_classes: 10
                name: LinearClassificationHead
                base_channel: 512
    dataset:
        type: Cifar10
    trainer:
        type: Trainer
    evaluator:
        type: Evaluator
        gpu_evaluator:
            type: GpuEvaluator
            metric:
                type: accuracy
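The condition_for_sgd_momentum entry in the example ties trainer.optim.momentum to trainer.optim.type: momentum is only searched when the optimizer is SGD. The following sketch shows one way such an EQUAL condition could be applied when drawing a sample. It is an illustration under stated assumptions, not Vega's sampler: `sample_config` is a hypothetical helper, values are drawn uniformly, and only three of the hyperparameters are included.

```python
import random

hyperparameters = [
    {"key": "dataset.batch_size", "type": "INT_CAT", "range": [8, 16, 32, 64, 128, 256]},
    {"key": "trainer.optim.type", "type": "STRING", "range": ["Adam", "SGD"]},
    {"key": "trainer.optim.momentum", "type": "FLOAT", "range": [0.0, 0.99]},
]
conditions = [
    {"child": "trainer.optim.momentum", "parent": "trainer.optim.type",
     "type": "EQUAL", "range": ["SGD"]},
]


def sample_config(hyperparameters, conditions, rng=random):
    """Draw one value per hyperparameter, then drop children whose condition fails."""
    sample = {}
    for hp in hyperparameters:
        if hp["type"] == "FLOAT":
            low, high = hp["range"]
            sample[hp["key"]] = rng.uniform(low, high)
        else:  # categorical types: pick one of the listed values
            sample[hp["key"]] = rng.choice(hp["range"])
    for cond in conditions:
        if cond["type"] == "EQUAL" and sample.get(cond["parent"]) not in cond["range"]:
            sample.pop(cond["child"], None)
    return sample


print(sample_config(hyperparameters, conditions))
```

A sample that selects Adam therefore carries no momentum value, while a sample that selects SGD does.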
The configuration of data augmentation includes:
Configuration Item | Description |
pipe_step | The value is fixed at NasPipeStep. |
hpo | Currently, only the PBA algorithm is supported, so the value is fixed to PBAHpo. For details, see the PBA algorithm document. |
trainer | Trainer configuration information. For details, see the Trainer Configuration. |
dataset | Dataset configuration. For details, see the Dataset Configuration. |
The following shows the configuration of the PBA algorithm for reference:
my_data_augmentation:
    pipe_step:
        type: NasPipeStep
    dataset:
        type: Cifar10
    hpo:
        type: PBAHpo
        each_epochs: 3
        config_count: 16
        total_rungs: 200
        transformers:
            Cutout: True
            Rotate: True
            Translate_X: True
            Translate_Y: True
            Brightness: True
            Color: True
            Invert: True
            Sharpness: True
            Posterize: True
            Shear_X: True
            Solarize: True
            Shear_Y: True
            Equalize: True
            AutoContrast: True
            Contrast: True
    trainer:
        type: Trainer
    evaluator:
        type: Evaluator
        gpu_evaluator:
            type: GpuEvaluator
            metric:
                type: accuracy
Full training is used to train network models. The configuration items are as follows:
Configuration Item | Description |
pipe_step | The value is fixed at FullyTrainPipeStep. |
models_folder | Directory that contains the description files of the models to be trained. Files in this directory are named model_desc_<ID>.json, where ID is a number, and these models are trained in parallel. This option is mutually exclusive with the model option and takes precedence over it. |
trainer | Trainer configuration information. For details, see the Trainer Configuration. |
dataset | Dataset configuration. For details, see the Dataset Configuration. |
model | Model information. For details, see the Trainer Configuration. |
model_desc_file | The model description file |
my_fully_train:
    pipe_step:
        type: FullyTrainPipeStep
        # models_folder: ~
    trainer:
        type: Trainer
    model:
        model_desc_file: "/models/model_desc.json"
    dataset:
        type: Cifar10
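The naming rule for models_folder (model_desc_<ID>.json, with a numeric ID) can be illustrated with a short sketch. The helper `find_model_desc_files` is hypothetical and only demonstrates the file-name pattern; it does not reproduce how Vega itself scans the folder.

```python
import os
import re
import tempfile

MODEL_DESC_PATTERN = re.compile(r"^model_desc_(\d+)\.json$")


def find_model_desc_files(models_folder):
    """Return {ID: path} for every file named model_desc_<ID>.json in the folder."""
    found = {}
    for name in os.listdir(models_folder):
        match = MODEL_DESC_PATTERN.match(name)
        if match:
            found[int(match.group(1))] = os.path.join(models_folder, name)
    return dict(sorted(found.items()))


# Demonstration with a temporary folder holding two valid files and one stray file.
folder = tempfile.mkdtemp()
for name in ("model_desc_0.json", "model_desc_12.json", "notes.txt"):
    with open(os.path.join(folder, name), "w") as handle:
        handle.write("{}")

print(list(find_model_desc_files(folder)))
```

Each entry found this way corresponds to one model that would be trained in parallel.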
In each of the preceding pipeline steps, the configuration item trainer is provided. You can configure the basic trainer and extended trainer. The basic configuration information of the trainer is as follows:
Configuration Item | Description |
type | Trainer, or algorithm extension trainer. For details, see the related algorithm document. |
epochs | Total epochs |
optim | Optimizers and Parameters |
lr_scheduler | lr scheduler and parameters |
loss | Loss and Parameters |
metric | Metrics and Parameters |
distributed | Whether to enable Horovod for full training. When Horovod is enabled, the trainer uses all computing resources in the Horovod cluster to train the specified network model. To use Horovod, you must set the model option. |
model_desc | Model description. This parameter is mutually exclusive with model_desc_file, and model_desc_file takes precedence over model_desc. In addition, the shuffle parameter of the dataset must be set to False. |
model_desc_file | File that contains the model description. This parameter is mutually exclusive with model_desc and takes precedence over it. |
hps_file | Hyper-parameter file |
pretrained_model_file | Pre-trained model file |
The following is an example of loading the Torchvision model for training:
trainer:
    type: Trainer
    epochs: 160
    optim:
        type: Adam
        lr: 0.1
    lr_scheduler:
        type: MultiStepLR
        milestones: [75, 150]
        gamma: 0.5
    metric:
        type: accuracy
    loss:
        type: CrossEntropyLoss
    distributed: False
dataset:
    type: Imagenet
model:
    model_desc:
        modules: ['backbone', 'head']
        backbone:
            ResNetVariant:
                base_depth: [18, 34, 50, 101]
                base_channel: [32, 48, 56, 64]
                doublechannel: [3, 4]
                downsample: [3, 4]
        head:
            LinearClassificationHead:
                num_classes: [10]
    # model_desc_file: ~
    # hps_file: ~
    # pretrained_model_file: ~
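For reference, the lr_scheduler settings in this example (MultiStepLR with milestones [75, 150] and gamma 0.5, starting from lr 0.1) imply a step-wise learning rate. The sketch below reproduces the rule used by PyTorch's MultiStepLR, where the rate is multiplied by gamma each time a milestone epoch is reached. It is a plain-Python illustration, and the function name `multistep_lr` is made up for this example.

```python
from bisect import bisect_right


def multistep_lr(base_lr, milestones, gamma, epoch):
    """Learning rate at a given epoch: base_lr * gamma ** (milestones reached so far)."""
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


# The rate halves at epoch 75 and again at epoch 150.
for epoch in (0, 74, 75, 149, 150, 159):
    print(epoch, multistep_lr(0.1, [75, 150], 0.5, epoch))
```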
As shown in the preceding example, in addition to the models defined by Vega, you can also load TorchVision models. The following models are supported. For details, see the official description.
Module | Option |
torch_vision_model | vgg11, vgg13, vgg16, vgg19, vgg11_bn, vgg13_bn, vgg16_bn, vgg19_bn, squeezenet1_0, squeezenet1_1, shufflenetv2_x0.5, shufflenetv2_x1.0, resnet18, resnet34, resnet50, resnet101, resnet152, resnext50_32x4d, resnext101_32x8d, wide_resnet50_2, wide_resnet101_2, mobilenet_v2, mnasnet0_5, mnasnet1_0, inception_v3_google, googlenet, densenet121, densenet169, densenet201, densenet161, alexnet, fasterrcnn_resnet50_fpn, fasterrcnn_resnet50_fpn_coco, keypointrcnn_resnet50_fpn_coco, maskrcnn_resnet50_fpn_coco, fcn_resnet101_coco, deeplabv3_resnet101_coco, r3d_18, mc3_18, r2plus1d_18 |
Each pipeline involves dataset configuration. Vega classifies datasets into three types: train, val, and test. The three types of datasets can be configured independently. In addition, transform can be configured in Dataset. The following is a configuration example of the Cifar10 dataset:
dataset:
    type: Cifar10
    common:
        data_path: ~        # Path where the dataset is located.
        batch_size: 256
        num_workers: 4
        imgs_per_gpu: 1
        train_portion: 0.5
        shuffle: false
        distributed: false
    train:
        transforms:
            - type: RandomCrop
              size: 32
              padding: 4
            - type: RandomHorizontalFlip
            - type: ToTensor
            - type: Normalize
              mean:
                  - 0.49139968
                  - 0.48215827
                  - 0.44653124
              std:
                  - 0.24703233
                  - 0.24348505
                  - 0.26158768
    val:
        transforms:
            - type: ToTensor
            - type: Normalize
              mean:
                  - 0.49139968
                  - 0.48215827
                  - 0.44653124
              std:
                  - 0.24703233
                  - 0.24348505
                  - 0.26158768
    test:
        transforms:
            - type: ToTensor
            - type: Normalize
              mean:
                  - 0.49139968
                  - 0.48215827
                  - 0.44653124
              std:
                  - 0.24703233
                  - 0.24348505
                  - 0.26158768
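The six numbers listed under each Normalize entry are the per-channel mean and standard deviation commonly used for CIFAR-10 (three values each, for the R, G, and B channels). Assuming Normalize behaves like the torchvision transform of the same name, each channel value x in [0, 1] becomes (x - mean) / std. A dependency-free sketch of that arithmetic:

```python
MEAN = (0.49139968, 0.48215827, 0.44653124)
STD = (0.24703233, 0.24348505, 0.26158768)


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return tuple((value - m) / s for value, m, s in zip(rgb, mean, std))


# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel(MEAN))
# A pure white pixel maps to positive values in every channel.
print(normalize_pixel((1.0, 1.0, 1.0)))
```

ToTensor appears before Normalize in the transform list so that the values are in the [0, 1] range the statistics refer to.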
Vega provides the following common data sets:
Name | Description | Data Source |
Cifar10 | The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images | download |
Cifar100 | The CIFAR-100 is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class | download |
Mnist | The MNIST database of handwritten digits has a training set of 60,000 examples, and a test set of 10,000 examples | download |
COCO | The COCO is a large-scale object detection, segmentation, and captioning dataset, about 123K images and 886K instances | download |
Div2K | Div2K is a super-resolution architecture search database, containing 800 training images and 100 validation images | download |
Imagenet | The ImageNet is an image database organized according to the WordNet hierarchy, in which each node of the hierarchy is depicted by hundreds and thousands of images | download |
Fmnist | Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. | download |
Cityscapes | Cityscapes is a large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high-quality pixel-level annotations of 5,000 frames in addition to a larger set of 20,000 weakly annotated frames. | download |
Cifar10TF | The CIFAR-10-bin dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images | download |
Div2kUnpair | DIV2K dataset: DIVerse 2K resolution high quality images as used for the challenges @ NTIRE (CVPR 2017 and CVPR 2018) and @ PIRM (ECCV 2018) | download |
Cifar10 Default Configuration
data_path: ~ # the path of the dataset, default is None, MUST be set to a correct dataset PATH, such as /datasets/cifar10
batch_size: 256 # batch size
num_workers: 4 # the worker number to load the data
shuffle: false # if True, will shuffle, defaults to False
distributed: false # whether to use distributed train
train_portion: 0.5 # the ratio of the train data split from the initial train data
Cifar100 Default Configuration
data_path: ~ # the path of the dataset, default is None, MUST be set to a correct dataset PATH, such as /datasets/cifar100
batch_size: 1 # batch size
num_workers: 4 # the worker number to load the data
shuffle: true # if True, will shuffle, defaults to False
distributed: false # whether to use distributed train
Cityscapes Default Configuration
root_path: ~ # the path of the dataset, default is None, MUST be set to a correct dataset PATH, such as /datasets/Cityscapes
list_path: 'img_gt_train.txt' # the name of the txt file
batch_size: 1 # batch size
mean: 128 # the mean parameter for the transform
ignore_label: 255 # the label to ignore
scale: True # if true, the aspect ratio is kept during the transform
mirrow: True # whether to mirror the image in the transform
rotation: 90 # the rotation value
crop: 321 # the crop size
num_workers: 4 # the worker number to load the data
shuffle: False # if True, will shuffle, defaults to False
distributed: True # whether to use distributed train
id_to_trainid: False # whether to map the original ids to continuous ids; if true, a mapping dict must be provided
DIV2K Default Configuration
root_HR: ~ # the path of the dataset, default is None, MUST be set to a correct dataset PATH, such as /datasets/DIV2K/div2k_train/hr
root_LR: ~ # the path of the dataset, default is None, MUST be set to a correct dataset PATH, such as /datasets/DIV2k/div2k_train/lr
batch_size: 1 # batch size
num_workers: 4 # the worker number to load the data
upscale: 2 # the upscale factor for super resolution
subfile: !!null # whether to use subfile, None by default
crop: !!null # the crop size, None by default
shuffle: false # if True, will shuffle, defaults to False
hflip: false # whether to use horizontal flip
vflip: false # whether to use vertical flip
rot90: false # whether to use rotation
distributed: True # whether to use distributed train
FashionMnist Default Configuration
data_path: ~ # the path of the dataset, default is None, MUST be set to a correct dataset PATH, such as /datasets/fmnist
batch_size: 1 # batch size
num_workers: 4 # the worker number to load the data
shuffle: true # if True, will shuffle, defaults to False
distributed: false # whether to use distributed train
Imagenet Default Configuration
data_path: ~ # the path of the dataset, default is None, MUST be set to a correct dataset PATH, such as /datasets/ImageNet
batch_size: 1 # batch size
num_workers: 4 # the worker number to load the data
shuffle: true # if True, will shuffle, defaults to False
distributed: false # whether to use distributed train
Mnist Default Configuration
data_path: ~ # the path of the dataset, default is None, MUST be set to a correct dataset PATH, such as /datasets/mnist
batch_size: 1 # batch size
num_workers: 4 # the worker number to load the data
shuffle: true # if True, will shuffle, defaults to False
distributed: false # whether to use distributed train
Currently, the following transforms are supported:
Transform | Input | Output |
AutoContrast | level img | img |
BboxTransform | bboxes img_shape scale_factor | bboxes |
Brightness | level img | img |
Color | level img | img |
Contrast | level img | img |
Cutout | length img | img |
Equalize | level img | img |
ImageTransform | scale img | img img_shape pad_shape scale_factor |
Invert | level img | img |
MaskTransform | masks pad_shape scale_factor | padded_masks |
Numpy2Tensor | numpy | tensor |
Posterize | level img | img |
RandomCrop_pair | crop upscale img label | img label |
RandomHorizontalFlip_pair | img label | img label |
RandomMirrow_pair | img label | img label |
RandomRotate90_pair | img label | img label |
RandomVerticallFlip_pair | img label | img label |
Rotate | level img | img |
SegMapTransform | scale img | img |
Sharpness | level img | img |
Shear_X | level img | img |
Shear_Y | level img | img |
Solarize | level img | img |
ToPILImage_pair | img1 img2 | img1 img2 |
ToTensor_pair | img1 img2 | tensor1 tensor2 |
Translate_X | level img | img |
Translate_Y | level img | img |
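The transforms whose names end in _pair take two inputs (an image and its label, or two images) and return both. For the random ones, the natural reading is that the same random decision is applied to both inputs so that they stay aligned. A minimal sketch of that idea, using nested lists in place of real images; the function below is illustrative and is not Vega's RandomHorizontalFlip_pair implementation:

```python
import random


def random_horizontal_flip_pair(img, label, p=0.5, rng=random):
    """Flip img and label left-to-right together with probability p."""
    if rng.random() < p:
        img = [row[::-1] for row in img]
        label = [row[::-1] for row in label]
    return img, label


img = [[1, 2, 3], [4, 5, 6]]
label = [[0, 0, 1], [1, 0, 0]]
# With p=1.0 the flip always happens, and both inputs are mirrored consistently.
print(random_horizontal_flip_pair(img, label, p=1.0))
```

Drawing the random number once per call, rather than once per input, is what keeps the image and its label consistent.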