Azure MLOps RandomForestRegressor

Jorge Trivilin.

Welcome to the Azure MLOps Random Forest Regressor project! This repository showcases a complete MLOps (Machine Learning Operations) pipeline built on Azure Machine Learning. It covers data management, model training, evaluation, and deployment pipelines for both batch inference and real-time online endpoints.

Project Overview

This project demonstrates how to implement MLOps best practices using Azure Machine Learning. It includes:

- Data Management: Organizing and registering datasets with Azure ML.
- Model Training: Training a Random Forest Regressor model using a defined pipeline.
- Model Evaluation: Assessing model performance and registering the best model.
- Deployment Pipelines: Deploying models for batch inference and creating real-time online endpoints.
- Infrastructure as Code: Managing Azure resources using Terraform.
- CI/CD Pipelines: Automating workflows with GitHub Actions.

Project Structure

azure-mlops-random-regressor/
│
├── .github/                           # GitHub Actions workflows
│   └── workflows/
│       ├── codeql.yml                 # Code quality and security analysis
│       ├── deploy-batch-endpoint-pipeline.yml  # Batch deployment workflow
│       ├── deploy-model-training-pipeline.yml   # Training pipeline workflow
│       ├── deploy-online-endpoint-pipeline.yml  # Online deployment workflow
│       └── tf-gha-deploy-infra.yml               # Infrastructure deployment workflow
├── data/                              # Datasets for training and inference
│   ├── taxi-batch.csv                 # Batch inference data
│   ├── taxi-data.csv                  # Training data
│   └── taxi-request.json              # Sample request for online inference
├── data-science/                      # Source code for data science workflows
│   ├── src/                           # Python scripts
│   │   ├── evaluate.py                # Model evaluation script
│   │   ├── prep.py                    # Data preparation script
│   │   ├── register.py                # Model registration script
│   │   └── train.py                   # Model training script
│   └── environment/                   # Environment definitions
│       └── train-conda.yml            # Conda environment for training
├── infrastructure/                    # Infrastructure as Code (Terraform)
│   ├── modules/                       # Terraform modules
│   │   ├── aml-workspace/             # Azure ML Workspace module
│   │   ├── application-insights/      # Application Insights module
│   │   ├── container-registry/        # Container Registry module
│   │   ├── data-explorer/             # Data Explorer module
│   │   ├── key-vault/                 # Key Vault module
│   │   ├── resource-group/            # Resource Group module
│   │   └── storage-account/           # Storage Account module
│   ├── aml_deploy.tf                  # Azure ML deployment configuration
│   ├── locals.tf                      # Local variables for Terraform
│   ├── main.tf                        # Main Terraform configuration
│   └── variables.tf                   # Variable definitions for Terraform
├── mlops/                             # MLOps pipelines and configurations
│   └── azureml/
│       ├── train/                     # Training pipeline configurations
│       └── deploy/                    # Deployment pipeline configurations
│           ├── batch/                 # Batch deployment configurations
│           │   ├── batch-deployment.yml
│           │   └── batch-endpoint.yml
│           └── online/                # Online deployment configurations
│               ├── online-deployment.yml
│               └── online-endpoint.yml
├── devops-pipelines/                  # DevOps pipeline definitions
├── config-infra-dev.yml               # Development infrastructure config
├── config-infra-prod.yml              # Production infrastructure config
├── deploy_batch.sh                    # Script to deploy batch endpoint
├── deploy_endpoint.sh                 # Script to deploy online endpoint
├── environment.yml                    # Global environment configuration
├── requirements.txt                   # Python dependencies
├── run_training.sh                    # Script to run the training pipeline
├── terraform.sh                       # Script to apply Terraform configurations
└── README.md                          # Project documentation

Prerequisites

Before setting up the project, ensure you have the following:

- Azure Account: An active Azure subscription.
- Azure CLI: Installed on your local machine.
- GitHub Account: With GitHub Actions enabled.
- Terraform: Installed, if you manage the infrastructure as code.
- Python: Version 3.8 or later.
- Git: Installed for version control.

Setup

Fork the Repository:

Fork this repository to your own GitHub account.

Clone the Repository:

```
git clone https://github.com/your-username/azure-mlops-random-regressor.git
cd azure-mlops-random-regressor
```

Configure GitHub Secrets:

In your GitHub repository, navigate to Settings > Secrets and variables > Actions > New repository secret and add the following secrets:
- AZURE_CREDENTIALS: Azure Service Principal credentials in JSON format.
- AZURE_SUBSCRIPTION: Your Azure subscription ID.
- ARM_CLIENT_ID: Azure Service Principal client ID.
- ARM_CLIENT_SECRET: Azure Service Principal client secret.
- ARM_SUBSCRIPTION_ID: Your Azure subscription ID.
- ARM_TENANT_ID: Your Azure tenant ID.
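
The AZURE_CREDENTIALS secret is a JSON document, so it can help to sanity-check it before pasting it into GitHub. The following is a minimal sketch, assuming the key names of the service-principal JSON commonly used with the azure/login GitHub Action (clientId, clientSecret, subscriptionId, tenantId). The helper name and the sample values are hypothetical and not part of this repository.

```python
import json

# Keys assumed to be required, based on the service-principal JSON
# commonly used with the azure/login GitHub Action.
REQUIRED_KEYS = ("clientId", "clientSecret", "subscriptionId", "tenantId")


def missing_credential_keys(raw_json: str) -> list:
    """Return the required keys that are absent or empty in the JSON text."""
    creds = json.loads(raw_json)
    return [key for key in REQUIRED_KEYS if not creds.get(key)]


# Placeholder values only; never commit real credentials.
sample = json.dumps({
    "clientId": "00000000-0000-0000-0000-000000000000",
    "clientSecret": "placeholder",
    "subscriptionId": "00000000-0000-0000-0000-000000000000",
})
print(missing_credential_keys(sample))
```

An empty list means every assumed key is present; anything else names the fields still to be filled in.
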
Modify Configuration Files:

Update config-infra-dev.yml and config-infra-prod.yml with your specific environment settings as needed.

Pipelines

Training Pipeline

The training pipeline automates the process of data preparation, model training, evaluation, and registration.

Pipeline Configuration: mlops/azureml/train/pipeline.yml
Steps:
1. Data Preparation: Cleans and preprocesses the data.
2. Model Training: Trains a Random Forest Regressor model.
3. Model Evaluation: Evaluates model performance.
4. Model Registration: Registers the model in Azure ML if it meets performance criteria.
Run the Training Pipeline:
```
./run_training.sh
```
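
The evaluation and registration steps above can be sketched in plain Python. This README does not state which metrics or thresholds the project uses, so the metric choice (MAE, RMSE, R²), the `should_register` rule, and the `min_r2` parameter are all assumptions for illustration, not the logic of evaluate.py or register.py.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    # R^2 is undefined when the targets are constant; fall back to 0.0.
    r2 = 1.0 - ss_res / ss_tot if ss_tot else 0.0
    return {"mae": mae, "rmse": math.sqrt(mse), "r2": r2}


def should_register(candidate, baseline=None, min_r2=0.0):
    """Hypothetical gate: R^2 must clear a floor and beat any baseline."""
    if candidate["r2"] < min_r2:
        return False
    return baseline is None or candidate["r2"] > baseline["r2"]


# Toy numbers, not results from this project.
metrics = regression_metrics([10.0, 12.0, 14.0], [11.0, 12.0, 13.0])
print(metrics, should_register(metrics, min_r2=0.5))
```

Comparing against the currently registered model (the `baseline` argument) is one common way to define "meets performance criteria"; a fixed threshold alone would also work.
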

Batch Deployment Pipeline

Deploys the trained model for batch inference.

Pipeline Configuration: mlops/azureml/deploy/batch/pipeline.yml
Deploy the Batch Endpoint:
```
./deploy_batch.sh
```

Online Deployment Pipeline

Deploys the trained model as an online endpoint for real-time inference.

Pipeline Configuration: mlops/azureml/deploy/online/pipeline.yml
Deploy the Online Endpoint:
```
./deploy_endpoint.sh
```
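
Once the online endpoint is deployed, it can be called over HTTPS with a JSON body such as the one in data/taxi-request.json. The sketch below only builds the request and does not send it. It assumes key-based authentication, where the endpoint key is passed as a bearer token. The scoring URI, the key, and the payload shape are placeholders, and the function name is hypothetical.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, api_key, payload):
    """Build (but do not send) a POST request for an online endpoint."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: key-based auth, with the key sent as a bearer token.
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(
        scoring_uri, data=body, headers=headers, method="POST"
    )


# Hypothetical URI, key and payload; in practice the payload would be
# loaded from data/taxi-request.json with json.load().
request = build_scoring_request(
    "https://example.invalid/score",
    "placeholder-key",
    {"input_data": [[1.0, 2.0]]},
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request).read()
```
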

Infrastructure as Code

Azure infrastructure is managed using Terraform to ensure reproducibility and scalability.

Configuration Files: Located in the infrastructure/ directory.
Apply Infrastructure Configurations:
```
./terraform.sh
```
This script initializes Terraform, applies the configurations, and provisions the necessary Azure resources.
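
The contents of terraform.sh are not shown here, so the following is only a sketch of the equivalent init-then-apply sequence. It assumes the Terraform CLI is on the PATH and that the configuration lives in the infrastructure/ directory; the function names and the `dry_run` flag are hypothetical.

```python
import subprocess


def terraform_commands(workdir="infrastructure", auto_approve=False):
    """Return the init and apply commands as argument lists."""
    base = ["terraform", f"-chdir={workdir}"]
    apply_cmd = base + ["apply"]
    if auto_approve:
        # Skips the interactive confirmation prompt.
        apply_cmd.append("-auto-approve")
    return [base + ["init"], apply_cmd]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Dry run by default, so nothing is provisioned when this is executed.
run(terraform_commands())
```

Keeping the dry run as the default makes it possible to review the commands before any Azure resources are created.
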

Continuous Integration/Continuous Deployment (CI/CD)

Automate workflows and deployments using GitHub Actions.

Workflow Files: Located in .github/workflows/
- tf-gha-deploy-infra.yml: Deploys Azure infrastructure using Terraform.
- deploy-model-training-pipeline.yml: Executes the training pipeline.
- deploy-online-endpoint-pipeline.yml: Deploys the online endpoint.
- deploy-batch-endpoint-pipeline.yml: Deploys the batch endpoint.
- codeql.yml: Performs code quality and security analysis using CodeQL.
Triggering Workflows:

Workflows run automatically on events such as pushes and pull requests, and can also be started manually.

Data

Manage and register datasets with Azure ML for consistent access across pipelines.

Directory: data/
- taxi-batch.csv: Data used for batch inference.
- taxi-data.csv: Primary training data.
- taxi-request.json: Sample request payload for online inference.
Registering Data with Azure ML:

Ensure datasets are properly registered and referenced in the training pipeline configuration (data.yml).

Model Development

Develop and manage model code within the data-science/ directory.

Source Code: data-science/src/
- prep.py: Prepares and cleans the data.
- train.py: Trains the Random Forest Regressor model.
- evaluate.py: Evaluates model performance.
- register.py: Registers the trained model with Azure ML.
Environment Configuration: data-science/environment/train-conda.yml

Defines the Python environment required for training, including dependencies and versions.

Environment

Set up the Python environment for development and training.

Create Conda Environment:

```
conda env create -f data-science/environment/train-conda.yml
```

Activate the Environment:
```
conda activate <environment-name>
```
Install Additional Dependencies:
```
pip install -r requirements.txt
```

Contributing

Contributions are welcome! Follow these steps to contribute:

Fork the Repository:

Click the "Fork" button at the top right of this page to create your own fork.

Clone Your Fork:

```
git clone https://github.com/your-username/azure-mlops-random-regressor.git
cd azure-mlops-random-regressor
```

Create a New Branch:

```
git checkout -b feature/your-feature-name
```

Make Your Changes:

Implement your feature or bug fix.

Commit Your Changes:

```
# Stage the changes first; git commit only records what has been staged.
git add -A
git commit -m "Description of your changes"
```

Push to Your Fork:

```
git push origin feature/your-feature-name
```

Open a Pull Request:

Navigate to the original repository and open a pull request from your fork.

License

This project is licensed under the MIT License.
