SIMPHERA Reference Architecture for Azure

This repository contains the reference architecture of the infrastructure needed to deploy dSPACE SIMPHERA to the Azure Public Cloud. It does not contain the helm chart needed to deploy SIMPHERA itself, but only the base infrastructure such as Kubernetes, PostgreSQL, storage accounts, etc.

You can use the reference architecture as a starting point for your SIMPHERA installation if you plan to deploy SIMPHERA to Azure. You can use it as is and only have to configure a few individual values. If you have special requirements, feel free to adapt the architecture to your needs. For example, the reference architecture does not contain any kind of VPN connection to a private, on-premises network because this is highly specific. Instead, it is configured in such a way that the ingress points are available on the public internet.

Using the reference architecture you can deploy a single or even multiple instances of SIMPHERA, e.g. one for production and one for testing.

Terraform

This reference architecture is provided as a Terraform configuration. Terraform is an open-source command line tool to automatically create and manage cloud resources. A Terraform configuration consists of various .tf text files. These files contain the specifications of the resources to be created in the cloud infrastructure. That is why this approach is called infrastructure-as-code. Its main advantage is reproducibility because the configuration can be maintained in a source control system such as Git.

Variables

Terraform uses variables to make the specification configurable. The concrete values for these variables are specified in .tfvars files. So it is the task of the administrator to fill the .tfvars files with the correct values. This is explained in more detail in a later chapter.

State

Terraform has the concept of a state. On the one hand, there are the resource specifications in the .tf files. On the other hand, there are the resources in the cloud infrastructure that are created based on these files. Terraform needs to store the mapping between each element of the specification and the corresponding resource in the cloud infrastructure. This mapping is called the state. In general, you could store the state on your local hard drive. But that is not a good idea because nobody else could then change settings and apply these changes. Therefore, the state itself should be stored in the cloud.

So you need to manually create a storage account in Azure before you can start using Terraform. This is explained in more detail in the section Prerequisites.

Overview

As mentioned before, the reference architecture is defined as a Terraform configuration. It has been tested with Terraform version v1.0.0.

The following figure shows the main resources of the architecture. It shows the configuration recommended by dSPACE, which uses Azure Database for PostgreSQL together with Private Link. Using Private Link requires the general purpose tier for the PostgreSQL server. The reference architecture also supports the basic tier. In this case, firewall rules that only allow access from within the Kubernetes cluster are used instead of Private Link.

[Figure: SIMPHERA Reference Architecture for Azure]

Prerequisites

Before you start you need an Azure subscription and the contributor role to create the resources needed for SIMPHERA. Additionally, you need to create the following resources that are not part of this Terraform configuration:

  • Storage Account: A storage account with Performance set to standard and account kind set to StorageV2 (general purpose v2) is needed to store the Terraform state. You also have to create a container for the state inside the storage account.
  • KeyVault: The credentials of the PostgreSQL servers and the keys to encrypt the disks of the virtual machine for the license server must be stored in an Azure KeyVault. The KeyVault is not managed by Terraform and has to be created manually (see Azure KeyVault section).
  • Log Analytics Workspace (optional): In order to store the log data of the services you have to provide such a workspace inside your subscription.
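The storage account and the container for the Terraform state can, for example, be created with the Azure CLI. The following is only a minimal sketch: the function name and all resource names are hypothetical placeholders, and Standard_LRS is an assumed redundancy choice (the requirement above only asks for standard performance and StorageV2).

```shell
# Hypothetical helper: creates the resource group, the storage account and
# the container that hold the Terraform state. All names are placeholders.
create_state_storage() {
  rg="$1"; account="$2"; container="$3"; location="$4"

  az group create --name "$rg" --location "$location" &&
  # Standard performance, general purpose v2. LRS redundancy is an assumption.
  az storage account create \
    --name "$account" \
    --resource-group "$rg" \
    --location "$location" \
    --sku Standard_LRS \
    --kind StorageV2 &&
  az storage container create \
    --name "$container" \
    --account-name "$account"
}

# Example call with placeholder values:
# create_state_storage "terraform-state-rg" "mytfstate123" "tfstate" "westeurope"
```

The three calls are chained with `&&` so that the container is only created if the storage account exists.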

On your administration PC you need to install the Terraform command, the Azure CLI, and ssh-keygen, which is typically available on most operating systems.

Authentication

To log in to Azure, use:

az login

To switch to the correct subscription you can use the following command:

az account set --subscription "My Subscription"
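Before deploying, it can be worth checking that the Azure CLI really targets the intended subscription. A small sketch of such a check follows; the function names are hypothetical and the subscription name is a placeholder.

```shell
# Prints the name of the subscription the Azure CLI currently targets.
current_subscription() {
  az account show --query name --output tsv
}

# Succeeds only if the active subscription has the expected name.
check_subscription() {
  [ "$(current_subscription)" = "$1" ]
}

# Example call with a placeholder name:
# check_subscription "My Subscription" || echo "Wrong subscription selected" >&2
```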

Clone Repository

If you have not already cloned this Git repository, please clone it now to your local administration PC.

SSH Keys

To be able to connect to the Kubernetes nodes using ssh, you need an SSH key pair. Create the keys by executing the following command in the root folder of the repository:

# bash
ssh-keygen -t rsa -b 2048 -f shared-ssh-key/ssh -q -N ""

# Powershell
ssh-keygen -t rsa -b 2048 -f shared-ssh-key/ssh -q -N """"

How to get secrets and keys

To get a list of all PostgreSQL passwords, run the following PowerShell commands:

$secretnames = terraform output -json secretnames | ConvertFrom-Json
$keyvaultname = terraform output -json key_vault_name | ConvertFrom-Json
$postgresql_passwords = @{}
foreach($prop in $secretnames.PsObject.Properties)
{
    $secret = az keyvault secret show --name $prop.value --vault-name $keyvaultname | ConvertFrom-Json
    $value = $secret.value | ConvertFrom-Json
    $postgresql_passwords[$prop.name] = ConvertTo-SecureString $value.postgresql_password -AsPlainText -Force
    Write-Host "The value of $($prop.value) secret for $($prop.name) instance is $value"
    Remove-Variable secret
    Remove-Variable value
}

To get a list of all storage account keys, run the following PowerShell commands:

$access_keys = @{}
$storageaccounts = terraform output -json minio_storage_usernames | ConvertFrom-Json
foreach($prop in $storageaccounts.PsObject.Properties)
{
  $keys = az storage account keys list -n $prop.value | ConvertFrom-Json
  $access_keys[$prop.name] = ConvertTo-SecureString $keys[0].value -AsPlainText -Force
  Write-Host "The value of $($prop.value) key for $($prop.name) instance is $(ConvertFrom-SecureString $access_keys[$prop.name] -AsPlainText)"
  Remove-Variable keys
}
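If you work in a Unix shell instead of PowerShell, the same information can be read with the `--query` option of the Azure CLI. This is only a sketch with hypothetical function names. As in the PowerShell variant above, the Key Vault secret value is expected to be a JSON document that contains the field postgresql_password.

```shell
# Prints the first access key of the given storage account.
storage_account_key() {
  az storage account keys list --account-name "$1" --query "[0].value" --output tsv
}

# Prints the raw value of a Key Vault secret (vault name, then secret name).
keyvault_secret() {
  az keyvault secret show --vault-name "$1" --name "$2" --query value --output tsv
}

# Example usage (secret and account names are placeholders):
# vault="$(terraform output -raw key_vault_name)"
# keyvault_secret "$vault" "<secret name from 'terraform output secretnames'>"
# storage_account_key "<name from 'terraform output minio_storage_usernames'>"
```

Both functions print secrets to the terminal, so avoid running them in shared or logged sessions.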

Log Analytics Workspace

As mentioned before, you have to provide a Log Analytics workspace in your subscription in order to store the log data of the services.

To create a Log Analytics workspace, use:

az monitor log-analytics workspace create --workspace-name "<LogAnalyticsWorkspaceName>" --resource-group "<LogAnalyticsWorkspaceResourceGroup>" --location "<Location>"
  • LogAnalyticsWorkspaceName - Name of the Log Analytics Workspace
  • LogAnalyticsWorkspaceResourceGroup - Name of the Log Analytics Workspace resource group
  • Location - Location of the Log Analytics Workspace, e.g. westeurope

State

As mentioned before, Terraform stores the state of the resources it creates in a container of an Azure storage account. Therefore, you need to specify this location.

To do so, please make a copy of the file state-backend-template, name it state-backend.tf and open the file in a text editor. The values have to point to an existing storage account to be used to store the Terraform state:

  • resource_group_name: The name of the resource group your storage account is located in.
  • storage_account_name: The name of the storage account.
  • container_name: The name of the container inside the storage account that stores the Terraform state. You need to create this container manually.
  • key: The name of the file inside the container that holds this Terraform state.
  • environment: Use the value public for the general Azure cloud.
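To illustrate how these values fit together, the sketch below prints a backend block in the general syntax of the Terraform azurerm backend. The helper name and the example values are hypothetical, and the state-backend-template file shipped with the repository remains the authoritative starting point.

```shell
# Hypothetical helper: prints an azurerm backend block for the given
# resource group, storage account, container and state file name.
print_state_backend() {
  cat <<EOF
terraform {
  backend "azurerm" {
    resource_group_name  = "$1"
    storage_account_name = "$2"
    container_name       = "$3"
    key                  = "$4"
    environment          = "public"
  }
}
EOF
}

# Example call with placeholder values:
# print_state_backend "terraform-state-rg" "mytfstate123" "tfstate" "simphera.tfstate" > state-backend.tf
```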

Configuration

For your configuration, make a copy of the file terraform.tfvars.example, name it terraform.tfvars, and open the file in a text editor. This file contains all configurable variables together with their documentation. Adapt the values before you deploy the resources. A list describing all mandatory and optional variables can be found in the Inputs section of this README.

It is recommended to restrict access to the Kubernetes API server to authorized IP address ranges by setting the variable apiServerAuthorizedIpRanges, and to restrict access to the Key Vault in the same way by setting the variable keyVaultAuthorizedIpRanges.
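As a rough orientation, a filled-in variable file could look like the output of the sketch below. The variable names are taken from the Inputs section of this README. All values are illustrative assumptions, in particular the PostgreSQL version, SKU name and storage size, and the helper name is hypothetical. Always check terraform.tfvars.example for the documented values and units.

```shell
# Hypothetical helper: prints an example variable file. All values are
# placeholders and must be adapted before use.
print_example_tfvars() {
  cat <<'EOF'
infrastructurename = "simphera-infra"
location           = "westeurope"

# Restrict access to the Kubernetes API server and the Key Vault.
apiServerAuthorizedIpRanges = ["198.51.100.0/24"]
keyVaultAuthorizedIpRanges  = ["198.51.100.0/24"]

simpheraInstances = {
  "production" = {
    name                        = "production"
    minioAccountReplicationType = "ZRS"       # assumed value
    postgresqlVersion           = "11"        # assumed value
    postgresqlSkuName           = "GP_Gen5_2" # assumed value
    postgresqlStorage           = 640000      # assumed value, check the unit
  }
}
EOF
}

# Example: review the output before saving it as terraform.tfvars.
# print_example_tfvars
```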

GPU Usage

If you use AURELION with SIMPHERA, the AURELION pods are executed in the GPU node pool. AURELION uses a specific OptiX version and thus needs specific NVIDIA drivers. NVIDIA provides the gpu-operator, a tool that makes it possible to use containerized drivers inside pods. This way, the required driver version can be used independently of the default NVIDIA driver installation on the GPU node pool, for which you can only choose to skip the installation but cannot select a version. Further information

Scale Down Mode

Typically, autoscaling is enabled for the GPU node pool so that VMs are scaled down when they are no longer needed. However, the AURELION container image is large, and downloading it to a Kubernetes node takes time. Depending on your location, this can take more than 30 minutes. To shorten these times, set the Scale Down Mode of the GPU node pool to Deallocate. In this mode, a GPU VM is not deleted but only deallocated, so the already downloaded image stays on its disk. You no longer pay for the compute resources but only for the disk, which is not deleted in this mode.

You can enable and disable this mode using the variables linuxExecutionNodeDeallocate and gpuNodeDeallocate. This means you can configure the mode not only for the GPU node pool but also for the execution node pool. By default, Deallocate is used for both node pools.
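After the deployment you can check which mode a node pool actually uses. The following sketch queries the node pool with the Azure CLI. The function name is hypothetical, and the resource group, cluster and node pool names are placeholders that depend on your deployment.

```shell
# Prints the scale-down mode (Delete or Deallocate) of one AKS node pool.
# Arguments: resource group, cluster name, node pool name.
node_pool_scale_down_mode() {
  az aks nodepool show \
    --resource-group "$1" \
    --cluster-name "$2" \
    --name "$3" \
    --query scaleDownMode --output tsv
}

# Example call with placeholder names:
# node_pool_scale_down_mode "<cluster resource group>" "<cluster name>" "<node pool name>"
```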

Deployment

Before you can deploy the resources to Azure you have to initialize Terraform:

terraform init

Afterwards you can deploy the resources:

terraform apply

Terraform automatically loads the variables from your terraform.tfvars variable definition file.
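If you prefer to review the changes before they are applied, you can save a plan first and apply exactly that plan afterwards. This is an optional variation of the steps above, shown as a sketch; the function name and the plan file name are hypothetical.

```shell
# Initializes Terraform, writes a plan file, and applies exactly that plan.
# Optional argument: name of the plan file (placeholder default).
deploy_infrastructure() {
  planfile="${1:-simphera.tfplan}"
  terraform init &&
  terraform plan -out="$planfile" &&
  terraform apply "$planfile"
}

# Example call:
# deploy_infrastructure
```

In an interactive session you would usually run the three commands one by one and inspect the plan output before applying it.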

MinIO Storage

For each configured SIMPHERA instance, an individual Azure storage account is created to store binary artifacts. The name of the storage account is the concatenation of infrastructurename and instancename with all hyphens removed, clipped to a maximum of 24 characters. Open the Azure Portal and navigate to the storage account, which is located in the resource group <instancename>-storage. Later, when you configure the SIMPHERA Helm chart, you need the name of this storage account and one of its access keys, which is also available in the portal.
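The naming rule can be reproduced in a shell to predict the storage account name. This sketch only mirrors the rule as described above; the function name is hypothetical, and the Terraform configuration itself remains authoritative.

```shell
# Joins infrastructure name and instance name, removes all hyphens,
# and keeps at most the first 24 characters.
storage_account_name() {
  printf '%s%s' "$1" "$2" | tr -d '-' | cut -c1-24
}

# Example calls:
# storage_account_name "simphera-infra" "production"
# storage_account_name "my-very-long-infrastructure" "production"
```

The first call prints simpherainfraproduction, the second one is clipped to myverylonginfrastructure.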

Kubernetes

This deployment contains a managed Kubernetes cluster (AKS). In order to use command line tools such as kubectl or helm you need a kubeconfig configuration file. This file will automatically be exported by Terraform under the filename <infrastructurename>.kubeconfig.
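Tools such as kubectl and helm pick up the file through the KUBECONFIG environment variable. The sketch below assumes that the file was exported to the current working directory; the function name is hypothetical.

```shell
# Points kubectl and helm to the kubeconfig exported by Terraform.
# Argument: the infrastructurename used in terraform.tfvars.
use_kubeconfig() {
  file="$PWD/$1.kubeconfig"
  if [ ! -f "$file" ]; then
    echo "kubeconfig not found: $file" >&2
    return 1
  fi
  export KUBECONFIG="$file"
}

# Example usage with a placeholder name:
# use_kubeconfig "simphera-infra" && kubectl get nodes
```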

If you want to ssh into a Kubernetes worker node you can use a command like this:

ssh -i shared-ssh-key/ssh simphera@<name-or-ip-of-node>

Please keep in mind that the nodes themselves do not get public IPs. Therefore, you may need to create a Linux jumpbox VM within your virtual network and connect to a node from there. In that case, you have to copy the private key to that machine and restrict its file permissions: chmod 600 shared-ssh-key/ssh. As an alternative, you can use the License Server Windows VM as a jumpbox.
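If your jumpbox is reachable via ssh from your administration PC, the ProxyJump option (-J) of OpenSSH is a possible alternative that keeps the private key on your local machine. This is only a sketch: the function name, the jumpbox user and the addresses are hypothetical, and the authentication against the jumpbox itself has to be configured separately because the -i option only applies to the target node.

```shell
# Connects to a Kubernetes node through a jumpbox using ProxyJump.
# Arguments: jumpbox as user@host, then name or IP of the node.
ssh_node_via_jumpbox() {
  jumpbox="$1"; node="$2"
  ssh -i shared-ssh-key/ssh -J "$jumpbox" "simphera@$node"
}

# Example call with placeholder values:
# ssh_node_via_jumpbox "azureuser@<jumpbox-address>" "<name-or-ip-of-node>"
```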

Azure Policy

This reference architecture deploys Azure Policy into the Kubernetes cluster. With Azure Policy, security policies can be defined and violations monitored. Azure provides various predefined policies. By default, the reference architecture does not assign any policies to the Kubernetes cluster. Instead, an administrator must assign policies manually, which requires appropriate permissions. The Azure built-in roles Resource Policy Contributor and Owner have these permissions. dSPACE recommends using the predefined policy "Kubernetes cluster containers should only use allowed images". To assign it, use the following PowerShell commands:

$clustername = "<cluster name>"
$resourcegroup = "<cluster resource group>"
$cluster = az aks show --name $clustername --resource-group $resourcegroup | ConvertFrom-Json
$name = "K8sAzureContainerAllowedImages@${clustername}"
$description = "Kubernetes cluster containers should only use allowed images"
$scope = $cluster.id
$policy = "febd0533-8e55-448f-b837-bd0e06f16469"
$allowedContainerImagesRegex = "^(docker\.io\/(groundnuty|jboss|eclipse-mosquitto|bitnami)|quay\.io\/oauth2-proxy|registry\.dspace\.cloud|registry\.k8s\.io)\/.+$"
$params_ = @"
{
  "allowedContainerImagesRegex": {
    "value": "$allowedContainerImagesRegex" 
  }
}
"@
$params_ = $params_ -replace '\s',''
$params = $params_ -replace '([\\]*)"', '$1$1\"'
az policy assignment create `
  --scope $scope `
  --description $description `
  --name $name `
  --policy $policy `
  --params $params
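Before assigning the policy, it can help to check which image references the regular expression accepts. The sketch below evaluates the same pattern locally with grep -E. The escaped slashes are written as plain slashes here because POSIX extended regular expressions do not need them. The function name is hypothetical, and the regular expression engine used by Azure Policy may differ in details from grep.

```shell
# Same pattern as in the policy assignment above, with '\/' written as '/'.
allowed_images_regex='^(docker\.io/(groundnuty|jboss|eclipse-mosquitto|bitnami)|quay\.io/oauth2-proxy|registry\.dspace\.cloud|registry\.k8s\.io)/.+$'

# Succeeds if the given image reference matches the allowed pattern.
image_allowed() {
  printf '%s\n' "$1" | grep -Eq "$allowed_images_regex"
}

# Example calls (image names are illustrative):
# image_allowed "docker.io/bitnami/minio:latest" && echo "allowed"
# image_allowed "docker.io/library/nginx:latest" || echo "rejected"
```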

Delete Resources

To delete all resources you have to execute the following command:

terraform destroy

Please keep in mind that this command will also delete all storage accounts including your backups. So please be careful.

Next steps

As a next step you have to deploy SIMPHERA to the Kubernetes cluster by using the SIMPHERA Quick Start helm chart. You will find detailed instructions in the README file inside the Helm chart itself.

List of tools with versions needed for SIMPHERA reference architecture deployment

| Tool name | Version |
|-----------|---------|
| Azure CLI | >=2.40.0 |
| Helm | >=3.8.0 |
| Terraform | >=1.2.9 |
| kubectl | >=1.27.0 |
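A minimum version can be checked in a shell with a small comparison helper. This sketch relies on the version sort option (-V) of sort, which is available in GNU coreutils and recent BSD variants but is not part of POSIX. The function name is hypothetical.

```shell
# Succeeds if version $1 is greater than or equal to the minimum version $2.
version_ge() {
  lowest="$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)"
  [ "$lowest" = "$2" ]
}

# Example with an illustrative installed version of 1.5.7:
# version_ge "1.5.7" "1.2.9" && echo "Terraform version is sufficient"
```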

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.0.0 |

Providers

| Name | Version |
|------|---------|
| azurerm | n/a |
| local | n/a |
| random | n/a |

Modules

| Name | Source | Version |
|------|--------|---------|
| simphera_instance | ./modules/simphera_instance | n/a |

Resources

| Name | Type |
|------|------|
| azurerm_bastion_host.bastion-host | resource |
| azurerm_key_vault.simphera-key-vault | resource |
| azurerm_key_vault_key.azure-disk-encryption | resource |
| azurerm_key_vault_secret.license-server-secret | resource |
| azurerm_kubernetes_cluster.aks | resource |
| azurerm_kubernetes_cluster_node_pool.execution-nodes | resource |
| azurerm_kubernetes_cluster_node_pool.gpu-execution-nodes | resource |
| azurerm_network_interface.license-server-nic | resource |
| azurerm_network_interface_security_group_association.ni-license-server-sga | resource |
| azurerm_network_security_group.license-server-nsg | resource |
| azurerm_private_dns_zone.keyvault-privatelink-dns-zone | resource |
| azurerm_private_dns_zone.minio-privatelink-dns-zone | resource |
| azurerm_private_dns_zone.postgresql-privatelink-dns-zone | resource |
| azurerm_private_dns_zone_virtual_network_link.keyvault-privatelink-network-link | resource |
| azurerm_private_dns_zone_virtual_network_link.minio-privatelink-network-link | resource |
| azurerm_private_dns_zone_virtual_network_link.postgresql-privatelink-network-link | resource |
| azurerm_private_endpoint.keyvault-private-endpoint | resource |
| azurerm_public_ip.bastion-pubip | resource |
| azurerm_resource_group.aks | resource |
| azurerm_resource_group.bastion | resource |
| azurerm_resource_group.keyvault | resource |
| azurerm_resource_group.license-server | resource |
| azurerm_resource_group.network | resource |
| azurerm_subnet.bastion-subnet | resource |
| azurerm_subnet.default-node-pool-subnet | resource |
| azurerm_subnet.execution-nodes-subnet | resource |
| azurerm_subnet.gpu-nodes-subnet | resource |
| azurerm_subnet.license-server-subnet | resource |
| azurerm_subnet.paas-services-subnet | resource |
| azurerm_virtual_machine_extension.azureDiskEncryption | resource |
| azurerm_virtual_machine_extension.gc | resource |
| azurerm_virtual_machine_extension.iaaSAntimalware | resource |
| azurerm_virtual_machine_extension.microsoftMonitoringAgent | resource |
| azurerm_virtual_network.simphera-vnet | resource |
| azurerm_windows_virtual_machine.license-server | resource |
| local_file.kubeconfig | resource |
| random_password.license-server-password | resource |
| azurerm_client_config.current | data source |
| azurerm_log_analytics_workspace.log-analytics-workspace | data source |
| azurerm_public_ip.aks_outgoing | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| apiServerAuthorizedIpRanges | List of authorized IP address ranges that are granted access to the Kubernetes API server, e.g. ["198.51.100.0/24"] | set(string) | null | no |
| gpuNodeCountMax | The maximum number of nodes for gpu job execution | number | 12 | no |
| gpuNodeCountMin | The minimum number of nodes for gpu job execution | number | 0 | no |
| gpuNodeDeallocate | Configures whether the nodes for the gpu job execution are 'Deallocated (Stopped)' by the cluster auto scaler or 'Deleted'. | bool | true | no |
| gpuNodePool | Specifies whether an additional node pool for gpu job execution is added to the kubernetes cluster | bool | false | no |
| gpuNodeSize | The machine size of the nodes for the gpu job execution | string | "Standard_NC16as_T4_v3" | no |
| infrastructurename | The name of the infrastructure. e.g. simphera-infra | string | n/a | yes |
| keyVaultAuthorizedIpRanges | List of authorized IP address ranges that are granted access to the Key Vault, e.g. ["198.51.100.0/24"] | set(string) | [] | no |
| keyVaultPurgeProtection | Specifies whether the Key vault purge protection is enabled. | bool | true | no |
| kubernetesVersion | The version of the AKS cluster. | string | "1.28.3" | no |
| licenseServer | Specifies whether a VM for the dSPACE Installation Manager will be deployed. | bool | false | no |
| licenseServerIaaSAntimalware | Specifies whether a IaaSAntimalware extension will be installed on license server VM. Depends on licenseServer variable. | bool | true | no |
| licenseServerMicrosoftGuestConfiguration | Specifies whether a Microsoft Guest configuration extension will be installed on license server VM. Depends on licenseServer variable. | bool | true | no |
| licenseServerMicrosoftMonitoringAgent | Specifies whether a MicrosoftMonitoringAgent extension will be installed on license server VM. Depends on licenseServer, logAnalyticsWorkspaceName and logAnalyticsWorkspaceResourceGroupName variables. | bool | true | no |
| linuxExecutionNodeCountMax | The maximum number of Linux nodes for the job execution | number | 10 | no |
| linuxExecutionNodeCountMin | The minimum number of Linux nodes for the job execution | number | 0 | no |
| linuxExecutionNodeDeallocate | Configures whether the Linux nodes for the job execution are 'Deallocated (Stopped)' by the cluster auto scaler or 'Deleted'. | bool | true | no |
| linuxExecutionNodeSize | The machine size of the Linux nodes for the job execution | string | "Standard_D16s_v4" | no |
| linuxNodeCountMax | The maximum number of Linux nodes for the regular services | number | 12 | no |
| linuxNodeCountMin | The minimum number of Linux nodes for the regular services | number | 1 | no |
| linuxNodeSize | The machine size of the Linux nodes for the regular services | string | "Standard_D4s_v4" | no |
| location | The Azure location to be used. | string | n/a | yes |
| logAnalyticsWorkspaceName | The name of the Log Analytics Workspace to be used. Use empty string to disable usage of Log Analytics. | string | "" | no |
| logAnalyticsWorkspaceResourceGroupName | The name of the resource group of the Log Analytics Workspace to be used. | string | "" | no |
| simpheraInstances | A list containing the individual SIMPHERA instances, such as 'staging' and 'production'. | map(object({ name = string, minioAccountReplicationType = string, postgresqlVersion = string, postgresqlSkuName = string, postgresqlStorage = number })) | n/a | yes |
| ssh_public_key_path | Path to the public SSH key to be used for the kubernetes nodes. | string | "shared-ssh-key/ssh.pub" | no |
| tags | The tags to be added to all resources. | map(any) | {} | no |

Outputs

| Name | Description |
|------|-------------|
| key_vault_id | n/a |
| key_vault_name | n/a |
| key_vault_uri | n/a |
| kube_config | n/a |
| minio_storage_usernames | n/a |
| postgresql_server_hostnames | n/a |
| postgresql_server_usernames | n/a |
| secretnames | n/a |