Unable to deploy a 'Compute Instance' User Resource to a Workspace AML Service #4151

dram1964 · 2024-11-20T17:13:40Z

Deployment of AML Compute Instance fails

When adding a Compute Instance to a TRE Workspace AML Service, the deployment fails with the following error:
"desired number of dedicated nodes could not be allocated". This error has occurred consistently for the past
two days. I had not tried it before then with this version of the TRE.

This error occurs when deploying via either of the following:

Using the TRE UI with the 'aml_compute' user-resource template
Logging into the AML Workspace from a workspace VM and trying to create a new compute instance

Steps to reproduce

Create New Workspace and User
Add User to 'Workspace Owners'
Add User to 'Workspace Researchers'
Login to TRE UI with User account
Add Virtual Desktops (Guacamole) Service to Workspace
Add a User Resource (VM) to Virtual Desktops Service
Add Azure ML Service to Workspace ('expose externally' = False)
Add a Compute instance User Resource to AML Service

Additional Steps taken

Grant User 'Network Contributor' on the TRE Workspace VNet
Grant User 'AzureML Compute Operator' on the Workspace AML Workspace

Additional Info

There is sufficient quota for the selected compute size in my deployment region.
I have tried a number of different compute sizes.
I have confirmed that there are free IP addresses in the AML Subnet.
All resources are deployed to the UK South region.
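
For anyone repeating the free-IP check, the arithmetic can be sketched as below. This is a minimal illustration, not part of the TRE tooling: the function name, the CIDR, and the allocated count are hypothetical. The only fixed fact it relies on is that Azure reserves five addresses in every subnet.

```python
import ipaddress

# Azure reserves 5 addresses in every subnet:
# network address, default gateway, two for Azure DNS, and broadcast.
AZURE_RESERVED_PER_SUBNET = 5


def free_ip_count(subnet_cidr: str, allocated: int) -> int:
    """Return how many addresses are still assignable in an Azure subnet.

    subnet_cidr: the subnet's address prefix, e.g. "10.1.0.64/26".
    allocated:   number of addresses already in use (from the portal or CLI).
    """
    subnet = ipaddress.ip_network(subnet_cidr, strict=True)
    usable = subnet.num_addresses - AZURE_RESERVED_PER_SUBNET
    if allocated < 0 or allocated > usable:
        raise ValueError(f"allocated must be between 0 and {usable}")
    return usable - allocated


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    print(free_ip_count("10.1.0.64/26", allocated=12))
```

A /26 holds 64 addresses, so 59 are usable once the reserved ones are subtracted. A compute instance that fails while this number is comfortably above zero points away from address exhaustion.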

Azure TRE release version: v0.19.1
tre-workspace-base: 1.5.7
tre-service-azureml: 0.8.11
tre-user-resource-aml-compute-instance: 0.5.7
deployment location: UKSouth


tim-p-allen · 2024-11-21T10:18:18Z

Hi @dram1964, can you create an AML in the portal manually?

dram1964 · 2024-11-21T10:43:32Z

Hi @tim-allen-ck - logged in as the Global Admin for the tenant, I've created an AML workspace with basic settings (public access) in the UK South region and added a compute instance, which completed in 5 minutes or so. My attempts via the TRE usually take around 30 minutes before they report a failure.

I could try to repeat the exercise using an adjusted version of the terraform code from the AML workspace service if that would be useful. Should I use the same credentials as I have in the TRE code?

dram1964 · 2024-11-21T12:22:15Z

Interesting development: I decided to re-deploy the AML Service into a workspace, this time with 'expose externally' set to True. When I tried to add a compute instance from the user resource template it succeeded, and I can connect and run code on it.

tim-p-allen · 2024-11-22T09:05:20Z

Could potentially be something to do with private endpoints within the vnet?

marrobi · 2024-11-26T05:35:18Z

@dram1964 did you get any further with this? If it is compute size, it doesn't really make sense that it works in one network configuration but not the other. As @tim-allen-ck says, if you can deploy the instance through the AML studio, that would help identify whether the issue is with the templates in this project or is a subscription/quota issue.

dram1964 · 2024-11-26T09:19:53Z

@marrobi , @tim-allen-ck: I couldn't see any quota issues with private endpoints in the subscription/region (25/65,000). I've destroyed my original TRE deployment, and created a new one (without my custom templates) in the same subscription/region: but I'm still having the same issue.

marrobi · 2024-11-26T09:24:34Z

Have you tried to create the compute instance via the AML studio?

dram1964 · 2024-11-26T09:51:58Z

Only on a Workspace that had public access. A private AML workspace looked a bit complicated on the Portal: it seems I need to create a VNet beforehand and then create private endpoints. I can give it a go though.

dram1964 · 2024-11-26T11:26:23Z

Just a quick update: thought I'd try a re-deployment of the TRE in the westeurope region: but the same error occurs when trying to deploy the compute instance.

So I'm going to manually deploy a private AML instance and see how that goes. I'm going to use the terraform quickstart template rather than attempt this in the portal.

dram1964 · 2024-11-26T13:25:47Z

I've set up an AML workspace with public_network_access_enabled = false, and successfully deployed a compute instance using this code.

dram1964 · 2024-12-24T09:13:27Z

Quoting @marrobi: "Have you tried to create the compute instance via the AML studio?"

Hi @marrobi - may have misread this question originally. I've had the same error trying to deploy the compute instance to a private 'azureml' Workspace Service in two different ways:

Using the TRE Portal and deploying an 'aml_compute' user resource
Using the AML studio from a VM within the workspace

marrobi · 2025-01-02T15:26:59Z

@dram1964 I get the same issue. Are you able to open an Azure support case for Azure ML to get a reason why this is happening, and then we can work to resolve?

If you provide them with this error, they should be able to advise:

marrobi · 2025-01-02T17:22:01Z

I think this is the issue, though it is still to be verified. For some reason the firewall rule has type IP address, not service tag:

marrobi · 2025-01-02T17:26:18Z

A few of the rules seem to have the issue.

It seems the tags being output from:

https://github.com/marrobi/AzureTRE/blob/397ab13d6e215e3902d8609175a4333f1c6825aa/templates/workspace_services/azureml/terraform/outputs.tf#L29-L45

are invalid.
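
To illustrate the kind of mismatch described here, the sketch below classifies each rule destination as an IP address/CIDR or a service tag, and flags rules whose declared type disagrees with their contents. It is an assumption-laden illustration only: the rule dictionaries, field names, and the tag-name pattern are hypothetical and do not reflect the TRE firewall's actual schema or Azure's official grammar for service tags.

```python
import ipaddress
import re

# Assumed shape of a service tag name: alphanumeric, optionally followed by
# ".<Region>" (e.g. "BatchNodeManagement.UKSouth"). Illustrative only.
_TAG_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9]*(\.[A-Za-z][A-Za-z0-9]*)?$")


def classify_destination(value: str) -> str:
    """Return 'ip', 'service_tag', or 'invalid' for a rule destination string."""
    candidate = value.strip()
    try:
        # Accepts single addresses and CIDR ranges, IPv4 or IPv6.
        ipaddress.ip_network(candidate, strict=False)
        return "ip"
    except ValueError:
        pass
    if _TAG_PATTERN.match(candidate):
        return "service_tag"
    return "invalid"


def mismatched_rules(rules):
    """Yield names of rules whose declared type disagrees with their destinations."""
    for rule in rules:
        kinds = {classify_destination(d) for d in rule["destinations"]}
        if kinds != {rule["declared_type"]}:
            yield rule["name"]


if __name__ == "__main__":
    # Hypothetical rules, shaped only for this example.
    rules = [
        {"name": "allow-aml", "declared_type": "service_tag",
         "destinations": ["AzureMachineLearning"]},
        {"name": "allow-batch", "declared_type": "ip",
         "destinations": ["BatchNodeManagement.UKSouth"]},
    ]
    print(list(mismatched_rules(rules)))
```

In this example the second rule is flagged, because it is declared as an IP address rule but carries a service tag name as its destination.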

marrobi · 2025-01-02T19:23:38Z

Fixed...

@dram1964 please do check this fix. You can upgrade an existing AML workspace service deployment and then deploy a new compute instance.

Something must have changed in the Terraform data source.

dram1964 · 2025-01-03T12:35:52Z

Hi @marrobi - I've applied the fix from marrobi:marrobi/issue4151, and upgraded an existing private AML resource with the new code and it works.

I've tested both:

deploying a compute instance from the portal using the user resource
deploying a compute instance from AMLS from within a workspace VM

Nice work! I've no idea how you worked that out, but it's pretty impressive.

marrobi · 2025-01-03T12:58:01Z

Great, thanks so much for testing. I found this page describing the error (https://learn.microsoft.com/en-us/troubleshoot/azure/hpc/batch/azure-batch-pool-resizing-failure#symptom-for-scenario-1). Azure ML uses Azure Batch for compute, which made me think of a network/firewall problem. I could see the traffic being blocked on the firewall, so I allowed it and the deployment worked. I then checked the rules and found the error described here: #4151 (comment)

We will look to get the PR merged.

dram1964 added the 'bug' (Something isn't working) label Nov 20, 2024

marrobi self-assigned this Jan 2, 2025

marrobi linked a pull request Jan 2, 2025 that will close this issue

Fix network tags and depreciated TF for Azure ML. #4246

Open
