Unable to deploy a 'Compute Instance' User Resource to a Workspace AML Service #4151

Open
dram1964 opened this issue Nov 20, 2024 · 17 comments · May be fixed by #4246

@dram1964

Deployment of AML Compute Instance fails

When adding a Compute Instance to a TRE Workspace AML Service, the deployment fails with the following error:
"desired number of dedicated nodes could not be allocated". This error has occurred consistently for the past
two days. I had not tried this before then with this version of the TRE.

This error occurs when deploying via:

  1. the TRE UI, using the 'aml_compute' user-resource template, and
  2. the AML Workspace, logging in from a workspace VM and trying to create a new compute instance

Steps to reproduce

  1. Create New Workspace and User
  2. Add User to 'Workspace Owners'
  3. Add User to 'Workspace Researchers'
  4. Login to TRE UI with User account
  5. Add Virtual Desktops (Guacamole) Service to Workspace
  6. Add a User Resource (VM) to Virtual Desktops Service
  7. Add Azure ML Service to Workspace ('expose externally' = False)
  8. Add a Compute instance User Resource to AML Service

Additional Steps taken

  1. Grant User 'Network Contributor' on the TRE Workspace VNet
  2. Grant User 'AzureML Compute Operator' on the Workspace AML Workspace

Additional Info

  • There is sufficient quota for the selected compute size in my deployment region.
  • I have tried a number of different compute sizes.
  • I have confirmed that there are free IP addresses in the AML subnet.
  • All resources are deployed to the UK South region.
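
As a rough illustration of the free-IP check mentioned above, the number of assignable addresses left in a subnet can be estimated from its CIDR range. This is a minimal sketch: the subnet range and allocated count are hypothetical values, not taken from this deployment, and `free_ip_count` is an illustrative helper rather than part of the TRE. It relies on Azure reserving five addresses in every subnet.

```python
import ipaddress

# Azure reserves 5 addresses in every subnet (network, gateway, 2 x DNS, broadcast).
AZURE_RESERVED_PER_SUBNET = 5


def free_ip_count(subnet_cidr: str, allocated: int) -> int:
    """Estimate how many addresses remain assignable in an Azure subnet.

    subnet_cidr: the subnet address range, e.g. "10.1.4.0/24" (hypothetical).
    allocated:   number of addresses already in use (hypothetical input).
    """
    network = ipaddress.ip_network(subnet_cidr, strict=True)
    usable = network.num_addresses - AZURE_RESERVED_PER_SUBNET
    # Never report a negative number of free addresses.
    return max(usable - allocated, 0)


if __name__ == "__main__":
    # Hypothetical example: a /24 subnet with 12 addresses in use.
    print(free_ip_count("10.1.4.0/24", 12))
```

A result of zero from a check like this would point to subnet exhaustion, which was not the case here.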

Azure TRE release version: v0.19.1
tre-workspace-base: 1.5.7
tre-service-azureml: 0.8.11
tre-user-resource-aml-compute-instance: 0.5.7
deployment location: UKSouth

@dram1964 dram1964 added the bug Something isn't working label Nov 20, 2024
@tim-p-allen
Collaborator

Hi @dram1964, can you create an AML workspace in the portal manually?

@dram1964
Author

Hi @tim-allen-ck - logged-in as the Global Admin for the tenant, I've created an AML workspace with basic settings (public access) in the UK South region and added a compute which completed in 5 minutes or so. My efforts via the TRE usually take around 30 minutes before they report a failure.

I could try to repeat the exercise using an adjusted version of the terraform code from the AML workspace service if that would be useful. Should I use the same credentials as I have in the TRE code?

@dram1964
Author

Interesting development - Decided to re-deploy the AML Service into a workspace, this time with expose externally set to True. When I tried to add a compute instance from the user resource template it succeeded, and I can connect and run code on it.

@tim-p-allen
Collaborator

Could potentially be something to do with private endpoints within the vnet?

@marrobi
Member

marrobi commented Nov 26, 2024

@dram1964 did you get any further with this? If it is compute size, it doesn't really make sense that it works in one network configuration but not the other. As @tim-allen-ck says, if you can deploy the instance through the AML studio, that would help identify whether the issue is the templates in this project or a subscription/quota issue.

@dram1964
Author

@marrobi , @tim-allen-ck: I couldn't see any quota issues with private endpoints in the subscription/region (25/65,000). I've destroyed my original TRE deployment, and created a new one (without my custom templates) in the same subscription/region: but I'm still having the same issue.

@marrobi
Member

marrobi commented Nov 26, 2024

Have you tried to create the compute instance via the AML studio?

@dram1964
Author

Only on a Workspace that had public access. A private AML workspace looked a bit complicated on the Portal: it seems I need to create a VNet beforehand to then create private endpoints. I can give it a go though.

@dram1964
Author

Just a quick update: I tried re-deploying the TRE in the westeurope region, but the same error occurs when trying to deploy the compute instance.

So I'm going to manually deploy a private AML instance and see how that goes. I'm going to use the terraform quickstart template rather than attempt this in the portal.

@dram1964
Author

I've set up an AML workspace with public_network_access_enabled = false, and successfully deployed a compute instance using this code.

@dram1964
Author

> Have you tried to create the compute instance via the AML studio?

Hi @marrobi - I may have misread this question originally. I've had the same error trying to deploy the compute instance to a private 'azureml' Workspace Service in two different ways:

  1. Using the TRE Portal and deploying an 'aml_compute' user resource
  2. Using the AML studio from a VM within the workspace

@marrobi
Member

marrobi commented Jan 2, 2025

@dram1964 I get the same issue. Are you able to open an Azure support case for Azure ML to get a reason why this is happening, and then we can work to resolve?

If you provide them with this error, they should be able to advise:

Image

@marrobi
Member

marrobi commented Jan 2, 2025

I think this is the issue, though it is still to be verified: for some reason the firewall rule has type IP address, not service tag:

Image

@marrobi
Member

marrobi commented Jan 2, 2025

A few of the rules seem to have the issue.

It seems the tags being output from

https://github.com/marrobi/AzureTRE/blob/397ab13d6e215e3902d8609175a4333f1c6825aa/templates/workspace_services/azureml/terraform/outputs.tf#L29-L45

are invalid.
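
One way to spot this kind of mismatch is to look for rules typed as IP address whose destinations do not parse as an IP or CIDR range. The sketch below assumes a hypothetical rule shape (a dict with `destination_type` and `destinations` keys); it is not the Azure Firewall API or the TRE's own code, and the example rule is made up for illustration.

```python
import ipaddress
from typing import List


def is_ip_or_cidr(value: str) -> bool:
    """Return True if value parses as an IPv4/IPv6 address or CIDR range."""
    try:
        ipaddress.ip_network(value, strict=False)
        return True
    except ValueError:
        return False


def mistyped_destinations(rule: dict) -> List[str]:
    """List destinations that cannot be IPs in a rule typed as 'ip_address'.

    The rule layout here is an assumption for illustration only:
      {"name": str, "destination_type": "ip_address" | "service_tag",
       "destinations": [str, ...]}
    """
    if rule["destination_type"] != "ip_address":
        return []
    # "*" is treated as a wildcard, so it is not flagged.
    return [d for d in rule["destinations"] if d != "*" and not is_ip_or_cidr(d)]


if __name__ == "__main__":
    # Hypothetical rule: a service tag name ended up in an IP-typed rule.
    example_rule = {
        "name": "example-aml-rule",
        "destination_type": "ip_address",
        "destinations": ["AzureMachineLearning", "10.0.0.0/24"],
    }
    print(mistyped_destinations(example_rule))
```

Any value flagged by a check like this is a candidate for a service tag that landed in the wrong rule type.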

@marrobi marrobi self-assigned this Jan 2, 2025
@marrobi marrobi linked a pull request Jan 2, 2025 that will close this issue
@marrobi
Member

marrobi commented Jan 2, 2025

Fixed...

Image

@dram1964 please do check this fix. You can upgrade an existing AML workspace deployment and then deploy a new compute instance.

Something must have changed in the Terraform data source.

@dram1964
Author

dram1964 commented Jan 3, 2025

Hi @marrobi - I've applied the fix from marrobi:marrobi/issue4151, and upgraded an existing private AML resource with the new code and it works.

I've tested both:

  1. deploying a compute instance from the portal using the user resource
  2. deploying a compute instance from AMLS from within a workspace VM

Image

Nice work! I've no idea how you worked that out, but it's pretty impressive.

@marrobi
Member

marrobi commented Jan 3, 2025

Great, thanks so much for testing. I found this page with the error (https://learn.microsoft.com/en-us/troubleshoot/azure/hpc/batch/azure-batch-pool-resizing-failure#symptom-for-scenario-1). Azure ML uses Azure Batch for compute, which made me think network/firewall. I could see the traffic being blocked on the firewall, so I allowed it and the deployment worked. I then checked the rules and found the error here: #4151 (comment)

We will look to get the PR merged.
