Unable to deploy a 'Compute Instance' User Resource to a Workspace AML Service #4151
Comments
Hi @dram1964, can you create an AML in the portal manually?
Hi @tim-allen-ck - logged in as the Global Admin for the tenant, I've created an AML workspace with basic settings (public access) in the UK South region and added a compute, which completed in 5 minutes or so. My efforts via the TRE usually take around 30 minutes before they report a failure. I could try to repeat the exercise using an adjusted version of the terraform code from the AML workspace service if that would be useful. Should I use the same credentials as I have in the TRE code?
Interesting development - Decided to re-deploy the AML Service into a workspace, this time with
Could potentially be something to do with private endpoints within the vnet?
@dram1964 did you get any further with this? If it is compute size, it doesn't really make sense that it works in one network configuration but not the other. As @tim-allen-ck says, if you can deploy the instance through the AML studio, it would be useful to identify whether the issue is the templates in this project or a subscription/quota issue.
@marrobi, @tim-allen-ck: I couldn't see any quota issues with private endpoints in the subscription/region (25/65,000). I've destroyed my original TRE deployment and created a new one (without my custom templates) in the same subscription/region, but I'm still having the same issue.
Have you tried to create the compute instance via the AML studio?
Only on a Workspace that had public access. A private AML workspace looked a bit complicated on the Portal - it seems I need to create a VNet beforehand to then create private endpoints. I can give it a go though.
Just a quick update: I thought I'd try a re-deployment of the TRE in the westeurope region, but the same error occurs when trying to deploy the compute instance. So I'm going to manually deploy a private AML instance and see how that goes. I'm going to use the terraform quickstart template rather than attempt this in the portal.
I've set up an AML workspace with
Hi @marrobi - may have misread this question originally. I've had the same error trying to deploy the compute instance to a private 'azureml' Workspace Service in two different ways:
@dram1964 I get the same issue. Are you able to open an Azure support case for Azure ML to get a reason why this is happening, and then we can work to resolve it? If you provide them with this error they should be able to advise:
A few of the rules seem to have the issue. Seems the tags being output from: Are invalid.
Fixed... @dram1964 please do check this fix. Can do an upgrade on existing AML workspace deployment and then deploy a new compute instance. Something must have changed in the Terraform data source.
Hi @marrobi - I've applied the fix from marrobi:marrobi/issue4151, and upgraded an existing private AML resource with the new code and it works. I've tested both:
Nice work! I've no idea how you worked that out, but it's pretty impressive.
Great, thanks so much for testing. I found this page with the error - https://learn.microsoft.com/en-us/troubleshoot/azure/hpc/batch/azure-batch-pool-resizing-failure#symptom-for-scenario-1 . Azure ML uses Azure Batch for compute, which made me think network/firewall. I could see the traffic being blocked on the firewall, so I allowed it and it worked; I then checked the rules and found the error here - #4151 (comment). We will look to get the PR merged.
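For anyone wanting to run a similar sanity check on their own rules, here is a minimal Python sketch of the idea described above: scan a set of firewall rules and flag any whose service tags do not look valid. This is not the fix from the PR and not code from the TRE templates. The rule layout (`name`, `destination_tags`), the `KNOWN_SERVICE_TAGS` subset, and the "Base" or "Base.Region" naming rule are all assumptions made for illustration.

```python
from __future__ import annotations

# Illustrative subset of service tag base names (assumption); a real check
# would compare against the list Azure publishes for the subscription/region.
KNOWN_SERVICE_TAGS = {
    "AzureMachineLearning",
    "BatchNodeManagement",
    "AzureActiveDirectory",
    "AzureResourceManager",
    "Storage",
}


def is_valid_tag(tag: str, known: set[str] = KNOWN_SERVICE_TAGS) -> bool:
    """Accept 'Base' or 'Base.Region', where Base is a known tag name."""
    base, dot, region = tag.partition(".")
    if base not in known:
        return False
    # If a dot is present, the region part must be a non-empty alphanumeric name.
    return region.isalnum() if dot else True


def find_invalid_rules(rules: list[dict]) -> dict[str, list[str]]:
    """Map each rule name to the tags in it that fail validation."""
    invalid: dict[str, list[str]] = {}
    for rule in rules:
        bad = [t for t in rule.get("destination_tags", []) if not is_valid_tag(t)]
        if bad:
            invalid[rule["name"]] = bad
    return invalid


if __name__ == "__main__":
    # Hypothetical rules; the names and tags are made up for this example.
    sample_rules = [
        {"name": "allow-aml", "destination_tags": ["AzureMachineLearning"]},
        {"name": "allow-batch", "destination_tags": ["BatchNodeManagement.", "Storage.UKSouth"]},
    ]
    print(find_invalid_rules(sample_rules))
```

A rule with a malformed tag (for example an empty region suffix) is reported, while well-formed tags pass. The actual cause in this issue should be taken from the linked comment and PR, not from this sketch.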
Deployment of AML Compute Instance fails
When adding a Compute Instance to a TRE Workspace AML Service, the deployment fails with the following error: "desired number of dedicated nodes could not be allocated". This error has been happening consistently for the past two days. I have not tried it before then with this version of the TRE.
This error occurs when deploying via:
Steps to reproduce
Additional Steps taken
Additional Info
Azure TRE release version: v0.19.1
tre-workspace-base: 1.5.7
tre-service-azureml: 0.8.11
tre-user-resource-aml-compute-instance: 0.5.7
deployment location: UKSouth