[BUG] Spot instance type recommendation not available on recommended region #313

Open
adrianriobo opened this issue Oct 15, 2024 · 8 comments

@adrianriobo
Collaborator

Some checks are missing when looking for the best spot price and machine type:

We can see:

DEBU Based on avg prices for instance types [m4.large m5.large m5a.large m5ad.large m5d.large m5dn.large m5n.large m5zn.large m6a.large m6i.large m6id.large m6idn.large m6in.large m7a.large m7i-flex.large] is az eu-west-2b, current avg price is 0.05 and max price is 0.05 with a score of 9
INFO @ updating.............
INFO  +  rh:qe:aws:bso main-bso-bso creating (0s)
INFO  +  rh:qe:aws:bso main-bso-bso created
INFO  +  pulumi:pulumi:Stack debug-fedora-spotOption-debug-fedora created (10s)
INFO Outputs:
INFO     avg   : 0.0452
INFO     az    : "eu-west-2b"
INFO     max   : 0.0452
INFO     region: "eu-west-2"
INFO     score : 9

But when we try to use the recommended machine type, we get:

Diagnostics:
  aws:ec2:Eip (eip-publicmain-afd-net):
    warning: urn:pulumi:stackFedoraBaremetal-debug-fedora::debug-fedora::aws:ec2/eip:Eip::eip-publicmain-afd-net verification warning: use domain attribute instead

  pulumi:pulumi:Stack (debug-fedora-stackFedoraBaremetal-debug-fedora):
    error: update failed

  aws:autoscaling:Group (main-afd-asg):
    error: 1 error occurred:
    	* creating Auto Scaling Group (main-afd-asg-89ce230): operation error Auto Scaling: CreateAutoScalingGroup, https response error StatusCode: 400, RequestID: 0175d099-0523-431f-9dee-a39975af885d, api error ValidationError: The specified instance type m5zn.large is not valid

Resources:
    + 20 created

Duration: 2m45s
@adrianriobo
Collaborator Author

This is partially fixed with 77281e5, but there are still inconsistencies across machine types and regions (e.g. Windows on AWS does not use the check at all).

There is an option to use metadata (machine specs instead of actual types) both for spot price searches and for autoscaling groups. We may need to consider whether we can make use of it on one side or even on both.
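
To make the spec-based idea a bit more concrete, here is a minimal Go sketch, not mapt code. It selects machines by requirements (vCPUs, memory, architecture) and keeps only the types that are offered in the target availability zone. The `MachineSpec` and `InstanceType` types, the in-memory catalog, and the availability data are all hypothetical; a real implementation would fill them from the provider APIs.

```go
package main

import (
	"fmt"
	"sort"
)

// MachineSpec describes requirements instead of a concrete instance type.
// Hypothetical type for illustration; not taken from mapt.
type MachineSpec struct {
	MinVCPUs  int
	MinMemGiB int
	Arch      string // e.g. "x86_64" or "arm64"
}

// InstanceType is a hypothetical catalog entry.
type InstanceType struct {
	Name   string
	VCPUs  int
	MemGiB int
	Arch   string
}

// matchBySpec returns the sorted names of catalog entries that satisfy spec
// and are offered in the given availability zone. offeredInAZ maps an AZ name
// to the set of instance type names that can be launched there.
func matchBySpec(spec MachineSpec, catalog []InstanceType, offeredInAZ map[string]map[string]bool, az string) []string {
	var out []string
	for _, it := range catalog {
		if it.Arch != spec.Arch || it.VCPUs < spec.MinVCPUs || it.MemGiB < spec.MinMemGiB {
			continue // does not meet the requested specs
		}
		if !offeredInAZ[az][it.Name] {
			continue // the AZ cannot launch this type
		}
		out = append(out, it.Name)
	}
	sort.Strings(out)
	return out
}

func main() {
	catalog := []InstanceType{
		{"m5.large", 2, 8, "x86_64"},
		{"m5zn.large", 2, 8, "x86_64"},
		{"m6g.large", 2, 8, "arm64"},
	}
	// Example data only: assume m5zn.large is not offered in eu-west-2b.
	offered := map[string]map[string]bool{
		"eu-west-2b": {"m5.large": true, "m6g.large": true},
	}
	spec := MachineSpec{MinVCPUs: 2, MinMemGiB: 8, Arch: "x86_64"}
	fmt.Println(matchBySpec(spec, catalog, offered, "eu-west-2b"))
}
```

The point of the second check is that the price search and the provisioning step then agree on which types are usable in the chosen zone.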

@adrianriobo
Collaborator Author

Running some pipelines trying to provision Fedora with arm64 on Azure, I got a similar issue:

DEBU Best spot price option found: &{standard_d16ps_v5 westus 0.095433}
INFO @ updating....
INFO  +  azure-native:resources:ResourceGroup fedora-als-rg creating (0s)
INFO  +  tls:index:PrivateKey fedora-als-privatekey-user creating (0s)
INFO @ updating.....
INFO  +  azure-native:resources:ResourceGroup fedora-als-rg created (1s)
INFO  +  azure-native:network:PublicIPAddress fedora-als-pip creating (0s)
INFO  +  azure-native:network:VirtualNetwork fedora-als-vn creating (0s)
INFO @ updating....
INFO  +  tls:index:PrivateKey fedora-als-privatekey-user created (2s)
INFO @ updating.....
INFO  +  azure-native:network:PublicIPAddress fedora-als-pip created (3s)
INFO @ updating......
INFO  +  azure-native:network:VirtualNetwork fedora-als-vn created (5s)
INFO  +  azure-native:network:Subnet fedora-als-sn creating (0s)
INFO @ updating.......
INFO  +  azure-native:network:Subnet fedora-als-sn created (4s)
INFO  +  azure-native:network:NetworkInterface fedora-als-ni creating (0s)
INFO @ updating.....
INFO  +  azure-native:network:NetworkInterface fedora-als-ni created (2s)
INFO  +  azure-native:compute:VirtualMachine fedora-als-vm creating (0s)
INFO @ updating....................................
INFO  +  azure-native:compute:VirtualMachine fedora-als-vm creating (32s) error: Code="GalleryImageNotFound" Message="\"The gallery image /CommunityGalleries/Fedora-5e266ba4-2250-406d-adad-5d73860d958f/Images/Fedora-Cloud-40-Arm64/Versions/latest is not available in westus region. Please contact image owner to replicate to this region, or change your requested region.\"" Target="imageReference"
INFO  +  azure-native:compute:VirtualMachine fedora-als-vm **creating failed** error: Code="GalleryImageNotFound" Message="\"The gallery image /CommunityGalleries/Fedora-5e266ba4-2250-406d-adad-5d73860d958f/Images/Fedora-Cloud-40-Arm64/Versions/latest is not available in westus region. Please contact image owner to replicate to this region, or change your requested region.\"" Target="imageReference"
INFO  +  pulumi:pulumi:Stack fedora-stackAzureLinux-fedora creating (48s) error: update failed
INFO  +  pulumi:pulumi:Stack fedora-stackAzureLinux-fedora **creating failed (48s)** 1 error
INFO Diagnostics:
INFO   azure-native:compute:VirtualMachine (fedora-als-vm):
INFO     error: Code="GalleryImageNotFound" Message="\"The gallery image /CommunityGalleries/Fedora-5e266ba4-2250-406d-adad-5d73860d958f/Images/Fedora-Cloud-40-Arm64/Versions/latest is not available in westus region. Please contact image owner to replicate to this region, or change your requested region.\"" Target="imageReference"
INFO
INFO   pulumi:pulumi:Stack (fedora-stackAzureLinux-fedora):

@anjannath
Collaborator

anjannath commented Nov 11, 2024

So for AWS we need to check the spot prices based on the VM specs instead of directly using EC2-specific VM type names. For Azure I think we'll need another filter: first we find the VM candidates, then we filter them again based on whether the image for the requested OS is available in the selected location, and finally we return the result.
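
A rough Go sketch of that second filter for Azure, assuming we already have the set of locations where the requested image is published. `SpotOption`, `filterByImage`, and `cheapest` are hypothetical names for illustration, and apart from the `westus` entry from the log above the values are made up.

```go
package main

import "fmt"

// SpotOption is a hypothetical result of the spot price search.
type SpotOption struct {
	VMSize   string
	Location string
	Price    float64
}

// filterByImage keeps only the options whose location is in imageLocations,
// the set of locations where the requested OS image is published.
func filterByImage(candidates []SpotOption, imageLocations map[string]bool) []SpotOption {
	var out []SpotOption
	for _, c := range candidates {
		if imageLocations[c.Location] {
			out = append(out, c)
		}
	}
	return out
}

// cheapest returns the lowest-priced option, or false when no option is left.
func cheapest(options []SpotOption) (SpotOption, bool) {
	if len(options) == 0 {
		return SpotOption{}, false
	}
	best := options[0]
	for _, o := range options[1:] {
		if o.Price < best.Price {
			best = o
		}
	}
	return best, true
}

func main() {
	candidates := []SpotOption{
		{"standard_d16ps_v5", "westus", 0.095433},
		{"standard_d16ps_v5", "eastus", 0.101}, // illustrative price only
	}
	// Illustrative data only: assume the image is published in eastus.
	imageLocations := map[string]bool{"eastus": true}

	best, ok := cheapest(filterByImage(candidates, imageLocations))
	fmt.Println(best, ok)
}
```

With this ordering the cheapest option overall can lose to a slightly more expensive one that can actually boot the image.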

For Azure I also noticed that some regions don't support Resource Groups, which is a hard requirement for mapt as of now, so we can limit the spot search by default to the regions that do support Resource Groups:

Diagnostics:
  azure-native:resources:ResourceGroup (az-ghrunner-awd-rg):
    error: autorest/azure: Service returned an error. Status=400 Code="LocationNotAvailableForResourceGroup" Message="The provided location 'southafricawest' is not available for resource group. List of available regions is 'eastasia,southeastasia,australiaeast,australiasoutheast,brazilsouth,canadacentral,canadaeast,switzerlandnorth,germanywestcentral,eastus2,eastus,centralus,northcentralus,francecentral,uksouth,ukwest,centralindia,southindia,jioindiawest,italynorth,japaneast,japanwest,koreacentral,koreasouth,mexicocentral,northeurope,norwayeast,polandcentral,qatarcentral,spaincentral,swedencentral,uaenorth,westcentralus,westeurope,westus2,westus,southcentralus,westus3,southafricanorth,australiacentral,australiacentral2,israelcentral,westindia,newzealandnorth'."

@adrianriobo
Collaborator Author

Yeah, seems reasonable. In my partial fix I applied on AWS the fix you suggested for Azure, so if we want to change everything there (meaning using the specs directly instead of checking spot by machine type), we can track it as a separate issue (enhancement).

On the Azure side, we should definitely apply the second filter and remove the regions that do not support Resource Groups.

Just as a side note on this last point: I do not want to complicate things, but if I am not wrong, nothing prevents having the resource group in a different region than the resources it groups.

@anjannath
Collaborator

> Just as a side note on this last point: I do not want to complicate things, but if I am not wrong, nothing prevents having the resource group in a different region than the resources it groups.

For Azure, instead of modifying the spot query, we can check whether the location returned by the spot calculation is in the list of locations where the "resource group" resource is supported. If it is, we use this location; if not, we use a hard-coded default location or the value of AZURE_LOCATION to create the resource group.
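
Something along these lines, as a minimal Go sketch. The function names and the `westeurope` default are assumptions for illustration; only `AZURE_LOCATION` comes from the discussion above.

```go
package main

import (
	"fmt"
	"os"
)

// defaultRGLocation is a hypothetical hard-coded default.
const defaultRGLocation = "westeurope"

// fallbackLocation prefers AZURE_LOCATION and falls back to the default.
func fallbackLocation() string {
	if env := os.Getenv("AZURE_LOCATION"); env != "" {
		return env
	}
	return defaultRGLocation
}

// resourceGroupLocation returns spotLocation when Resource Groups are
// supported there, and fallback otherwise. Only the resource group is moved;
// the VM itself would still be created in spotLocation.
func resourceGroupLocation(spotLocation string, supported map[string]bool, fallback string) string {
	if supported[spotLocation] {
		return spotLocation
	}
	return fallback
}

func main() {
	// Small excerpt of the supported list from the error message above.
	supported := map[string]bool{"westus": true, "westeurope": true, "southafricanorth": true}

	fmt.Println(resourceGroupLocation("westus", supported, fallbackLocation()))
	fmt.Println(resourceGroupLocation("southafricawest", supported, fallbackLocation()))
}
```

Passing the fallback in as a parameter keeps the decision easy to test without touching the environment.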

@adrianriobo
Collaborator Author

Today I see this on nightly:

@ updating....
 +  azure-native:compute:VirtualMachine fedora-als-vm creating (1s) error: Code="GalleryImageNotFound" Message="\"The gallery image /CommunityGalleries/Fedora-5e266ba4-2250-406d-adad-5d73860d958f/Images/Fedora-Cloud-40-Arm64/Versions/latest is not available in northcentralus region. Please contact image owner to replicate to this region, or change your requested region.\"" Target="imageReference"
 +  azure-native:compute:VirtualMachine fedora-als-vm **creating failed** error: Code="GalleryImageNotFound" Message="\"The gallery image /CommunityGalleries/Fedora-5e266ba4-2250-406d-adad-5d73860d958f/Images/Fedora-Cloud-40-Arm64/Versions/latest is not available in northcentralus region. Please contact image owner to replicate to this region, or change your requested region.\"" Target="imageReference"
 +  pulumi:pulumi:Stack fedora-stackAzureLinux-fedora creating (14s) error: update failed
 +  pulumi:pulumi:Stack fedora-stackAzureLinux-fedora **creating failed (14s)** 1 error
Diagnostics:
  azure-native:compute:VirtualMachine (fedora-als-vm):
    error: Code="GalleryImageNotFound" Message="\"The gallery image /CommunityGalleries/Fedora-5e266ba4-2250-406d-adad-5d73860d958f/Images/Fedora-Cloud-40-Arm64/Versions/latest is not available in northcentralus region. Please contact image owner to replicate to this region, or change your requested region.\"" Target="imageReference"

  pulumi:pulumi:Stack (fedora-stackAzureLinux-fedora):
    error: update failed

This is kind of related.

@anjannath
Collaborator

For the Fedora image not being present, we first have to find out all the locations where this image is available, or we can report this to the Fedora people in case they want to mirror the images to all locations.

For mapt I think we can handle this in one of two ways: compile a list of all the regions where this image is available and filter the location suggested by the spot calculation against it, or use the Azure SDK to check that the location offered by the spot calculation has the Fedora image and, if not, fail the operation or use the next best location.
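
The "next best location" variant could look roughly like this in Go. The `ImageChecker` function type is a stand-in for whatever Azure SDK call ends up being used (the actual call is not shown here), and all names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// ImageChecker reports whether the requested image is available in a location.
// Hypothetical abstraction; a real one would query Azure.
type ImageChecker func(location string) (bool, error)

// ErrNoLocation is returned when no ranked location offers the image.
var ErrNoLocation = errors.New("no location offers the requested image")

// firstWithImage walks locations ranked best-first by the spot calculation and
// returns the first one where the image is available.
func firstWithImage(ranked []string, hasImage ImageChecker) (string, error) {
	for _, loc := range ranked {
		ok, err := hasImage(loc)
		if err != nil {
			return "", fmt.Errorf("checking image in %s: %w", loc, err)
		}
		if ok {
			return loc, nil
		}
	}
	return "", ErrNoLocation
}

func main() {
	ranked := []string{"westus", "northcentralus", "eastus"}
	// Example data only: assume the image is published in eastus.
	available := map[string]bool{"eastus": true}

	loc, err := firstWithImage(ranked, func(l string) (bool, error) {
		return available[l], nil
	})
	fmt.Println(loc, err)
}
```

Returning `ErrNoLocation` covers the "fail the operation" case when the ranked list is exhausted.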

@anjannath
Collaborator

In #353 we hardcoded the list of locations that support Resource Groups for Azure. It would be better to fetch this list from Azure dynamically with ARM graph queries.
