FAQs, advanced troubleshooting and known issues for Cromwell on Azure

This article answers FAQs, describes advanced features for customizing and debugging Cromwell on Azure, and explains how to diagnose, debug, and work around known issues. We are actively tracking these issues as bugs to be fixed in upcoming releases.

  1. Setup

  2. Analysis

  3. Customizing your instance

  4. Performance & Optimization

  5. Miscellaneous

Known Issues And Mitigation

I am trying to use files with SAS tokens but run into file access issues

There is currently a bug (which we are tracking) in a dependency tool we use to get files from Azure Storage to the VM to perform a task. For now, follow these steps as a workaround if you are running into errors getting access to your files using SAS tokens on Cromwell on Azure. If you followed these instructions to create a SAS URL, you’ll get something similar to

https://YourStorageAccount.blob.core.windows.net/inputs?sv=2018-03-28&si=inputs-key&sr=c&sig=somestring

Focus on this part: si=inputs-key&sr=c

Manually swap the order of the sr and si fields to get something similar to

https://YourStorageAccount.blob.core.windows.net/inputs?sv=2018-03-28&sr=c&si=inputs-key&sig=somestring

After the change, sr=c&si=inputs-key should be the order in your SAS URL.

Update all the SAS URLs similarly and retry your workflow.
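If you have many SAS URLs to update, the reordering can be scripted. The following is a minimal sketch, not part of Cromwell on Azure: the function name `reorder_sas_url` is hypothetical, and it assumes the query string uses plain `&`-separated `key=value` fields as in the example above. It moves fields around as raw strings so the signature is never re-encoded.

```python
from urllib.parse import urlsplit, urlunsplit


def reorder_sas_url(sas_url: str) -> str:
    """Return sas_url with the sr field placed immediately before the si field.

    Hypothetical helper for the workaround described above. URLs that are
    already in the sr-then-si order, or that lack either field, are
    returned unchanged.
    """
    parts = urlsplit(sas_url)
    fields = parts.query.split("&")
    # Compare keys only; values (including sig) are left untouched.
    keys = [field.split("=", 1)[0] for field in fields]
    if "sr" in keys and "si" in keys:
        si_index, sr_index = keys.index("si"), keys.index("sr")
        if si_index < sr_index:
            sr_field = fields.pop(sr_index)
            fields.insert(si_index, sr_field)
    return urlunsplit(parts._replace(query="&".join(fields)))
```

Calling the function on the first example URL in this section returns the second one; calling it again on the result leaves it unchanged.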

All TES tasks for my workflow are done running, but the trigger JSON file is still in the "inprogress" directory in the workflows container

  1. The root cause is most likely memory pressure on the host Linux VM because blobfuse processes grow to consume all physical memory.

You may see the following Cromwell container logs as a symptom:

Cromwell shutting down because it cannot access the database): Shutting down cromid-5bd1d24 as at least 15 minutes of heartbeat write errors have occurred between 2020-02-18T22:03:01.110Z and 2020-02-18T22:19:01.111Z (16.000016666666667 minutes

To mitigate, please resize your VM in the resource group to a machine with at least 14GB memory/RAM. Any workflows still in progress will not be affected.

  2. Another possible scenario is that the "mysql" database is in an unusable state, which means Cromwell cannot continue processing workflows.

You may see the following Cromwell container logs as a symptom:

Failed to instantiate Cromwell System. Shutting down Cromwell. liquibase.exception.LockException: Could not acquire change log lock. Currently locked by 012ec19c3285 (172.18.0.4) since 2/19/20 4:10 PM

Note: This has been fixed in Release 2.1. If you use the 2.1 deployer or update to this version, you can skip the mitigation steps below.

For Release 2.0 and below: To mitigate, log on to the host VM and execute the following and then restart the VM:

sudo docker exec -it cromwellazure_mysqldb_1 bash -c 'mysql -ucromwell -Dcromwell_db -pcromwell -e"SELECT * FROM DATABASECHANGELOGLOCK;UPDATE DATABASECHANGELOGLOCK SET LOCKED=0, LOCKGRANTED=null, LOCKEDBY=null where ID=1;SELECT * FROM DATABASECHANGELOGLOCK;"'
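To tell the two scenarios above apart quickly, you can scan saved Cromwell container logs for the symptom text. The following is a minimal sketch under stated assumptions: the helper name and the returned labels are hypothetical, and the matched fragments are copied from the sample log lines in this section, so other wordings of the same errors will not be detected.

```python
from typing import List

# Hypothetical mapping: log fragments (taken from the sample messages in
# this guide) to the mitigation this guide describes for each.
KNOWN_SYMPTOMS = {
    "heartbeat write errors": "memory pressure: resize the host VM",
    "Could not acquire change log lock": "database lock: clear DATABASECHANGELOGLOCK (Release 2.0 and below)",
}


def diagnose_cromwell_log(log_text: str) -> List[str]:
    """Return the mitigations whose symptom fragment appears in log_text."""
    return [
        mitigation
        for fragment, mitigation in KNOWN_SYMPTOMS.items()
        if fragment in log_text
    ]
```

An empty list means neither known symptom was found, not that the deployment is healthy.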

Setup

Setup Cromwell on Azure for multiple users in the same Azure subscription

Cromwell on Azure is designed to be flexible for single and multiple user scenarios. Here we have envisioned 4 general scenarios and demonstrated how they relate to your Azure account, Azure Batch service, Subscription ID, and Resource Groups, each depicted below.

Multiple Users FAQ

  1. The Individual User: This is the current standard deployment configuration for Cromwell on Azure. No extra steps beyond the deployment guide are necessary.

  2. The Lab: This scenario is envisioned for small lab groups and teams sharing a common Azure resource (i.e., bioinformaticians, data scientists, or computational biologists collaborating on projects from the same lab). Functionally, this setup does not differ from the "Individual User" configuration. We recommend that a single "Cromwell Administrator" perform the initial Cromwell on Azure setup for the group. Ensure that this user has the appropriate role(s) on the Subscription ID as outlined here. Once deployed, this "Cromwell Administrator" can grant "Contributor" access to the created Cromwell storage account via the Azure Portal. This allows the granted users to submit analysis jobs and retrieve results, and to view any analysis that has been run by the lab. Because Cromwell submits all jobs to Azure Batch as one user, billing for Cromwell on Azure usage is collective for the entire lab, not broken down by the individual users who submitted the jobs.

  3. The Research Group: This scenario is envisioned for larger research groups where a common Azure subscription is shared, but users want/require their own instance of Cromwell on Azure. The initial Cromwell on Azure deployment is done as described in the deployment guide. After the first deployment of Cromwell on Azure is done on the Subscription, subsequent users will need to specify a separate Resource Group AND preexisting Azure Batch account name that is currently being utilized by the pre-existing deployment(s) of Cromwell on Azure. The Azure Batch account must exist in the same region as defined in the "--RegionName" configuration of the new Cromwell on Azure deployment. You can check all the configuration options here. See the invocation of the Linux deployment script for an example:

./deploy-cromwell-on-azure-linux --SubscriptionId <Your subscription ID> --RegionName <Your region> --MainIdentifierPrefix <Your string> --ResourceGroupName <Your resource group> --BatchAccountName <Your Batch account name>

In this scenario, please note the lack of separation at the Azure Batch account level. While you will be able to track resource usage independently because separate Cromwell users submit analyses to Azure Batch (for your own tracking/internal billing purposes), anyone who has access to Azure Batch as a Contributor or Owner will be able to see everyone's Batch pools, and thus what they are running. For this scenario, we recommend that the Cromwell Administrator(s) be trusted personnel, such as your IT team.

  4. The Institution: This is an enterprise-level deployment scenario for a large organization with multiple Subscriptions and independent user groups within an internal hierarchy. In this scenario, due to the independent nature of the work being done and the desire/need to track specific resource usage (for your own internal billing purposes), you will have completely independent deployments of Cromwell on Azure.

    To deploy, you'll need to verify whether an Azure Batch account already exists on your Subscription (to run Cromwell on Azure on the Subscription level), or within your Resource Group as described in the deployment guide, with appropriate roles set. If no Azure Batch account is deployed on your Subscription (or if you have available quota to create a new Batch account - the default for most accounts is 1 Batch account/region), then simply follow the deployment guide. If you're connecting to an existing Azure Batch account within your Subscription, follow the deployment recommendations outlined in scenario 3, adding the appropriate flags for the deployment script. See the invocation of the Linux deployment script for an example:

./deploy-cromwell-on-azure-linux --SubscriptionId <Your subscription ID> --RegionName <Your region> --MainIdentifierPrefix <Your string> --ResourceGroupName <Your resource group> --BatchAccountName <Your Batch account name>

Please note you can also mix scenarios 1, 2, and 3 within the Azure Enterprise Account in scenario 4.

Debug my Cromwell on Azure installation that ran into an error

When the Cromwell on Azure installer is run, if there are errors, the logs are printed in the terminal. Most errors are related to insufficient permissions to create resources in Azure on your behalf, or intermittent Azure failures. In case of an error, we terminate the installation process and begin deleting all the resources in the Resource Group if already created.

Deleting all the resources in the Resource Group may take a while but as soon as you see logs that the batch account was deleted, you may exit the current process using Ctrl+C or Command+C on terminal/command prompt/PowerShell. The deletion of other Azure resources can continue in the background on Azure. Re-run the installer after fixing any user errors like permissions from the previous try.

If you see an issue that is unrelated to your permissions, and re-trying the installer does not fix it, please file a bug on our GitHub issues.

Upgrade my Cromwell on Azure instance

Starting in version 1.x, for convenience, some configuration files are hosted on your Cromwell on Azure storage account, in the "configuration" container - containers-to-mount and cromwell-application.conf. You can modify and save these files using the Azure Portal UI "Edit Blob" option, or simply upload a new file to replace the existing one. Follow these steps to upgrade your Cromwell on Azure instance to 2.x.

Analysis

Job failed immediately

If a workflow you start has a task that failed immediately and led to workflow failure, be sure to check your input JSON files. Follow the instructions here and check out an example WDL and inputs JSON file here to ensure there are no errors in defining your input files.

For files hosted on an Azure Storage account that is connected to your Cromwell on Azure instance, the input path consists of 3 parts - the storage account name, the blob container name, and the file path with extension - in the following format:

/<storageaccountname>/<containername>/<blobName>

Example file path for an "inputs" container in a storage account "msgenpublicdata" will look like "/msgenpublicdata/inputs/chr21.read1.fq.gz"
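A quick way to avoid malformed input paths is to generate them. The following is a minimal sketch, not part of Cromwell on Azure: both function names are hypothetical, and the URL conversion assumes the standard `<account>.blob.core.windows.net` host form with no SAS token attached.

```python
from urllib.parse import urlsplit


def cromwell_input_path(storage_account: str, container: str, blob_name: str) -> str:
    """Build /<storageaccountname>/<containername>/<blobName>."""
    return "/" + "/".join([storage_account, container, blob_name.lstrip("/")])


def blob_url_to_cromwell_path(blob_url: str) -> str:
    """Convert an https blob URL into the path format shown above.

    Assumes the storage account name is the first label of the host name.
    """
    parts = urlsplit(blob_url)
    storage_account = parts.hostname.split(".")[0]
    return f"/{storage_account}{parts.path}"
```

For example, `cromwell_input_path("msgenpublicdata", "inputs", "chr21.read1.fq.gz")` produces the example path given above.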

Another possibility is that you are trying to use a storage account that hasn't been mounted to your Cromwell on Azure instance - either by default during setup or by following these steps to mount a different storage account.

Check out these known issues and mitigation for more commonly seen issues caused by bugs we are actively tracking.

Check Azure Batch account quotas

If you are running a task in a workflow that requires a large number of CPU cores, check whether your Batch account has enough quota. You can request a quota increase by following these instructions.

For other resource quotas, like active jobs or pools, if there are not enough resources available, Cromwell on Azure keeps the tasks in queue until resources become available. This may lead to longer wait times for workflow completion.

Set up my own WDL

To get started, you can view this Hello World sample or an example WDL to convert FASTQ to UBAM, or follow these steps to convert an existing public WDL for other clouds to run on Azure.
There are also links to ready-to-try WDLs for common workflows here.

Instructions to write a WDL file for a pipeline from scratch are COMING SOON.

Check all tasks running for a workflow using batch account

Each task in a workflow starts an Azure Batch VM. To see currently active tasks, navigate to your Azure Batch account connected to Cromwell on Azure on Azure Portal. Click on "Jobs" and then search for the Cromwell workflowId to see all tasks associated with a workflow.

Find which tasks failed in a workflow

Cosmos DB stores information about all tasks in a workflow. For monitoring or debugging any workflow you may choose to query the database.

Navigate to your Cosmos DB instance on Azure Portal. Click on the "Data Explorer" menu item, click on the "TES" container, and select "Items".

You can write a SQL query to get all tasks that have not completed successfully in a workflow using the following query, replacing workflowId with the id returned from Cromwell for your workflow:

SELECT * FROM c where startswith(c.description,"workflowId") AND c.state != "COMPLETE"

OR

SELECT * FROM c where startswith(c.id,"<first 9 characters of the workflowId>") AND c.state != "COMPLETE"
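If you look up failed tasks often, you can generate the query text instead of editing it by hand. The following is a minimal sketch: the function name `failed_tasks_query` and its `match_on` parameter are hypothetical, and it assumes the workflow id returned by Cromwell is a UUID. It only builds the two query strings shown above for pasting into Data Explorer; it does not connect to Cosmos DB.

```python
import uuid


def failed_tasks_query(workflow_id: str, match_on: str = "description") -> str:
    """Build one of the two queries above for a given Cromwell workflow id.

    match_on="description" matches the full id against c.description;
    match_on="id" matches the first 9 characters against c.id.
    """
    # Parsing as a UUID rejects malformed ids before they reach the query text.
    canonical_id = str(uuid.UUID(workflow_id))
    if match_on == "description":
        prefix = canonical_id
    elif match_on == "id":
        prefix = canonical_id[:9]
    else:
        raise ValueError("match_on must be 'description' or 'id'")
    return (
        f'SELECT * FROM c where startswith(c.{match_on},"{prefix}") '
        'AND c.state != "COMPLETE"'
    )
```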

Make sure there are no Azure infrastructure errors

When working with Cromwell on Azure, you may run into issues with Azure Batch or Storage accounts; for instance, a file path cannot be found, or a WDL workflow fails for an unknown reason. For these scenarios, consider debugging or collecting more information using Application Insights.

Navigate to your Application Insights instance on Azure Portal. Click on the "Logs (Analytics)" menu item under the "Monitoring" section to get all logs from Cromwell on Azure's TES backend.

You can explore exceptions or logs to find the reason for failure, and use time ranges or Kusto Query Language to narrow your search.

Check Azure Storage Tier

Cromwell utilizes Blob storage containers and Blobfuse to allow your data to be accessed and processed. The Blob Storage Access Tier can have a demonstrable effect on your analysis time, particularly on your initial VM preparation. If you experience this, we would recommend setting your access tier to "Hot" instead of "Cool". You can do this under the "Access Tier" settings in the "Configuration" menu on Azure Portal. NOTE: this only affects users utilizing Gen2 Storage Accounts. All Gen 1 "Standard" blobs are access tier "Hot" by default.

Customizing your Cromwell on Azure instance

Connect to the host VM that runs all the Docker containers

To get logs from all the Docker containers or to use the Cromwell REST API endpoints, you may want to connect to the Linux host VM. At installation, a user is created to allow managing the host VM with username "vmadmin". The password is randomly generated and shown during installation. If you need to reset your VM password, you can do this using the Azure Portal or by following these instructions.

To connect to your host VM, you can either

  1. Construct your ssh connection string if you have the VM name: ssh vmadmin@<hostname>, OR
  2. Navigate to the Connect button on the Overview blade of your Azure VM instance, then copy the ssh connection string.

Paste the ssh connection string in a command line, PowerShell or terminal application to log in.

Customize your Cromwell on Azure deployment

Before deploying, you can choose to customize some input parameters to use existing Azure resources. Example:

.\deploy-cromwell-on-azure.exe --SubscriptionId <Your subscription ID> --RegionName <Your region> --MainIdentifierPrefix <Your string> --VmSize "Standard_D2_v2"

Here is the summary of all configuration parameters:

| Configuration parameter | Has default | Validated | Used by update | Comment |
| --- | --- | --- | --- | --- |
| string SubscriptionId | N | Y | Y | Azure Subscription Id - Always required |
| string RegionName | N | Y | N | Azure region name to deploy to - Required for new install |
| string MainIdentifierPrefix = "coa" | Y | Y | N | Prefix for all resources to be deployed - Required to deploy but defaults to "coa" |
| string VmOsVersion = "18.04-LTS" | Y | N | N | OS Version of the Linux Ubuntu VM to use as the host - Not required and defaults to Ubuntu 18.04 LTS |
| string VmSize = "Standard_D3_v2" | Y | N | N | VM size of the Linux Ubuntu VM to use as the host - Not required and defaults to Standard_D3_v2 |
| string VmUsername = "vmadmin" | Y | N | Y | Username created on Cromwell on Azure Linux host - Not required and defaults to "vmadmin" |
| string VmPassword | Y | N | Y | Required for update |
| string VnetResourceGroupName | Y | Y | N | Available starting version 2.1. The resource group name of the specified virtual network to use - Not required, generated automatically if not provided. If specified, VnetName and SubnetName must be provided. |
| string VnetName | Y | Y | N | Available starting version 2.1. The name of the specified virtual network to use - Not required, generated automatically if not provided. If specified, VnetResourceGroupName and SubnetName must be provided. |
| string SubnetName | Y | Y | N | Available starting version 2.1. The subnet name of the specified virtual network to use - Not required, generated automatically if not provided. If specified, VnetResourceGroupName and VnetName must be provided. |
| string ResourceGroupName | Y | Y | Y | Required for update. If provided for new Cromwell on Azure deployment, it must already exist. |
| string BatchAccountName | Y | N | N | The name of the Azure Batch Account to use; must be in the SubscriptionId and RegionName provided - Not required, generated automatically if not provided |
| string StorageAccountName | Y | N | N | The name of the Azure Storage Account to use; must be in the SubscriptionId provided - Not required, generated automatically if not provided |
| string NetworkSecurityGroupName | Y | N | N | The name of the Network Security Group to use; must be in the SubscriptionId provided - Not required, generated automatically if not provided |
| string CosmosDbAccountName | Y | N | N | The name of the Cosmos Db Account to use; must be in the SubscriptionId provided - Not required, generated automatically if not provided |
| string ApplicationInsightsAccountName | Y | N | N | The name of the Application Insights Account to use; must be in the SubscriptionId provided - Not required, generated automatically if not provided |
| string VmName | Y | N | Y | Name of the VM host that is part of the Cromwell on Azure deployment to update - Required for update if multiple VMs exist in the resource group |
| string CromwellVersion | Y | N | Y | Cromwell version to use |
| bool SkipTestWorkflow = false | Y | Y | Y | Set to true to skip running the default test workflow |
| bool Update = false | Y | Y | Y | Set to true if you want to update your existing Cromwell on Azure deployment to the latest version. Required for update |
| bool PrivateNetworking = false | Y | Y | N | Available starting version 2.2. Set to true to create the host VM without public IP address. If set, VnetResourceGroupName, VnetName and SubnetName must be provided (and already exist). The deployment must be initiated from a machine that has access to that subnet. |

Use a specific Cromwell version

Before deploying Cromwell on Azure

To choose a specific Cromwell version, you can specify the version as a configuration parameter before deploying Cromwell on Azure. Here is an example:

.\deploy-cromwell-on-azure.exe --SubscriptionId <Your subscription ID> --RegionName <Your region> --MainIdentifierPrefix <Your string> --CromwellVersion 53

This version will persist through future updates until you set it again or revert to the default behavior by specifying --CromwellVersion "". See note below.

After Cromwell on Azure has been deployed

After deployment, you can still change the Cromwell docker image version being used.

Cromwell on Azure version 2.x

Run the deployer in update mode and specify the new Cromwell version.

.\deploy-cromwell-on-azure.exe --Update true --SubscriptionId <Your subscription ID> --ResourceGroupName <Your RG> --VmPassword <Your VM password> --CromwellVersion 54

The new version will persist through future updates until you set it again. To revert to the default Cromwell version that is shipped with each deployer version, specify --CromwellVersion "". Be aware of compatibility issues if downgrading the version. The default version is listed here.

Cromwell on Azure version 1.x

Log on to the host VM using the ssh connection string as described in the instructions. Replace image name with the tag of your choice for the "cromwell" service in the docker-compose.yml file.

cd /data/cromwellazure/
sudo nano docker-compose.yml
# Modify the cromwell service image name and save the file

For these changes to take effect, be sure to restart your Cromwell on Azure VM through the Azure Portal UI or run sudo reboot. You can also restart the docker containers.

Use input data files from an existing Azure storage account that my lab or team is currently using

If the VM can be granted 'Contributor' access to the storage account:
  1. Add the VM identity as a Contributor to the Storage Account via Azure Portal or Azure CLI.

  2. Navigate to the "configuration" container in the default storage account. Replace the values below with your Storage Account and Container names and add the line to the end of the containers-to-mount file:

    /yourstorageaccountname/yourcontainername
    
  3. Save the changes and restart the VM

If the VM cannot be granted Contributor access to the storage account:

This is applicable if the VM and storage account are in different Azure tenants, or if you want to use a SAS token anyway for security reasons.

  1. Add a SAS URL for your desired container to the end of the containers-to-mount file. The SAS token can be at the account or container level and may be read-only or read-write depending on the usage.

    https://<yourstorageaccountname>.blob.core.windows.net:443/<yourcontainername>?<sastoken>
    
  2. Save the changes and restart the VM

In both cases, the specified containers will be mounted as /yourstorageaccountname/yourcontainername/ on the Cromwell server. You can then use /yourstorageaccountname/yourcontainername/path in the trigger, WDL, CWL, inputs and workflow options files.
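The two entry formats for the containers-to-mount file, and the resulting mount path, can be sketched as follows. This is an illustration under stated assumptions, not part of Cromwell on Azure: the function names are hypothetical, and the code only formats strings in the shapes shown above. It does not edit the file or validate the SAS token.

```python
from typing import Optional


def mount_entry(storage_account: str, container: str, sas_token: Optional[str] = None) -> str:
    """Format one line for the containers-to-mount file.

    Without a SAS token, returns the /account/container form used when the
    VM identity has Contributor access; with one, returns the SAS URL form.
    """
    if sas_token is None:
        return f"/{storage_account}/{container}"
    token = sas_token.lstrip("?")  # tolerate tokens copied with a leading "?"
    return f"https://{storage_account}.blob.core.windows.net:443/{container}?{token}"


def mounted_path(storage_account: str, container: str) -> str:
    """Path under which the container appears on the Cromwell server."""
    return f"/{storage_account}/{container}/"
```

Either entry form yields the same mounted path, so trigger, WDL, CWL, inputs and workflow options files do not need to change if you later switch between the two access methods.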

Use a batch account for which I have already requested or received increased cores quota from Azure Support

Log on to the host VM using the ssh connection string as described in the instructions.

Cromwell on Azure version 2.x

Replace BatchAccountName variable in the env-01-account-names.txt file with the name of the desired batch account and save your changes.

cd /data/cromwellazure/
sudo nano env-01-account-names.txt
# Modify the BatchAccountName to your Batch Account name and save the file

Cromwell on Azure version 1.x

Replace BatchAccountName environment variable for the "tes" service in the docker-compose.yml file with the name of the desired batch account and save your changes.

cd /data/cromwellazure/
sudo nano docker-compose.yml
# Modify the BatchAccountName to your Batch Account name and save the file

To allow the host VM to use a Batch account, add the VM identity as a Contributor to the Azure Batch account via Azure Portal or Azure CLI.
To allow the host VM to read prices and information about types of machines available for the batch account, add the VM identity as a Billing Reader to the subscription with the configured Batch Account.

For these changes to take effect, be sure to restart your Cromwell on Azure VM through the Azure Portal UI or run sudo reboot.

Use private Docker containers hosted on Azure

Cromwell on Azure supports private Docker images for your WDL tasks hosted on Azure Container Registry (ACR). To allow the host VM to use an ACR, add the VM identity as a Contributor to the Container Registry via Azure Portal or Azure CLI.

Configure my Cromwell on Azure instance to always use dedicated batch VMs to avoid getting preempted

By default, your workflows will run on low-priority Azure Batch nodes.

If you prefer to use dedicated Azure Batch nodes for all tasks, do the following:

Cromwell on Azure version 2.x

Open the cromwell-application.conf file in the "configuration" container of the default storage account. In the backend section, change preemptible: true to preemptible: false. Save your changes and restart the VM.

Note that you can override this setting for each task individually by setting the preemptible boolean flag to true or false in the "runtime" attributes section of your task.

Cromwell on Azure version 1.x

Log on to the host VM using the ssh connection string as described in the instructions. Change the UsePreemptibleVmsOnly environment variable for the "tes" service to "false" in the docker-compose.yml file and save your changes.

cd /data/cromwellazure/
sudo nano docker-compose.yml
# Modify UsePreemptibleVmsOnly to false and save the file

For these changes to take effect, be sure to restart your Cromwell on Azure VM through the Azure Portal UI or run sudo reboot.

Access the Cromwell REST API directly from Linux host VM

Cromwell is run in server mode on the Linux host VM. After logging in to the host VM, it can be accessed via curl as described below:

Get all workflows

curl -X GET "http://localhost:8000/api/workflows/v1/query" -H "accept: application/json"

Get specific workflow's status by id

curl -X GET "http://localhost:8000/api/workflows/v1/{id}/status" -H "accept: application/json"

Get call-caching difference between two workflow calls

curl -X GET "http://localhost:8000/api/workflows/v1/callcaching/diff?workflowA={workflowId1}&callA={workflowName.callName1}&workflowB={workflowId2}&callB={workflowName.callName2}" -H "accept: application/json"

You can perform other Cromwell API calls following a similar pattern. To see all available API endpoints, see Cromwell's REST API here
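The same calls can be scripted when you need to poll or post-process results. The following is a minimal sketch using only the Python standard library: the helper names are hypothetical, the base URL and endpoint paths are copied from the curl examples above, and `get_json` must be run on the host VM because it targets localhost.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Base URL taken from the curl examples above.
CROMWELL_BASE_URL = "http://localhost:8000/api/workflows/v1"


def query_url() -> str:
    """URL that lists all workflows."""
    return f"{CROMWELL_BASE_URL}/query"


def status_url(workflow_id: str) -> str:
    """URL for a specific workflow's status."""
    return f"{CROMWELL_BASE_URL}/{workflow_id}/status"


def callcaching_diff_url(workflow_a: str, call_a: str, workflow_b: str, call_b: str) -> str:
    """URL for the call-caching difference between two workflow calls."""
    params = urlencode(
        {"workflowA": workflow_a, "callA": call_a, "workflowB": workflow_b, "callB": call_b}
    )
    return f"{CROMWELL_BASE_URL}/callcaching/diff?{params}"


def get_json(url: str, timeout: float = 30.0):
    """Send a GET request with the same accept header as the curl examples."""
    request = Request(url, headers={"accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `get_json(status_url(workflow_id))` mirrors the second curl command above.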

Performance and Optimization

Cost analysis for Cromwell on Azure

To learn more about your Cromwell on Azure Resource Group's cost, navigate to the "Cost Analysis" menu item in the "Cost Management" section of your Azure Resource Group on the Azure Portal. More information here.

You can also use the Pricing Calculator to estimate your monthly cost.

How Cromwell on Azure selects batch VMs to run tasks in a workflow

VM price data is used to select the cheapest per-hour VM that meets a task's runtime requirements, and is also stored in the TES database to allow calculation of total workflow cost. VM price data is obtained from the Azure RateCard API. Accessing the Azure RateCard API requires the Billing Reader role to be assigned to the VM's identity at your Azure subscription scope. If you don't have the Owner role, or both the Contributor and User Access Administrator roles, assigned on your Azure subscription, the deployer will not be able to complete this on your behalf - you will need to contact your Azure subscription administrator(s) to complete this for you. You will see a warning in the TES logs indicating that default VM prices are being used until this is resolved.
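The selection rule described above (cheapest per-hour VM that satisfies the task's runtime requirements) can be illustrated with a short sketch. This is not the TES implementation, which may weigh additional factors such as quota or low-priority availability: the `VmOffer` type, the `cheapest_vm` function, and the VM names, sizes and prices in `EXAMPLE_OFFERS` are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VmOffer:
    """Hypothetical record of one VM size and its hourly price."""
    name: str
    cpu_cores: int
    memory_gb: float
    disk_gb: float
    price_per_hour: float


def cheapest_vm(
    offers: List[VmOffer], cpu_cores: int, memory_gb: float, disk_gb: float
) -> Optional[VmOffer]:
    """Return the lowest-priced offer meeting all requirements, or None."""
    eligible = [
        offer
        for offer in offers
        if offer.cpu_cores >= cpu_cores
        and offer.memory_gb >= memory_gb
        and offer.disk_gb >= disk_gb
    ]
    return min(eligible, key=lambda offer: offer.price_per_hour, default=None)


# Made-up offers and prices, for illustration only (not real Azure data).
EXAMPLE_OFFERS = [
    VmOffer("vm-a", cpu_cores=2, memory_gb=8, disk_gb=50, price_per_hour=0.10),
    VmOffer("vm-b", cpu_cores=4, memory_gb=16, disk_gb=100, price_per_hour=0.20),
    VmOffer("vm-c", cpu_cores=8, memory_gb=32, disk_gb=200, price_per_hour=0.15),
]
```

Note that a larger VM can win when it is priced lower than a smaller one that also fits, which is why the rule compares price only after filtering on requirements.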

Optimize my WDLs

This section is COMING SOON.

Miscellaneous

Get container logs to debug issues

The host VM is running multiple Docker containers that enable Cromwell on Azure - mysql, broadinstitute/cromwell, cromwellonazure/tes, cromwellonazure/triggerservice. On rare occasions, you may want to debug and diagnose issues with the Docker containers. After logging in to the host VM, run:

sudo docker ps

This command will list the names of all the Docker containers currently running. To get logs for a particular container, run:

sudo docker logs 'containerName'

I am running a large number of workflows and the MySQL storage disk is full

To ensure that no data is corrupted for MySQL-backed storage for Cromwell, Cromwell on Azure mounts MySQL files on an Azure Managed Data Disk of size 32 GB. If you need to increase the size of this data disk, follow the instructions here.

Running CWL Workflows on Cromwell on Azure

Running workflows written in the Common Workflow Language (CWL) format is possible with a few modifications to your workflow submission. For CWL workflows, all CWL resource keywords are supported, plus preemptible (not in the CWL spec). preemptible defaults to true (set in the Cromwell configuration file), so specify preemptible only when setting it to false (to run on a dedicated machine). TES keywords are also supported in CWL workflows, but we advise users to use the CWL ones.

CWL keywords: (CWL workflows only)
coresMin: number
ramMin: size in MB
tmpdirMin: size in MB - Cromwell on Azure version 2.0 and above only
outdirMin: size in MB - Cromwell on Azure version 2.0 and above only
(the final disk size is the sum of the tmpdirMin and outdirMin values)

TES keywords: (both CWL and WDL workflows)
preemptible: true|false

Cromwell on Azure version 1.x known issue for CWL files: Cannot request specific HDD size. Unfortunately, this is a bug in how Cromwell currently parses CWL files, and thus must be addressed in the Cromwell source code directly. The current workaround is to increase the number of vCPUs or the memory requested for a task, which will indirectly increase the amount of working disk space available. However, because this may cause inconsistent performance, we advise that if you are running a task that might consume a large amount of local scratch space, you consider converting your workflow to the WDL format instead.