Update (October 2021) – Added more information about the workflows developed by the Broad Institute.
Fewer than 70 years separate us from one of the greatest discoveries of all time: the double helix structure of DNA. We now know that DNA is a sort of twisted ladder composed of four types of compounds, called bases. These four bases are usually identified by an uppercase letter: adenine (A), guanine (G), cytosine (C), and thymine (T). One of the reasons for the double helix structure is that when these compounds are at the two sides of the ladder, A always bonds with T, and C always bonds with G.
If we unroll the ladder on a table, we’d see two sequences of “letters”, and each of the two sides would carry the same genetic information. For example, here are two series (AGCT and TCGA) bound together:
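The pairing rule can be sketched in a few lines of Python. This is only an illustration of the A–T and C–G bonding described above: the function name `paired_strand` is made up for this example, and it pairs bases position by position, ignoring strand direction.

```python
# Base pairing as described above: A bonds with T, and C bonds with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def paired_strand(strand):
    """Return the base that sits opposite each position of `strand`."""
    return "".join(PAIRS[base] for base in strand.upper())


print(paired_strand("AGCT"))  # prints TCGA
```

Because each base determines its partner, either side of the ladder is enough to rebuild the other, which is why both sides carry the same information.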
These series of letters can be very long. For example, the human genome is composed of over 3 billion letters of code and acts as the biological blueprint of every cell in a person. The information in a person’s genome can be used to create highly personalized treatments to improve the health of individuals and even the entire population. Similarly, genomic data can be used to track infectious diseases, improve diagnosis, and even track epidemics, food pathogens and toxins. This is the emerging field of environmental genomics.
Accessing genomic data requires genome sequencing, which, with recent advances in technology, can be done for large groups of individuals more quickly and cost-effectively than ever before. In the next five years, genomics datasets are estimated to grow and contain more than a billion sequenced genomes.
How Genomics Data Analysis Works
Genomics data analysis uses a variety of tools that need to be orchestrated as a specific sequence of steps, or a workflow. To facilitate developing, sharing, and running workflows, the genomics and bioinformatics communities have developed specialized workflow definition languages like WDL, Nextflow, CWL, and Snakemake.
However, this process generates petabytes of raw genomic data, and experts in genomics and the life sciences struggle to scale compute and storage resources to handle data at such a massive scale.
To process data and provide answers quickly, cloud resources like compute, storage, and networking need to be configured to work together with analysis tools. As a result, scientists and researchers often have to spend valuable time deploying infrastructure and modifying open-source genomics analysis tools instead of making contributions to genomics innovations.
Introducing Amazon Genomics CLI
A couple of months ago, we shared the preview of Amazon Genomics CLI, a tool that makes it easier to process genomics data at petabyte scale on AWS. I am excited to share that the Amazon Genomics CLI is now an open source project and is generally available today. You can use it with publicly available workflows as a starting point and develop your analysis on top of these.
Amazon Genomics CLI simplifies and automates the deployment of cloud infrastructure, providing you with an easy-to-use command line interface to quickly set up and run genomics workflows on AWS. By removing the heavy lifting from setting up and running genomics workflows in the cloud, software developers and researchers can automatically provision, configure, and scale cloud resources to enable faster and more cost-effective population-level genetics studies, drug discovery cycles, and more.
Amazon Genomics CLI lets you run your workflows on an optimized cloud infrastructure. More specifically, the CLI:
- Includes improvements to genomics workflow engines to make them integrate better with AWS, removing the burden to manually modify open-source tools and tune them to run efficiently at scale. These tools work seamlessly across Amazon Elastic Container Service (Amazon ECS), Amazon DynamoDB, Amazon Elastic File System (Amazon EFS), and Amazon Simple Storage Service (Amazon S3), helping you to scale compute and storage and at the same time optimize your costs using features like EC2 Spot Instances.
- Eliminates the most time-consuming tasks like provisioning storage and compute capacities, deploying the genomics workflow engines, and tuning the clusters used to execute workflows.
- Automatically increases or decreases cloud resources based on your workloads, which eliminates the risk of buying too much or too little capacity.
- Tags resources so that you can use tools like AWS Cost & Usage Report to understand the costs related to your genomics data analysis across multiple AWS services.
The use of Amazon Genomics CLI is based on these three main concepts:
Workflow – These are bioinformatics workflows written in languages like WDL or Nextflow. They can be either single script files or packages of multiple files. These workflow script files are workflow definitions. Combined with additional metadata, such as the workflow language the definition is written in, they form a workflow specification that the CLI uses to execute workflows on appropriate compute resources.
Context – A context encapsulates and automates time-consuming tasks to configure and deploy workflow engines, create data access policies, and tune compute clusters (managed using AWS Batch) for operation at scale.
Project – A project links together workflows, datasets, and the contexts used to process them. From a user perspective, it handles resources related to the same problem or used by the same team.
Let’s see how this works in practice.
Using Amazon Genomics CLI
I follow the instructions to install Amazon Genomics CLI on my laptop. Now, I can use the agc command to manage genomic workloads. I see the available options with:
The first time I use it, I activate my AWS account:
This creates the core infrastructure that Amazon Genomics CLI needs to operate, which includes an S3 bucket, a virtual private cloud (VPC), and a DynamoDB table. The S3 bucket is used for durable metadata, and the VPC is used to isolate compute resources.
Optionally, I can bring my own VPC. I can also use one of my named profiles for the AWS Command Line Interface (CLI). In this way, I can customize the AWS Region and the AWS account used by the Amazon Genomics CLI.
I configure my email address in the local settings. This will be used to tag resources created by the CLI:
There are a few demo projects in the examples folder included by the Amazon Genomics CLI installation. These projects use different engines, such as Cromwell or Nextflow. In the demo-wdl-project folder, the agc-project.yaml file describes the workflows, the data, and the contexts for the Demo project:
---
name: Demo
schemaVersion: 1
workflows:
  hello:
    type:
      language: wdl
      version: 1.0
    sourceURL: workflows/hello
  read:
    type:
      language: wdl
      version: 1.0
    sourceURL: workflows/read
  haplotype:
    type:
      language: wdl
      version: 1.0
    sourceURL: workflows/haplotype
  words-with-vowels:
    type:
      language: wdl
      version: 1.0
    sourceURL: workflows/words
data:
  - location: s3://gatk-test-data
    readOnly: true
  - location: s3://broad-references
    readOnly: true
contexts:
  myContext:
    engines:
      - type: wdl
        engine: cromwell
  spotCtx:
    requestSpotInstances: true
    engines:
      - type: wdl
        engine: cromwell
For this project, there are four workflows (hello, read, words-with-vowels, and haplotype). The project has read-only access to two S3 buckets and can run workflows using two contexts. Both contexts use the Cromwell engine. One context (spotCtx) uses Amazon EC2 Spot Instances to optimize costs.
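To make the relationship between workflows, data, and contexts more concrete, here is a rough Python sketch that mirrors the Demo project as a plain dictionary and answers the same questions as the paragraph above. The dictionary is typed in by hand so the example needs no YAML parser, and the helper names `spot_contexts` and `engines_in_use` are made up for this illustration. They are not part of Amazon Genomics CLI.

```python
# The Demo project from agc-project.yaml, written out as a plain dict.
# Only the fields used below are included.
project = {
    "name": "Demo",
    "workflows": {
        "hello": "workflows/hello",
        "read": "workflows/read",
        "haplotype": "workflows/haplotype",
        "words-with-vowels": "workflows/words",
    },
    "data": [
        {"location": "s3://gatk-test-data", "readOnly": True},
        {"location": "s3://broad-references", "readOnly": True},
    ],
    "contexts": {
        "myContext": {"engines": [{"type": "wdl", "engine": "cromwell"}]},
        "spotCtx": {
            "requestSpotInstances": True,
            "engines": [{"type": "wdl", "engine": "cromwell"}],
        },
    },
}


def spot_contexts(proj):
    """Names of the contexts that request Spot Instances (hypothetical helper)."""
    return sorted(
        name
        for name, ctx in proj["contexts"].items()
        if ctx.get("requestSpotInstances", False)
    )


def engines_in_use(proj):
    """Set of engine names referenced by any context (hypothetical helper)."""
    return {
        engine["engine"]
        for ctx in proj["contexts"].values()
        for engine in ctx["engines"]
    }


print(len(project["workflows"]), "workflows")
print("Spot contexts:", spot_contexts(project))
print("Engines:", engines_in_use(project))
```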
In the demo-wdl-project folder, I use the Amazon Genomics CLI to deploy the spotCtx context:
After a few minutes, the context is ready, and I can execute the workflows. Once started, a context incurs about $0.40 per hour of baseline costs. These costs don’t include the resources created to execute workflows. Those resources depend on your specific use case. Contexts have the option to use Spot Instances by adding the requestSpotInstances flag to their configuration.
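As a back-of-the-envelope illustration, the baseline figure above can be turned into a small estimate. The sketch assumes the approximate $0.40 per hour rate stays constant and, like the figure itself, leaves out whatever the workflows consume. The function name `baseline_cost` is made up for this example.

```python
BASELINE_USD_PER_HOUR = 0.40  # approximate baseline rate quoted above


def baseline_cost(hours_deployed, rate=BASELINE_USD_PER_HOUR):
    """Estimate the baseline cost of a deployed context, in US dollars."""
    return round(hours_deployed * rate, 2)


print(baseline_cost(24))       # one day: 9.6
print(baseline_cost(24 * 30))  # a 30-day month: 288.0
```

A context left running for a month costs a few hundred dollars before any workflow runs, which is a good reason to destroy contexts that are no longer needed.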
I use the CLI to see the status of the contexts of the project:
Now, let’s look at the workflows included in this project:
The simplest workflow is hello. The content of the hello.wdl file is quite understandable if you know any programming language:
The hello workflow defines a single task (hello) that prints the output of a command. The task is executed on a specific container image (ubuntu:latest). The output is taken from standard output (stdout), the default file descriptor where a process can write output.
Running workflows is an asynchronous process. After submitting a workflow from the CLI, it is handled entirely in the cloud. I can run multiple workflows at a time. The underlying compute resources will automatically scale and I will be charged only for what I use.
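Because execution is asynchronous, a script that needs the result has to check back until the workflow reaches a final state. The Python sketch below shows one way to do that. It is not part of Amazon Genomics CLI: `get_status` is a hypothetical callable (for example, a wrapper around the CLI's status command), and the status names in `TERMINAL_STATES` are assumptions for illustration.

```python
import time

# Assumed status names, for illustration only.
TERMINAL_STATES = {"COMPLETE", "FAILED", "CANCELLED"}


def wait_for_workflow(get_status, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Call get_status() until it returns a terminal state.

    Gives up after max_polls calls and returns the last status seen.
    """
    status = get_status()
    for _ in range(max_polls - 1):
        if status in TERMINAL_STATES:
            break
        sleep(poll_seconds)
        status = get_status()
    return status


# Usage with a fake status source standing in for the real service.
fake_statuses = iter(["RUNNING", "RUNNING", "COMPLETE"])
final = wait_for_workflow(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final)  # prints COMPLETE
```

Passing `sleep` as a parameter keeps the function easy to try out without waiting. In real use, the default `time.sleep` applies.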
Using the CLI, I start the hello workflow:
The workflow was successfully submitted, and the last line is the workflow execution ID. I can use this ID to reference a specific workflow execution. Now, I check the status of the workflow:
The hello workflow is still running. After a few minutes, I check again:
The workflow has terminated and is now complete. I look at the workflow logs:
In the logs, I find, as expected, the Hello Amazon Genomics CLI! message printed by the workflow.
I can also look at the content of hello-stdout.log on S3 using the information in the log above:
It worked! Now, let’s look at more complex workflows. Before I change project, I destroy the context for the Demo project:
In the gatk-best-practices-project folder, I list the available workflows for the project:
This project includes workflows based on GATK Best Practices, developed by the Broad Institute. More information on how these workflows work is available in the GATK Workflows GitHub repository.
In the agc-project.yaml file, the gatk4-data-processing workflow points to a local directory with the same name. This is the content of that directory:
This workflow processes high-throughput sequencing data with GATK4, a genomic analysis toolkit focused on variant discovery.
The directory contains a MANIFEST.json file. The manifest file describes which file contains the main workflow to execute (there can be more than one WDL file in the directory) and where to find input parameters and options. Here’s the content of the manifest file:
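To illustrate the role the manifest plays, here is a hedged Python sketch that reads a manifest-like JSON document and pulls out the three pieces of information described above. The key names (`mainWorkflowURL`, `inputFileURLs`, `optionFileURL`) and the file names are assumptions made for this example, so check the project documentation for the exact schema. `describe_manifest` is a made-up helper.

```python
import json

# Illustrative manifest: key names and file names are assumptions.
manifest_text = """
{
  "mainWorkflowURL": "main.wdl",
  "inputFileURLs": ["inputs.json"],
  "optionFileURL": "options.json"
}
"""


def describe_manifest(text):
    """Extract the main workflow, input files, and options file (hypothetical helper)."""
    manifest = json.loads(text)
    return {
        "main": manifest["mainWorkflowURL"],
        "inputs": manifest.get("inputFileURLs", []),
        "options": manifest.get("optionFileURL"),
    }


print(describe_manifest(manifest_text))
```

In this sketch only the main workflow is treated as mandatory. Inputs and options fall back to empty values when absent.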
In the gatk-best-practices-project folder, I create a context to run the workflows:
Then, I start the gatk4-data-processing workflow:
After a couple of hours, the workflow has terminated:
I look at the logs:
Results have been written to the S3 bucket created during the account activation. The name of the bucket is in the logs but I can also find it stored as a parameter by AWS Systems Manager. I can save it in an environment variable with the following command:
Using the AWS Command Line Interface (CLI), I can now explore the results on the S3 bucket and get the outputs of the workflow.
Before looking at the results, I remove the resources that I don’t need by stopping the context. This will destroy all compute resources, but retain data in S3.
Additional examples on configuring different contexts and running additional workflows are provided in the documentation on GitHub.
Availability and Pricing
Amazon Genomics CLI is an open source tool, and you can use it today in all AWS Regions with the exception of AWS GovCloud (US) and Regions located in China. There is no cost for using the Amazon Genomics CLI. You pay for the AWS resources created by the CLI.
With the Amazon Genomics CLI, you can focus on science instead of architecting infrastructure. This gets you up and running faster, enabling research, development, and testing workloads. For production workloads that scale to several thousand parallel workflows, we can provide recommended ways to leverage additional Amazon services, like AWS Step Functions. Just reach out to our account teams for more information.
— Danilo