The Micromeda platform allows users to predict which genome properties are possessed by organisms and then compare the presence and absence of these properties across organisms. The platform has three core components:
- Micromeda-Visualizer -- A web-based visualization tool that draws interactive heat maps of genome property and property step assignments. It has two components, Micromeda-Client and Micromeda-Server, and is available at micromeda.uwaterloo.ca.
- Pygenprop -- A Jupyter Notebook-compatible Python library for programmatically comparing property and step assignments across organisms. It also lets users explore the InterProScan annotations and protein sequences that support genome property assignments, and it provides a command-line interface (CLI) for generating Micromeda files.
- Micromeda Files -- These files allow for the aggregation and transfer of genome property assignments, step assignments, and supporting InterProScan annotations and protein sequences from multiple organisms. They allow for the transfer of complete property analysis datasets between researchers. They are also used to transfer datasets to Micromeda-Visualizer.
Analyzing datasets using Micromeda involves the following steps:
- Acquiring an organism's protein sequences
- Annotating an organism's proteins using InterProScan5
- Creating a Micromeda file using Pygenprop's CLI
- Uploading the Micromeda file to Micromeda-Visualizer
Micromeda files are built from FASTA protein sequences and InterProScan5 output files. Protein sequences are predicted by gene prediction software such as Prodigal.
The following pieces of software must be installed to generate Micromeda files:
The following pieces of software are used in the tutorial but are optional:
InterProScan5 takes an organism's predicted protein sequences as input. To obtain these proteins, genes must first be predicted using a gene prediction application (for a list, see https://en.wikipedia.org/wiki/List_of_gene_prediction_software) and then translated to protein sequences, either by the same software or by a second piece of software. Different software must be used for eukaryotic and prokaryotic genomes. This tutorial assumes that prokaryotic organisms are being analyzed, and Prodigal is used to predict their proteins.
InterProScan5 and Prodigal are more easily installed when you have the following installed:
NOTE: Docker for Mac is not a native macOS application but instead runs inside a Linux virtual machine. By default, this VM is allocated only half of the total CPUs and 2 GB of RAM. InterProScan may require more resources, such as additional RAM or CPUs. Users can follow the instructions in the Docker for Mac documentation to adjust RAM or CPU allocations, available at https://docs.docker.com/desktop/settings/mac/.
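Before running InterProScan, it may help to confirm how many CPUs and how much memory the Docker daemon currently sees. The sketch below queries `docker info` with a Go template; the helper names `bytes_to_gib` and `show_docker_resources` are illustrative only and are not part of any Micromeda tooling.

```shell
# Convert a byte count to GiB with one decimal place.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

# Print the CPUs and memory visible to the Docker daemon.
# NCPU and MemTotal are fields reported by `docker info`.
show_docker_resources() {
    cpus=$(docker info --format '{{.NCPU}}') || return 1
    mem_bytes=$(docker info --format '{{.MemTotal}}') || return 1
    printf 'CPUs: %s\nMemory: %s GiB\n' "$cpus" "$(bytes_to_gib "$mem_bytes")"
}

# Example (requires a running Docker daemon): show_docker_resources
```

If the reported values are lower than expected, the allocation can be raised in the Docker settings described above.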
InterProScan5 can be installed manually or it can be easily installed using our Docker-based installation.
docker build https://raw.githubusercontent.com/Micromeda/InterProScan-Docker/master/Dockerfile -t micromeda/interproscan-docker
You should also install our wrapper script that allows you to run InterProScan on files found outside of its container.
wget https://raw.githubusercontent.com/Micromeda/InterProScan-Docker/master/run_docker_interproscan.sh
chmod +x run_docker_interproscan.sh
pip install numpy # Numpy needs to be installed separately
pip install pygenprop
or alternatively
conda install -c conda-forge -c lbergstrand pygenprop
Prodigal can be installed by following the author's installation tutorial or, alternatively, by using Conda.
conda install -c bioconda prodigal
GNU parallel can be used to parallelize some workflows such as gene prediction across multiple processes.
# Debian/Ubuntu
apt-get install parallel
# OSX
brew install parallel
wget https://raw.githubusercontent.com/ebi-pf-team/genome-properties/master/flatfiles/genomeProperties.txt
Below is a tutorial that outlines how to build a single Micromeda file for one prokaryotic organism. Generating a Micromeda file for multiple organisms is discussed later in the document.
Use Prodigal to predict the organism's proteins. The `-a` flag is used to write the predicted protein sequences to a file.
prodigal -i ./genome.fasta -a ./genome.faa
Prodigal adds `*` characters, representing stop codons, to the end of its output protein sequences. The `*` character is not in the IUPAC protein alphabet. As a result, InterProScan5 throws an error when annotating Prodigal's output sequences. These `*` characters need to be removed to make the previously produced `.faa` file compatible with InterProScan5. Alternative gene prediction programs may produce `.faa` files without `*` characters. When using these programs' output, the `sed` step below can be skipped.
Use `sed` to remove these `*` characters.
sed -i 's/\*$//' genome.faa
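To confirm that the cleanup worked, one can count the lines that still end in `*`. The following is a minimal sketch; `count_trailing_stops` is a hypothetical helper name, and it assumes stop symbols only ever appear at the end of a line, as in Prodigal's output.

```shell
# Count lines in a FASTA file that still end with a '*' stop symbol.
# Prints 0 when the file is clean.
count_trailing_stops() {
    # grep -c exits non-zero when nothing matches, so mask that status.
    grep -c '\*$' "$1" || true
}

# Example: count_trailing_stops genome.faa
```

Note that the BSD `sed` bundled with macOS expects an explicit backup suffix after `-i` (for example, `sed -i '' 's/\*$//' genome.faa`), whereas GNU `sed` accepts `-i` on its own.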
Use the InterProScan5 Docker container to domain annotate the previously sanitized `.faa` file. For convenience, one can use `run_docker_interproscan.sh`, which simplifies using the container.
./run_docker_interproscan.sh genome.faa
This step produces an InterProScan `.tsv` annotation file called `genome.tsv`.
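A quick sanity check on the annotation file is to count hits per member database. The sketch below assumes InterProScan's standard tab-separated layout, in which the fourth column names the analysis (for example, Pfam); `count_hits_by_analysis` is a hypothetical helper and not part of InterProScan or Pygenprop.

```shell
# Print "<analysis><TAB><number of hits>" for an InterProScan TSV file,
# sorted by analysis name. Column 4 holds the analysis in the standard layout.
count_hits_by_analysis() {
    awk -F '\t' '{ hits[$4]++ }
        END { for (a in hits) printf "%s\t%d\n", a, hits[a] }' "$1" | sort
}

# Example: count_hits_by_analysis genome.tsv
```

An empty result, or one that lacks the databases you expected, suggests the InterProScan run did not complete normally.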
When Pygenprop is installed, its CLI is also installed and can be used to build Micromeda files, among a few other functions. The CLI's `build` command takes the `.tsv` file previously generated by `run_docker_interproscan.sh` as input.
pygenprop build -d ./genomeProperties.txt -i genome.tsv -o ./data.micro -p
The `build` command's `-p` flag is used to add protein sequences to the output Micromeda file. With this flag active, Pygenprop searches the FASTA files that were scanned by InterProScan for proteins that support genome property steps and adds them to the output Micromeda file. The FASTA files must be in the same directory as the InterProScan5 files and share the same basename (i.e., the filename without its file extension).
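Because `-p` depends on each annotation file having a matching FASTA file, it can be worth checking the pairing before building. A minimal sketch follows; `list_unpaired_tsv` is a hypothetical helper, and it assumes the protein FASTA files use the `.faa` extension as in this tutorial.

```shell
# List every .tsv file in a directory that has no .faa file
# with the same basename next to it. Prints nothing if all are paired.
list_unpaired_tsv() {
    dir="${1:-.}"
    for tsv in "$dir"/*.tsv; do
        [ -e "$tsv" ] || continue          # the glob matched nothing
        faa="${tsv%.tsv}.faa"              # swap the extension
        [ -f "$faa" ] || printf '%s\n' "$tsv"
    done
}

# Example: list_unpaired_tsv .
```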
As discussed above, creating a Micromeda file involves converting an input genome file through a series of steps.
genome.fasta --> genome.faa --> genome.tsv --> data.micro
The above steps can be applied to multiple genomes to create larger analysis datasets. Below we use `parallel` to parallelize specific steps across multiple cores. However, the same task could also be accomplished by placing the above commands, except for step four, in a shell script that loops through a series of input FASTA file paths.
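As a rough sketch of the looping alternative just mentioned, the function below runs steps one to three on each input file in turn. The names `process_genomes` and `IPRSCAN` are hypothetical, and the sketch assumes the input files end in `.fasta` and that `prodigal` is on the `PATH`.

```shell
# Path to the InterProScan wrapper; defaults to the script downloaded earlier.
IPRSCAN="${IPRSCAN:-./run_docker_interproscan.sh}"

# Run gene prediction, stop-symbol removal, and annotation on each
# genome FASTA file given as an argument, one file at a time.
process_genomes() {
    for fasta in "$@"; do
        base="${fasta%.fasta}"
        prodigal -i "$fasta" -a "$base.faa" || return 1
        # Same effect as `sed -i`, written to work with GNU and BSD sed.
        sed 's/\*$//' "$base.faa" > "$base.faa.tmp" &&
            mv "$base.faa.tmp" "$base.faa" || return 1
        "$IPRSCAN" "$base.faa" || return 1
    done
}

# Example: process_genomes ./*.fasta
```

Step four (`pygenprop build`) is then run once over all of the resulting `.tsv` files.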
For the steps below, let us assume that we have the following directory structure:
data/
├── ecoli_one.fasta
├── ecoli_two.fasta
The `find . -maxdepth 1 -name "*.fasta"` command finds all the FASTA files in the current working directory. This list is piped to `parallel`, which runs a Prodigal process on each file in parallel.
find . -maxdepth 1 -name "*.fasta" | parallel prodigal -i {} -a {.}.faa
Resulting Folder Structure
data/
├── ecoli_one.fasta
├── ecoli_one.faa
├── ecoli_two.fasta
├── ecoli_two.faa
Find all the `.faa` files and run `sed` on them in parallel to remove `*` characters.
find . -maxdepth 1 -name "*.faa" | parallel sed -i 's/\*$//' {}
Resulting Folder Structure: N/A (the files are edited in place)
Find all the `.faa` files and run InterProScan on them. The `-j 1` flag of `parallel` ensures that only one copy of InterProScan runs at a time (equivalent to using `xargs`). Because InterProScan is already multi-threaded, there is no need to run multiple processes.
find . -maxdepth 1 -name "*.faa" | parallel -j 1 ./run_docker_interproscan.sh {}
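For readers without GNU parallel, the `xargs` equivalent noted above could look roughly like the following. This is a sketch only; `annotate_sequentially` and the `IPRSCAN` variable are hypothetical names introduced for illustration.

```shell
# Path to the InterProScan wrapper; defaults to the script downloaded earlier.
IPRSCAN="${IPRSCAN:-./run_docker_interproscan.sh}"

# Run the wrapper on every .faa file in a directory, one file at a time.
# -print0 / -0 keep filenames containing spaces intact.
annotate_sequentially() {
    find "${1:-.}" -maxdepth 1 -name '*.faa' -print0 |
        xargs -0 -n 1 "$IPRSCAN"
}

# Example: annotate_sequentially .
```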
Resulting Folder Structure
data/
├── ecoli_one.fasta
├── ecoli_one.faa
├── ecoli_one.tsv
├── ecoli_two.fasta
├── ecoli_two.faa
├── ecoli_two.tsv
Build an output Micromeda file from multiple input InterProScan `.tsv` files.
pygenprop build -d ./genomeProperties.txt -i *.tsv -o ./data.micro -p
Resulting Folder Structure
data/
├── ecoli_one.fasta
├── ecoli_one.faa
├── ecoli_one.tsv
├── ecoli_two.fasta
├── ecoli_two.faa
├── ecoli_two.tsv
├── data.micro
Micromeda-Visualizer is available at micromeda.uwaterloo.ca. Users can upload Micromeda files to Micromeda-Visualizer for visualization via a drag and drop interface. The steps for creating Micromeda heat map visualizations are outlined below.
Navigate to micromeda.uwaterloo.ca.
Clicking the upload button will redirect the browser to the Micromeda file upload page.
Upload a Micromeda file using the drag and drop zone. Alternatively, the drop zone can be clicked to bring up a file selection menu.
It may take from a few minutes to tens of minutes for the Micromeda file to be uploaded and processed. After upload and processing are complete, the browser is automatically redirected to the heat map display page. Do not navigate away from the upload page manually after uploading a file, or all progress will be lost. Note that the upload progress bar may show 100%, but the page will not redirect until the Micromeda file has been completely processed on the server. Please be patient when uploading large Micromeda files (>30 organisms). The time taken to process the file on the server grows exponentially with input file size.
After server-side processing, the heat map is available for viewing. Only a single heat map can be viewed at one time. To create a new heat map or view a previous one, the corresponding Micromeda file must be reuploaded. Micromeda files and heat maps are stored on the server for only two hours; after that, the original Micromeda file must be reuploaded.
Before installing Micromeda-Server you should install Docker and Docker Compose.
Steps:
- Download the Docker Compose file.
  wget https://raw.githubusercontent.com/Micromeda/micromeda-server/master/docker-compose.yml
- Edit the Docker Compose file. For the line:
  - BACKEND_URL=http://0.0.0.0:5000/
  replace `0.0.0.0` with the server's recognized URL (e.g., micromeda.uwaterloo.ca):
  - BACKEND_URL=http://micromeda.uwaterloo.ca:5000/
  If you are running Micromeda-Server on your personal computer, then leave `0.0.0.0` in place.
- Build the front-end and back-end.
  docker-compose build
- Run the front-end and back-end.
  docker-compose up
  Note: You may want to run `docker-compose up` as a background process.
  docker-compose up -d
  The `-d` flag is for detached (i.e., background) mode.
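The `BACKEND_URL` edit described in the steps above can also be scripted, which may be convenient for repeated deployments. The sketch below is one possible approach rather than an official procedure; `set_backend_host` is a hypothetical helper, and it assumes the line in `docker-compose.yml` matches the form shown above exactly and that the host name contains no characters special to `sed` (such as `|` or `&`).

```shell
# Replace the 0.0.0.0 placeholder in BACKEND_URL with the given host.
# The original file is kept alongside with a .bak suffix.
set_backend_host() {
    host="$1"
    compose_file="${2:-docker-compose.yml}"
    sed -i.bak \
        "s|BACKEND_URL=http://0\.0\.0\.0:5000/|BACKEND_URL=http://${host}:5000/|" \
        "$compose_file"
}

# Example: set_backend_host micromeda.uwaterloo.ca
```

Supplying a suffix directly after `-i` (here `.bak`) is accepted by both GNU and BSD `sed`.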