Read the pre-print here: https://www.biorxiv.org/content/10.1101/2023.12.31.573788v2
This repo contains the environment and Snakemake pipeline needed to run the main workflow shown in Figure 1, which is used throughout the paper.
- Local installation of Singularity >= 3.10
- Local installation of Python >= 3.10
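Before running the setup script, it can help to confirm both prerequisites meet the minimum versions. A minimal sketch follows; `version_ge` is a hypothetical helper (not part of this repo) that compares dotted version strings with `sort -V`:

```shell
# Hypothetical helper: succeed if version $1 >= version $2 (not part of the repo).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example checks, assuming both tools are on PATH:
# version_ge "$(singularity version)" 3.10 || echo "Singularity too old"
# version_ge "$(python3 -V | cut -d' ' -f2)" 3.10 || echo "Python too old"
```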
```shell
./setup.sh
source venv_snakemake/bin/activate
snakemake --help
```
3. Build the Singularity container for OrthoFinder, HMMER, cath-resolve-hits, and all required R packages
```shell
sudo singularity build container.sif container.def
```
NOTE: you may see a message like "System has not been booted with systemd as init system (PID 1). Can't operate."
This does NOT affect our workflow; it only concerns datetime operations in the R tidyverse (which we don't use).
```shell
cd workflow
snakemake --cores all --use-singularity singularity_test
```
1. Put your proteome FASTA files of interest into data/proteomes. Four partial sample proteomes are included for testing.
2. Download Pfam 35.0 from https://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam35.0/: fetch Pfam-A.hmm.gz and Pfam-A.clans.tsv.gz, extract them, then move them to data/pfam (or to the location specified in snakemake_config.yaml).
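The Pfam download step above can be sketched as follows. This assumes the default data/pfam location and is run from the repo root; the actual download is commented out because the files are large:

```shell
# Sketch of fetching and staging the Pfam 35.0 files (default data/pfam location assumed).
PFAM_URL=https://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam35.0
# wget "$PFAM_URL/Pfam-A.hmm.gz" "$PFAM_URL/Pfam-A.clans.tsv.gz"  # large download

mkdir -p data/pfam
for f in Pfam-A.hmm.gz Pfam-A.clans.tsv.gz; do
  if [ -f "$f" ]; then
    gunzip -f "$f"             # extract
    mv "${f%.gz}" data/pfam/   # move to the expected location
  fi
done
```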
```shell
cd workflow  # assuming you're not already in that folder
snakemake --cores all --use-singularity
```
Note: these scripts require that you have already run the main workflow, so all orthogroups and domain architectures have already been assigned to your input proteomes.
Fig 2A - workflow/scripts/plot_ortholog_length_distributions.R
Fig 2B - workflow/scripts/plot_domain_arch_change_freqs.R
Fig 2C - workflow/scripts/compare_domain_and_linker_lengths.R
Fig 3B, 3C, 3D - workflow/scripts/classify_lost_c_terminal_residues.R
Fig 4B - workflow/scripts/annotate_domain_essentiality_with_ptcs.R
Fig 5A - workflow/scripts/domain_arch_hierarchical_clustering.R
Fig 5B - workflow/scripts/create_lost_doms_tsne.R
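A minimal sketch of reproducing one panel, assuming the container image built above provides Rscript and that the scripts are run from the repo root (the exact invocation may differ):

```shell
# Hypothetical invocation for Fig 2A; swap the path for other panels from the list above.
FIG_SCRIPT=workflow/scripts/plot_ortholog_length_distributions.R
echo "singularity exec container.sif Rscript $FIG_SCRIPT"  # inspect the command
# singularity exec container.sif Rscript "$FIG_SCRIPT"     # then run it
```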
workflow/scripts/annotate_domain_essentiality_with_ptcs.R
workflow/scripts/plot_ortholog_length_distributions.R
workflow/scripts/compare_ortholog_lengths_to_percent_identities.R