This is the repo for the evaluation pipeline of alphafold predictions.

Install

Please ensure you have conda installed on your machine. Miniconda is sufficient for this pipeline. Then create a conda virtual environment via conda env create --file environment.yml. Activate the environment using conda activate protein_env.

Data

Directory

The pipeline defaults to certain data directory structure shown as follows:

data_dir
   |
   |_input
   |   |
   |   |_dataset_name
   |   |        |
   |   |        |_pdbs (pdb files can be automatically downloaded)
   |   |        |
   |   |        |_source_fasta (fasta files require manual download)
   |   |        |
   |   |        |_(poly_g_20) (linked fasta if poly glycine link strategy is chosen)
   |   |
   |   |_dataset_spec_1 (json file)
   |   |_dataset_spec_2 (json file)
   |   |_...
   |
   |_output
       |
       |_dataset+alphafold_model_name
                 |
                 |_strategy_name (e.g. poly_g_20)
                        |
                        |_individual pdb prediction (e.g. 1C8O.fasta)
                                     |
                                     |_ranked_0.json (requires manual download)
                                     |
                                     |_ranking_debug.json (requires manual download)

An example data directory (sampled_data) is provided.

sampled_data/input/ds1/pdbs stores three sample pdb proteins.
sampled_data/input/idr/poly_g_20 stores the corresponding amino acids sequence (fasta format) for each of these three pdb file (poly glycine linker length 20).
sampled_data/output/ds1_af_full/poly_g_20/{pdb_id}.fasta stores prediction files of alphafold for the corresponding protein.

Dataset

Four example dataset are provided:

idr dataset 1: This dataset has proteins posted after 2021.
idr dataset 2: This dataset has proteins posted up until 2018-04-30 (spec file rg-dimers-to-2018-04-30.json).
idr dataset 3: This dataset has proteins posted from 2018-05-01 (spec file rg-dimers-from-2018-05-01.json).
idr dataset 4: This dataset has proteins posted up until 2020-05-15 (spec file rg-dimers-from-2020-05-15.json).

Run

Input processing

Input processing deals with pdb and fasta file download and also gathers necessary info about the pdb (chain id, boundary residue id of each chain, and start residue id of each chain for each protein) which are necessary for the later output processing.

pdb download and processing

For ground-truth pdbs, either they are pre-downloaded and stored under data/input/{dataset_name}/pdbs ¹ or automatic download should be enabled. In case of automatic download, if a dataset spec file (json file that specifis, for each pdb id, relevant info incl. chain id etc.) is provided ², the pipeline will parse this file to get a list of pdb ids to download. Otherwise ³, a plain text file of comma separated pdb ids must be provided and stored under data/input/{dataset_name} for downloading. Note, sometimes pdb downloading may be interrupted and error will be incurred. If this happened, simply re-run the input processing step. The pipeline will only download the rest pdbs.

¹: download_pdbs=False
²: download_pdbs=True, has_dataset_spec=True
³: download_pdbs=True, has_dataset_spec=False

fasta download and processing

Fasta download can be skipped if you are using multimer strategy and have already downloaded fasta files for prediction¹. Otherwise, the pipline supports either generating fasta files via extracting sequences from ground-truth pdb files ² or users have to manually download fasta files from rcsb.org and store them under data/input/{dataset_name}/source_fasta. Note that rcbs.org have separate fasta files for each chain of a single protein. Depending on the prediction strategy, further processing may be necessary. If poly glycine linker strategy is used ³, two chains of a single protein needs to be linked together by poly glycine residues. If alphafold multimer strategy is used (and you haven't downloaded fasta files yet) ⁴, no further processing is required. Finally, please upload the fasta files yourself to the server where you are running alphafold.

¹: skip_fasta_download=True, strategy=multimer
²: skip_fasta_download=False, generate_fasta_from_pdb=True
³: skip_fasta_download=False, generate_fasta_from_pdb=False, strategy=poly_g_link
⁴: skip_fasta_download=False, generate_fasta_from_pdb=False, strategy=multimer

gather other info

For poly glycine linker strategy, as the linker needs to be removed during output processing, we need to record id of the 1st residue for each chain of each protein. This is done during the linking process. If you choose the multimer strategy, this processing will not be incurred.

As sequences from downloaded fasta files generally differ in a few residues from the ground-truth pdbs, it's better (required for pymol) that both predicted and ground-truth pdbs have the extra residues removed¹. Moreover, alphafold predictions have residue ids being contiguous and starting from 1. While first residue ids of ground-truth pdbs are not necessarily 1 and residue ids between chains may not be contiguous. To renumber alphafold prediction residue ids ² s.t. they are the same as ground-truth pdbs, we need to record id of the first and last residue of each chain for each ground-truth pdb. The info gathering process is done during input processing.

¹: remove_extra_residues=True
²: renumber_residues=True

Output processing

Please first download alphafold prediction files yourself and store the most confident prediction (ranked_0.pdb) and ranking metrics file (ranking_debug.json) under data/output/{dataset_name_alphafold_name}/{strategy}/{pdb_id}.fasta. Please refer to sample_data directory for examples.

If poly glycine linker strategy is used ¹, output processing removes linker from the prediction file. Otherwise ², prediction file will not be touched unless residue removal and renumbering are required.

¹ strategy=poly_g_link
² strategy=multimer

Then the pipeline can calculate metircs in two ways

Using pymol to superimpose the receptor of the ground-truth and predicted protein and then calculate rmsd values for the ligand ¹.
Using Biopython to calculate iRMS (interface RMSD for receptor), LRMS (ligand RMSD), and dockQ score (adapted from https://github.com/bjornwallner/DockQ) ².

¹ dockq=False
² dockq=True

Command

configs/sample.ini contains all necessary configurations to run the pipeline. For example, to run input processing, simply uncomment operations=[process_input] (keep other operations commented).

To run the pipeline, call python main.py --config configs/sample.ini.

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
archive		archive
configs		configs
modules		modules
sample_data		sample_data
scripts		scripts
utils		utils
.gitignore		.gitignore
Note.txt		Note.txt
environment.yml		environment.yml
main.py		main.py
parser.py		parser.py
readme.md		readme.md
run_pipeline.sh		run_pipeline.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

This is the repo for the evaluation pipeline of alphafold predictions.

Install

Data

Directory

Dataset

Run

Input processing

pdb download and processing

fasta download and processing

gather other info

Output processing

Command

About

Releases

Packages

Languages

Fred62879/proteinFold

Folders and files

Latest commit

History

Repository files navigation

This is the repo for the evaluation pipeline of alphafold predictions.

Install

Data

Directory

Dataset

Run

Input processing

pdb download and processing

fasta download and processing

gather other info

Output processing

Command

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages