Protein Tertiary Structure Prediction

This project aims to reproduce selected parts of Mohammed AlQuraishi's work on end-to-end differentiable learning of protein structure (https://www.biorxiv.org/content/early/2018/08/29/265231) and of the work by Gao et al. on RaptorX (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2065-x).

Report with the results: https://drive.google.com/file/d/1-SFavU5i6XlHK2sswy60k5TowgButezy/view

In the main folder you'll find notebooks that show examples of how to use the model.

Note:

The training_50_dih.joblib and validation_dih.joblib files are not available because they have to be generated from the txt version of the data. The raw data represents each protein as 3D coordinate vectors, while the txt pipeline expects proteins already converted to dihedral angles. If you can't generate these files yourself, I recommend working with the full tensor-based pipeline (Full pipeline - tensor data.ipynb, or the model files directly) instead of the txt data.
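For orientation, the conversion from 3D coordinates to dihedral angles can be sketched as below. This is a minimal NumPy illustration, not the code from txt_data_utils: the function names, the separate N, CA and C coordinate arrays, and the sign convention are all assumptions that should be checked against the actual txt pipeline.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in radians defined by four 3D points."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond b1.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.arctan2(y, x)

def backbone_dihedrals(n, ca, c):
    """phi and psi per residue from (L, 3) arrays of N, CA and C coordinates.

    Hypothetical helper: angles that are undefined at the chain ends are NaN.
    """
    length = len(ca)
    phi = np.full(length, np.nan)
    psi = np.full(length, np.nan)
    for i in range(length):
        if i > 0:
            phi[i] = dihedral(c[i - 1], n[i], ca[i], c[i])
        if i < length - 1:
            psi[i] = dihedral(n[i], ca[i], c[i], n[i + 1])
    return phi, psi
```

The first residue has no phi and the last residue has no psi, so any downstream loss would need to mask those positions.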

Model configuration details

This TensorFlow model uses the ProteinNet dataset (in the tensor version), as available in the preliminary release here: https://github.com/aqlaboratory/proteinnet

Input

The input consists of amino acid sequences and evolutionary data (PSSM). Parsing is done by the DataHandler object, which is written in the old queue paradigm (instead of the newer TensorFlow data pipeline, tf.data).

Files

The files used as training inputs are selected based on the following variables:

data_path: path to the ProteinNet directory containing the CASP folders (each CASP folder then contains training, validation and test folders)
casps: a list of strings defining which CASP folders should be loaded
percentages: a list of integers defining which sequence identity clusters should be loaded
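As an illustration of how these three variables combine, the sketch below builds the list of candidate training directories. The helper name and the assumed layout (data_path/<casp>/training/<percentage>) are hypothetical; check them against your local copy of ProteinNet and against DataHandler before relying on them.

```python
import os

def training_dirs(data_path, casps, percentages):
    """List candidate training directories.

    Assumption: data is laid out as data_path/<casp>/training/<percentage>.
    """
    return [
        os.path.join(data_path, casp, "training", str(percentage))
        for casp in casps
        for percentage in percentages
    ]

# Example values only; the path and folder names are placeholders.
dirs = training_dirs("proteinnet", ["casp7"], [30, 50])
```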

Features

Controlled fully by a single boolean, include_evo, which controls whether evolutionary features are used together with the amino acid sequences.
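The effect of include_evo can be pictured as an optional concatenation along the feature axis. The sketch below uses NumPy rather than the queue-based DataHandler, and the function name, the 20-letter alphabet and the PSSM width of 21 are assumptions made for illustration.

```python
import numpy as np

N_AMINO_ACIDS = 20  # assumption: standard 20-letter amino acid alphabet

def build_features(seq_ids, pssm, include_evo=True):
    """One-hot encode a sequence and optionally append the PSSM columns.

    seq_ids: (L,) integer array of amino acid indices
    pssm:    (L, k) array of evolutionary features
    """
    one_hot = np.eye(N_AMINO_ACIDS, dtype=np.float32)[seq_ids]  # (L, 20)
    if not include_evo:
        return one_hot
    return np.concatenate([one_hot, pssm.astype(np.float32)], axis=-1)

seq_ids = np.array([0, 5, 19])
pssm = np.zeros((3, 21))  # assumption: 21 PSSM columns per residue
with_evo = build_features(seq_ids, pssm, include_evo=True)
without_evo = build_features(seq_ids, pssm, include_evo=False)
```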

Model

The model is controlled through the Model class defined in model.py.

Its behaviour is fully determined by the set of arguments passed to the constructor: n_angles, model_type, prediction_mode, ang_mode, loss_mode, dropout_rate, regularize_vectors.

n_angles: 2 to predict only phi and psi, 3 to predict phi, psi and omega
model_type: see Core (below)
prediction_mode: see Predictions and corresponding loss modes (below)
ang_mode: see Angularization (below)
loss_mode: see Predictions and corresponding loss modes (below)
dropout_rate: controls the dropout regularization applied to the core model
regularize_vectors: controls whether a regularization loss is applied to the vectors to keep them on the unit circle (only available in 'regression_vectors' mode)
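To make the valid combinations explicit, the arguments can be collected and checked before a model is built. ModelConfig below is a hypothetical helper, not part of model.py. Its option names are taken from this README, while its default values and validation rules are assumptions.

```python
from dataclasses import dataclass, asdict

MODEL_TYPES = {"cnn_big", "cnn_small", "bilstm"}
PREDICTION_MODES = {
    "regression_angles", "regression_vectors",
    "alphabet_angles", "alphabet_vectors",
}
ANG_MODES = {"tanh", "cos"}
LOSS_MODES = {"angular_mae", "mae"}

@dataclass
class ModelConfig:
    # Defaults are placeholders chosen for this sketch.
    n_angles: int = 2
    model_type: str = "cnn_big"
    prediction_mode: str = "regression_angles"
    ang_mode: str = "tanh"
    loss_mode: str = "angular_mae"
    dropout_rate: float = 0.0
    regularize_vectors: bool = False

    def __post_init__(self):
        if self.n_angles not in (2, 3):
            raise ValueError("n_angles must be 2 or 3")
        if self.model_type not in MODEL_TYPES:
            raise ValueError(f"unknown model_type: {self.model_type}")
        if self.prediction_mode not in PREDICTION_MODES:
            raise ValueError(f"unknown prediction_mode: {self.prediction_mode}")
        # ang_mode only matters when raw values are squashed into angles.
        if self.prediction_mode == "regression_angles" and self.ang_mode not in ANG_MODES:
            raise ValueError(f"unknown ang_mode: {self.ang_mode}")
        if self.loss_mode not in LOSS_MODES:
            raise ValueError(f"unknown loss_mode: {self.loss_mode}")
        if not 0.0 <= self.dropout_rate < 1.0:
            raise ValueError("dropout_rate must be in [0, 1)")
        if self.regularize_vectors and self.prediction_mode != "regression_vectors":
            raise ValueError("regularize_vectors requires 'regression_vectors'")

config = ModelConfig(prediction_mode="regression_vectors", regularize_vectors=True)
kwargs = asdict(config)
```

If the Model constructor accepts these keyword names, the resulting kwargs dictionary could be passed to it directly. That is an assumption to verify in model.py.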

Core

Both CNNs use a ResNet-type architecture with residual connections between layers, batch normalization after each layer, and dropout after every second layer.

The number of filters (neurons per layer) starts at 32 and doubles every 2 layers. The filter size is fixed at 5.

Modes:

cnn_big: 8 layers
cnn_small: 6 layers
bilstm: bidirectional LSTM, 1 layer, 128 neurons
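The filter schedule for the two CNN modes can be written out explicitly. This sketch assumes that "doubled every 2 layers" means layers 1 and 2 use 32 filters, layers 3 and 4 use 64, and so on. The function name is hypothetical and the actual model code may group layers differently.

```python
def filters_per_layer(n_layers, start=32, double_every=2):
    """Number of filters in each layer, doubling after every `double_every` layers."""
    return [start * 2 ** (i // double_every) for i in range(n_layers)]

cnn_big_filters = filters_per_layer(8)    # [32, 32, 64, 64, 128, 128, 256, 256]
cnn_small_filters = filters_per_layer(6)  # [32, 32, 64, 64, 128, 128]
```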

Predictions and corresponding loss modes

Modes:

regression_angles

n_angles values are predicted by a dense layer, passed through tanh or cos, and multiplied by pi to fit the radian range.

Available angularization modes: 'tanh' or 'cos'

Available loss modes: 'angular_mae' or 'mae', both are applied to angles

regression_vectors

n_angles*2 values are predicted by a dense layer (2 values per angle) and converted to angles by passing them through an atan2 function.

Available loss modes: 'angular_mae' or 'mae'. Angular mae is applied to angles, mae is applied to vectors.

alphabet_angles

n_angles values are predicted by calculating a weighted average over an alphabet of n_clusters entries, using a probability distribution that the network predicts over that alphabet.
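The weighted average can be sketched in a few lines of NumPy. The function names, array shapes and alphabet values below are assumptions for illustration. In particular, how the alphabet itself is obtained (for example by clustering) is not shown, and the model code may differ.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def alphabet_to_angles(logits, alphabet):
    """Weighted average of alphabet entries.

    logits:   (L, n_clusters) raw network outputs per residue
    alphabet: (n_clusters, n_angles) angle values of each alphabet entry
    returns:  (L, n_angles) predicted angles
    """
    probs = softmax(logits)   # probability distribution over the alphabet
    return probs @ alphabet   # expectation of the angles under that distribution

# Placeholder alphabet with 3 clusters and 2 angles (phi, psi).
alphabet = np.array([[-1.0, 2.0], [0.5, 0.0], [2.0, -2.0]])
logits = np.array([[50.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
angles = alphabet_to_angles(logits, alphabet)
```

A confident prediction (first row) reproduces a single alphabet entry, while a uniform distribution (second row) yields the plain mean of all entries. Note that averaging angle values directly ignores the wrap-around at plus or minus pi.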

Available loss modes: 'angular_mae' or 'mae', both are applied to angles

alphabet_vectors

As in alphabet_angles, but the network first predicts 2 values per angle and atan2 is then applied, as in regression_vectors.

Available loss modes: 'angular_mae' or 'mae'. Angular mae is applied to angles, mae is applied to vectors.
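The difference between the two loss modes matters most near the wrap-around at plus or minus pi. The sketch below shows one common way to define an angular MAE, by wrapping the difference back into [-pi, pi] before taking the absolute value. It is an assumption that the repository's 'angular_mae' is defined this way, so check the loss code in the model folder.

```python
import numpy as np

def mae(y_true, y_pred):
    """Plain mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def angular_mae(y_true, y_pred):
    """Mean absolute error on angles, with the difference wrapped to [-pi, pi]."""
    diff = y_true - y_pred
    wrapped = np.arctan2(np.sin(diff), np.cos(diff))
    return np.mean(np.abs(wrapped))

# Two angles that are close on the circle but far apart numerically.
y_true = np.array([np.pi - 0.1])
y_pred = np.array([-np.pi + 0.1])
plain = mae(y_true, y_pred)            # about 2*pi - 0.2
angular = angular_mae(y_true, y_pred)  # about 0.2
```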

Angularization

Depending on the prediction mode, an angularization mode might need to be specified.

regression_vectors: has angularization included in its transformations and has no options to choose from.

regression_angles: the continuous value predicted by a linear dense layer is passed through either tanh or cos, as specified by the ang_mode argument.
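Both kinds of angularization can be sketched with NumPy. The function names are hypothetical, and the ordering of the two vector components (x first, then y) is an assumption. The TensorFlow code in the model folder is the reference.

```python
import numpy as np

def angularize(raw, ang_mode="tanh"):
    """Map unbounded dense-layer outputs into the radian range (regression_angles)."""
    if ang_mode == "tanh":
        return np.pi * np.tanh(raw)
    if ang_mode == "cos":
        return np.pi * np.cos(raw)
    raise ValueError(f"unknown ang_mode: {ang_mode}")

def vectors_to_angles(vectors):
    """Convert 2D vectors to angles with atan2 (regression_vectors).

    vectors: (..., n_angles, 2) array; assumption: last axis is (x, y).
    """
    return np.arctan2(vectors[..., 1], vectors[..., 0])

raw = np.array([-100.0, 0.0, 100.0])
tanh_angles = angularize(raw, "tanh")  # saturates towards -pi and pi
cos_angles = angularize(raw, "cos")    # periodic in the raw value

vecs = np.array([[0.0, 1.0], [-1.0, 0.0], [2.0, 2.0]])
vec_angles = vectors_to_angles(vecs)   # vector length does not affect the angle
```

Because atan2 only depends on the direction of the vector, the network output does not need to lie exactly on the unit circle, which is presumably why the unit-circle regularization is optional.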
