Protein Structure Transformer

This repository implements the Protein Structure Transformer (PST). PST endows the pretrained protein sequence model ESM-2 with structural knowledge, so that it can be used to extract representations of protein structures. Full details of PST can be found in the paper.

Citation

Please use the following to cite our work:

@misc{chen2024endowing,
	title={Endowing Protein Language Models with Structural Knowledge}, 
	author={Dexiong Chen and Philip Hartout and Paolo Pellizzoni and Carlos Oliver and Karsten Borgwardt},
	year={2024},
	eprint={2401.14819},
	archivePrefix={arXiv},
	primaryClass={q-bio.QM}
}

Overview of PST

PST uses a structure extractor to incorporate protein structures into existing pretrained protein language models (PLMs) such as ESM-2. The structure extractor uses a GNN to extract subgraph representations of the 8Å-neighborhood protein structure graph at each residue (i.e., at each node of the graph). The resulting residue-level subgraph representations are then added to the $Q$, $K$ and $V$ matrices of each self-attention block of a transformer model (here ESM-2) pretrained on larger corpora of sequences. We name the resulting model PST. It can be trained on any protein structure dataset, by updating either the full model weights or only the weights of the structure extractor. The structure pretraining dataset can be much smaller than the dataset used to pretrain the base sequence model, e.g., SwissProt with only 542K protein structures.
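As a rough illustration of the mechanism described above, the following NumPy sketch builds an 8Å radius graph over toy residue coordinates, replaces the GNN with a plain neighbor average, and adds projected structure features to the queries, keys and values of a single attention head. The function names, the use of one coordinate per residue, the mean-aggregation stand-in and the separate structure projection matrices are all assumptions made for illustration; this is not the implementation in this repository.

```python
import numpy as np


def radius_graph(coords, cutoff=8.0):
    """Boolean adjacency matrix linking residues closer than `cutoff` (in Å)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adj = dist < cutoff
    np.fill_diagonal(adj, False)  # drop self-loops
    return adj


def structure_features(x, adj):
    """Hypothetical stand-in for the GNN: average the neighbors' embeddings."""
    degree = np.maximum(adj.sum(axis=1, keepdims=True), 1)  # avoid division by zero
    return (adj.astype(x.dtype) @ x) / degree


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def structure_aware_attention(x, s, w_seq, w_struct):
    """Single-head self-attention with structure terms added to Q, K and V."""
    q = x @ w_seq["q"] + s @ w_struct["q"]
    k = x @ w_seq["k"] + s @ w_struct["k"]
    v = x @ w_seq["v"] + s @ w_struct["v"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v


rng = np.random.default_rng(0)
n_res, dim = 5, 4
coords = rng.normal(scale=5.0, size=(n_res, 3))  # toy residue coordinates in Å
x = rng.normal(size=(n_res, dim))                # toy residue embeddings
adj = radius_graph(coords)
s = structure_features(x, adj)
w_seq = {name: rng.normal(size=(dim, dim)) for name in "qkv"}
w_struct = {name: rng.normal(size=(dim, dim)) for name in "qkv"}
out = structure_aware_attention(x, s, w_seq, w_struct)
print(out.shape)  # (5, 4)
```

In this sketch, setting the structure projections to zero recovers ordinary self-attention, which is one way to see how a structure term can be attached to a pretrained sequence model without changing its sequence pathway.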

Below you can find an overview of PST with ESM-2 as the sequence backbone. The ESM-2 model weights were frozen during the training of the structure extractor. The structure extractor was trained on AlphaFold SwissProt, a dataset of 542K proteins with predicted structures. The resulting PST model can then be finetuned on a downstream task, e.g., torchdrug or proteinshake tasks. PST can also be used to simply extract representations of protein structures.

[Figure: Overview of PST]

Pretrained models

| Model name | Sequence model | #Layers | Embed dim | Notes | Model URL |
| --- | --- | --- | --- | --- | --- |
| pst_t6 | esm2_t6_8M_UR50D | 6 | 320 | Standard | link |
| pst_t6_so | esm2_t6_8M_UR50D | 6 | 320 | Train struct only | link |
| pst_t12 | esm2_t12_35M_UR50D | 12 | 480 | Standard | link |
| pst_t12_so | esm2_t12_35M_UR50D | 12 | 480 | Train struct only | link |
| pst_t30 | esm2_t30_150M_UR50D | 30 | 640 | Standard | link |
| pst_t30_so | esm2_t30_150M_UR50D | 30 | 640 | Train struct only | link |
| pst_t33 | esm2_t33_650M_UR50D | 33 | 1280 | Standard | link |
| pst_t33_so | esm2_t33_650M_UR50D | 33 | 1280 | Train struct only | link |
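When sizing a downstream classifier, it can help to look up the embedding dimension of a given checkpoint programmatically. The sketch below simply transcribes the table into a dictionary; `PSTModelInfo`, `BASE_MODELS` and `describe` are hypothetical names introduced here and are not part of the repository's API.

```python
from typing import NamedTuple


class PSTModelInfo(NamedTuple):
    sequence_model: str
    num_layers: int
    embed_dim: int
    struct_only: bool


# Values copied from the table of pretrained models.
BASE_MODELS = {
    "pst_t6": ("esm2_t6_8M_UR50D", 6, 320),
    "pst_t12": ("esm2_t12_35M_UR50D", 12, 480),
    "pst_t30": ("esm2_t30_150M_UR50D", 30, 640),
    "pst_t33": ("esm2_t33_650M_UR50D", 33, 1280),
}


def describe(model_name):
    """Return the table row for a model name such as 'pst_t33_so'."""
    struct_only = model_name.endswith("_so")  # "_so" marks "train struct only"
    base = model_name[:-3] if struct_only else model_name
    sequence_model, num_layers, embed_dim = BASE_MODELS[base]
    return PSTModelInfo(sequence_model, num_layers, embed_dim, struct_only)


info = describe("pst_t33_so")
print(info.embed_dim, info.struct_only)  # 1280 True
```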

Usage

Installation

The dependencies are managed by mamba or conda:

mamba env create -f environment.yaml 
mamba activate pst
pip install -e .

Optionally, you can install the following dependencies to run the experiments:

pip install torchdrug

Quick start: extract representations of protein structures using PST

You can use PST to extract representations of protein structures stored in PDB files. To see the available options, run:

python scripts/pst_extract.py --help

If you want to work with your own dataset, create a my_dataset directory in scripts, put all the PDB files into my_dataset/raw/, and run:

python scripts/pst_extract.py --datadir ./scripts/my_dataset --model pst_t33_so --include_seq
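The directory layout and command above can also be prepared from Python. The following is a minimal sketch using only the standard library; `prepare_dataset` and `extraction_command` are hypothetical helpers, and the only flags used are the ones shown in the command above.

```python
import shutil
import sys
from pathlib import Path


def prepare_dataset(pdb_files, dataset_dir):
    """Copy PDB files into <dataset_dir>/raw/, creating the directory if needed."""
    raw_dir = Path(dataset_dir) / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)
    for pdb in pdb_files:
        shutil.copy2(pdb, raw_dir / Path(pdb).name)
    return raw_dir


def extraction_command(dataset_dir, model="pst_t33_so", include_seq=True):
    """Build the argument list for scripts/pst_extract.py without running it."""
    cmd = [
        sys.executable, "scripts/pst_extract.py",
        "--datadir", str(dataset_dir),
        "--model", model,
    ]
    if include_seq:
        cmd.append("--include_seq")
    return cmd


cmd = extraction_command("./scripts/my_dataset")
print(" ".join(cmd[1:]))
```

From the repository root, the resulting list could be passed to `subprocess.run(cmd, check=True)` inside the activated `pst` environment.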

Use PST for protein function prediction

You can use PST for Gene Ontology (GO) prediction, Enzyme Commission (EC) number prediction, and other protein function prediction tasks.

Fixed representations

To train an MLP on top of the representations extracted by the pretrained PST models for Enzyme Commission prediction, run:

python experiments/fixed/predict_gearnet.py dataset=gearnet_ec # dataset=gearnet_go_bp, gearnet_go_cc or gearnet_go_mf for GO prediction

Finetune PST

To finetune the PST model for function prediction tasks, run:

python experiments/finetune/finetune_gearnet.py dataset=gearnet_ec # dataset=gearnet_go_bp, gearnet_go_cc or gearnet_go_mf for GO prediction
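To run either script over all four datasets named in the comments above, the commands can be generated in a loop. This is a small sketch under the assumption that each dataset is run as a separate invocation; `function_prediction_commands` is a hypothetical helper, while the script paths and `dataset=` values are taken from the commands above.

```python
import sys

DATASETS = ["gearnet_ec", "gearnet_go_bp", "gearnet_go_cc", "gearnet_go_mf"]
SCRIPTS = {
    "fixed": "experiments/fixed/predict_gearnet.py",
    "finetune": "experiments/finetune/finetune_gearnet.py",
}


def function_prediction_commands(mode):
    """Return one argument list per dataset for the chosen mode."""
    script = SCRIPTS[mode]
    return [[sys.executable, script, f"dataset={name}"] for name in DATASETS]


for cmd in function_prediction_commands("fixed"):
    print(" ".join(cmd[1:]))
```

Each list can be handed to `subprocess.run(cmd, check=True)` from the repository root.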

Pretrain PST on AlphaFold SwissProt

Run the following command to train a PST model based on the 6-layer ESM-2 model, training only the structure extractor:

python train_pst.py base_model=esm2_t6 model.train_struct_only=true

You can replace esm2_t6 with esm2_t12, esm2_t30, esm2_t33 or any pretrained ESM-2 model.

Reproducibility datasets

For our VEP datasets, we folded the structures that were not available in the PDB. You can download the dataset from here and unzip it in ./datasets, assuming your current directory is the root of this repository. Similarly, download the SCOP dataset here.
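Once an archive has been downloaded manually, the unzip step can be scripted with the standard library. The sketch below assumes the download is a regular zip file; `unzip_into_datasets` and the archive path are hypothetical, and the download links themselves are not reproduced here.

```python
import zipfile
from pathlib import Path


def unzip_into_datasets(archive_path, repo_root="."):
    """Extract a downloaded zip archive into <repo_root>/datasets."""
    target = Path(repo_root) / "datasets"
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())
```

For example, `unzip_into_datasets("path/to/downloaded.zip")` run from the repository root would place the archive's contents under ./datasets.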