DeepPhylo: Phylogeny‐Aware Microbial Embeddings Enhanced Predictive Accuracy in Human Microbiome Data Analysis

DeepPhylo is a method that employs phylogeny-aware amplicon embeddings to integrate abundance and phylogenetic information, thereby improving both the unsupervised discriminatory power and the supervised predictive accuracy of microbiome data analysis.

Compared to the existing methods, DeepPhylo demonstrated superiority in informing biologically relevant insights across five real-world microbiome use cases, including clustering of skin microbiomes, prediction of host chronological age and gender, diagnosis of inflammatory bowel disease (IBD) across 15 studies, and multilabel disease classification.

This repository contains the scripts used to train the DeepPhylo model and the scripts used to evaluate its performance.

Dependencies

The code was developed and tested using Python 3.9.
To install the Python dependencies, run: pip install -r requirements.txt

Installation

Clone the public repository and install the dependencies:

# clone project
git clone https://github.com/CNwangbin/DeepPhylo
cd DeepPhylo
# install dependencies
pip install -r requirements.txt

Once you have a copy of the source, you can install it with:

python setup.py install

Data file descriptions

txt: A txt file stores the metadata for the skin_clustering analysis. This plain-text file may include various attributes or classifications of the samples and serves as input for the clustering algorithms.

sample_name,ac_sampled_room,accult_score,analysis_name,animals_in_house
10333.BLANK.1.1A,not applicable,not applicable,BLANK1,not applicable

npy: An npy file is the result of converting a biom file. It contains numerical data representing the abundance of each feature in the dataset, stored in a format optimized for fast loading and processing with NumPy. Its values are normalized absolute abundances.

microbe1	microbe2	microbe3
0.26	0.15	0.01
0.1	0.2	0.24
0.3	0.05	0.04

biom: A biom file represents the absolute abundance of OTUs (Operational Taxonomic Units) obtained from 16S sequencing. This format is widely used in microbial ecology for storing abundance data along with sample and feature metadata.

#OTU ID	Sample1	Sample2	...
microbe1	120	52	...
microbe2	32	168	...
...	...	...	...

qza: A qza file contains phylogenetic tree information, typically used in bioinformatics workflows. This file is a standard format in the QIIME 2 ecosystem, encapsulating data and metadata in a compressed archive.

QZA File Name	Description	Data Type	Usage in Analysis
`taxonomy.qza`	Contains the taxonomic assignments for each feature in the dataset. This file maps the features (e.g., OTUs, ASVs) to their respective taxonomic classifications, such as phylum, class, order, family, genus, and species.	Taxonomic Data	Used for generating taxonomic bar plots, summarizing the composition of microbial communities, and performing differential abundance analysis.

pth: A pth file stores the best model parameters. This file is used in deep learning frameworks like PyTorch to save and load model states, allowing for model checkpoints and fine-tuning.

File Name	Description	Data Type	Usage in Analysis
`/home/syl/DeepPhylo/data/age_regression/best_model.pth`	Stores the best model parameters learned during training. This file contains weights, biases, and other parameters that define the state of a deep learning model at a particular checkpoint.	Trained model Weights	Used for loading the trained model to make predictions, continue training, or perform fine-tuning on new data.

Data convert

The conversion takes the original biom abundance table as input and writes it out as an npy file. Pass the path to the biom file with --input_file and the destination directory with --output_dir:


python deepphylo/data_convert.py --input_file="/home/syl/DeepPhylo/data/skin_clustering/urbmerged.biom" --output_dir=/home/syl/DeepPhylo/data/output_data
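
As a rough illustration of what this conversion produces, the sketch below normalizes a small count table to relative abundances and stores it as an npy file. It is not the code in deepphylo/data_convert.py: the total-sum scaling, the samples-as-rows layout, and the file name abundance.npy are assumptions made for the example, and the counts are the illustrative values from the biom table above.

```python
import os
import tempfile

import numpy as np

# Toy count table laid out as in the biom example: rows are microbes,
# columns are samples (Sample1, Sample2).
counts = np.array([[120, 52],
                   [32, 168]], dtype=float)

# Assumption: samples become rows, and each sample is scaled by its total
# count (total-sum scaling). The real script may normalize differently.
per_sample = counts.T
relative = per_sample / per_sample.sum(axis=1, keepdims=True)

# Save to npy and load it back; "abundance.npy" is a hypothetical name.
out_path = os.path.join(tempfile.mkdtemp(), "abundance.npy")
np.save(out_path, relative)
loaded = np.load(out_path)

print(loaded.shape)        # (n_samples, n_microbes)
print(loaded.sum(axis=1))  # each row sums to 1 under this scaling
```

Loading the file with np.load returns the same array, which is the form the training and inference scripts expect as input.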

Model Parameter

Parameter	Value	Description
`-hs`	16	Hidden size: Number of units in the hidden layers of the neural network. Determines the capacity of the model to learn complex patterns.
`-kec`	7	Kernel size: Size of the convolutional kernel used in the model, defining the receptive field of the network's filters.
`-l`	1e-4	Learning rate: Controls the step size at each iteration while moving toward a minimum of the loss function. A smaller value makes the learning process slower and more precise.
`-bs`	64	Batch size: Number of training examples utilized in one iteration. A larger batch size generally leads to more stable gradient updates.
`-kep`	2	Kernel pool size: The size of the window to take a max over. This parameter defines the length of the 1D pooling window in the `nn.MaxPool1d` layer.
`-d`	0.2	Dropout rate: Fraction of the input units to drop during training, used as a regularization method to prevent overfitting.
`-act`	'sigmoid'	Activation function: Type of activation function applied to the network's output, with 'sigmoid' used for binary classification problems.
`-test_X`	'data/gender_classification/X_test.npy'	Input data for the model; must be an `npy` file.
`-test_Y`	'data/gender_classification/Y_test.npy'	Labels for the input data; must be an `npy` file.
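
For orientation, the flags in the table could be declared with argparse roughly as follows. This is a sketch rather than the repository's parser: the flag names and default values are taken from the table, while the types, the help strings, and the build_parser name are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the table above (not the repo's code)."""
    p = argparse.ArgumentParser(description="DeepPhylo-style options (sketch)")
    p.add_argument("-hs", type=int, default=16, help="hidden size")
    p.add_argument("-kec", type=int, default=7, help="convolution kernel size")
    p.add_argument("-l", type=float, default=1e-4, help="learning rate")
    p.add_argument("-bs", type=int, default=64, help="batch size")
    p.add_argument("-kep", type=int, default=2, help="max-pooling window size")
    p.add_argument("-d", type=float, default=0.2, help="dropout rate")
    p.add_argument("-act", type=str, default="sigmoid", help="activation")
    p.add_argument("-test_X", type=str,
                   default="data/gender_classification/X_test.npy")
    p.add_argument("-test_Y", type=str,
                   default="data/gender_classification/Y_test.npy")
    return p


# Override a few values and leave the rest at the table defaults.
args = build_parser().parse_args(["-hs", "80", "-kec", "3", "-act", "relu"])
print(args.hs, args.kec, args.l, args.act)
```

Any flag that is not passed on the command line keeps the default listed in the table.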

The model takes inputs and labels in npy format, and its output depends on the selected task. For binary classification, it reports evaluation metrics computed between the predicted and true values, such as accuracy (acc), area under the precision-recall curve (aupr), and area under the ROC curve (roc_auc). For regression, it reports R2_Score.
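
To make the classification metrics concrete, here is a small pure-Python sketch of accuracy and ROC AUC on toy values. It illustrates the definitions only: the function names, the 0.5 decision threshold, and the data are assumptions, and the repository scripts may compute these metrics with a library instead.

```python
def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)


def roc_auc(y_true, y_score):
    """ROC AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (illustrative values only).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print(accuracy(y_true, y_score))  # 0.75
print(roc_auc(y_true, y_score))   # 0.75
```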

When training a model on your own data, the hidden size (-hs) and learning rate (-l) are the most important choices, so focus on tuning these two first. In addition, kec and kep set the convolution kernel size and the pooling window size, respectively. Adjust them to the actual input; odd values are generally the better choice.
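
The effect of kec and kep on the feature length can be checked with simple arithmetic before training. The sketch below uses the standard output-length formulas for a 1D convolution and for nn.MaxPool1d with its default stride (equal to the window size). Whether DeepPhylo pads its convolutions is not stated here, so the padding argument, the function names, and the input length of 100 are assumptions. It also shows one common reason odd kernel sizes are convenient: symmetric padding of (kec - 1) / 2 keeps the length unchanged.

```python
def conv1d_out_len(length, kernel_size, padding=0, stride=1):
    """Output length of a 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def maxpool1d_out_len(length, pool_size):
    """Output length of 1D max pooling with stride equal to the window size."""
    return (length - pool_size) // pool_size + 1


def feature_len(length, kec, kep, same_padding=False):
    """Length after one convolution followed by one max-pooling layer."""
    padding = (kec - 1) // 2 if same_padding else 0
    return maxpool1d_out_len(conv1d_out_len(length, kec, padding), kep)


# Hypothetical input length of 100 with the table defaults kec=7, kep=2.
print(feature_len(100, kec=7, kep=2))                     # 47
print(feature_len(100, kec=7, kep=2, same_padding=True))  # 50
```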

Running

Download all the data files and place them in the data folder.

Scripts

The following scripts run the model.

To train and evaluate a DeepPhylo model for gender prediction, run:

python deepphylo_classification.py --epochs 500 -hs 80 -kec 3 -l 0.0001 -bs 32 -kep 7 -act relu

To test a DeepPhylo model for gender prediction, run:

python deepphylo_classification_inference.py -test_X 'data/gender_classification/X_test.npy' -test_Y 'data/gender_classification/Y_test.npy'  -hs 80 -kec 3 -l 0.0001 -bs 32 -kep 7 -act relu

To train and evaluate a DeepPhylo model for age prediction, run:

python deepphylo_regression.py --epochs 500 -hs 40 -kec 7 -l 0.0002 -bs 8 -kep 7 -act tanh

To test a DeepPhylo model for age prediction, run:

python deepphylo_regression_inference.py -test_X 'data/age_regression/X_test.npy' -test_Y 'data/age_regression/Y_test.npy' -hs 40 -kec 7 -l 0.0002 -bs 8 -kep 7 -act tanh

To train and evaluate a DeepPhylo model for microbiome-based IBD diagnosis, run:

python deepphylo_ibd_diagnosis.py -hs 16 -kec 7 -l 1e-4 -bs 64 -kep 2 -act sigmoid -d 0.2

To test a DeepPhylo model for microbiome-based IBD diagnosis, run:

python deepphylo_ibd_diagnosis_inference.py  -hs 16 -kec 7 -l 1e-4 -bs 64 -kep 2 -d 0.2 -p 0.0 -act 'sigmoid'

To train and evaluate a DeepPhylo model on multi_label data, run:

python deepphylo_classification_multi_label.py -hs 32 -kec 5 -l 5e-2 -bs 8 -kep 1 -act 'relu'

To test a DeepPhylo model on multi_label data, run:

python deepphylo_classification_multi_label_inference.py -hs 32 -kec 5 -l 5e-2 -bs 8 -kep 1 -act 'relu'

mysystem_rpca.ipynb: Jupyter notebook that runs the unsupervised method on skin microbiome samples.

Example

This example demonstrates the entire training and testing workflow for a regression task. The input is a normalized npy file that has already been processed, and the labels have also been preprocessed.

Train the model

Train the model with the corresponding deepphylo_*.py script, setting the parameters as follows:

python deepphylo_regression.py  \
            --epochs 500 \
            -hs 40 \
            -kec 7 \
            -l 0.0002 \
            -bs 8 \
            -kep 7 \
            -act tanh

During training, the script uses both the training and validation sets to search for the best parameters. If the validation loss does not decrease for 5 consecutive epochs, training stops. This early stopping strategy helps avoid overfitting; the best model parameters are then saved.
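
The early stopping rule described above can be sketched independently of any framework. The function below only replays a list of validation losses; the function name and the loss values are made up for illustration, and the patience of 5 comes from the description above. In a real training loop, the model parameters would be saved whenever the best loss improves.

```python
def early_stopping(val_losses, patience=5):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve on the best value
    for `patience` consecutive epochs. Epochs are counted from 0.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: this is where a checkpoint would be written.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Ran out of epochs before the patience was exhausted.
    return best_epoch, len(val_losses) - 1


# Illustrative losses: the best value is at epoch 2, then five worse epochs.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.74, 0.73, 0.9]
print(early_stopping(losses))  # (2, 7)
```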

Test the model

Run inference with the corresponding deepphylo_*_inference.py script, setting the parameters as follows:

python deepphylo_regression_inference.py \
            -test_X 'data/age_regression/X_test.npy' \
            -test_Y 'data/age_regression/Y_test.npy' \
            -hs 40 \
            -kec 7 \
            -l 0.0002 \
            -bs 8 \
            -kep 7 \
            -act tanh

During testing, only the test-set inputs and labels are required. The script loads the best model parameters, outputs the predicted values, calculates R2_Score, and prints the results to the command line.
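
For reference, the coefficient of determination follows the usual definition: 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it in plain Python on toy values; the function name and the numbers are illustrative assumptions, not output from the model.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up ages and predictions, for illustration only.
y_true = [20.0, 30.0, 40.0, 50.0]
y_pred = [22.0, 28.0, 41.0, 49.0]

print(round(r2_score(y_true, y_pred), 2))  # 0.98
```

A value of 1 means the predictions match the labels exactly, and a value of 0 means they do no better than always predicting the mean label.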

Citation

If you use DeepPhylo for your research, or incorporate our learning algorithms in your work, please cite:

@article{wang2024deepphylo,
  title={DeepPhylo: Phylogeny-Aware Microbial Embeddings Enhanced Predictive Accuracy in Human Microbiome Data Analysis},
  author={Wang, Bin and Shen, Yulong and Fang, Jingyan and Su, Xiaoquan and Xu, Zhenjiang Zech},
  journal={Advanced Science},
  volume={11},
  number={45},
  pages={2404277},
  year={2024},
  publisher={Wiley Online Library}
}

New version specifications

The current dependencies are listed in requirements.txt. The Python version used is 3.9.12.
