This repository contains the analysis presented in the paper "Aligned-UMAP for longitudinal datasets analysis in biomedical research" [link to preprint]. It also contains the code for a Streamlit dashboard developed to better visualize longitudinal trajectories.
- Uses the Python implementation of Aligned-UMAP from the umap-learn package
- Explores the utility of Aligned-UMAP in multiple longitudinal biomedical datasets
- Offers insights on optimal use of the technique, such as the effect of hyperparameters
- Provides an interactive 3D visualization of trajectory plots via a Streamlit dashboard to discover hidden patterns
- Install Anaconda Distribution
- Create a virtual environment to set up python interpreter
To create and activate a virtual environment:
conda create -n AlignedUMAP python=3.8
conda install -n AlignedUMAP -c conda-forge tslearn omegaconf umap-learn # for AlignedUMAP execution
conda install -n AlignedUMAP -c conda-forge wand python-kaleido # for visualization using jupyter notebook
conda activate AlignedUMAP && pip install -r requirements.txt # for streamlit dashboard
git clone https://github.com/NIH-CARD/AlignedUMAP-BiomedicalData.git
- subject_id is the sample index and time_id is the time point at which feature values are observed
- Prepare a csv file with the following format (see input_data/example_PPMI_clinical_assessment_data.csv):
subject_id | time_id | feature 1 | feature 2 | ..... | feature N |
---|---|---|---|---|---|
S0 | T0 | 10 | 1 | 1 | 1 |
S1 | T0 | 15 | 2 | 2 | 1 |
S0 | T1 | 20 | 1 | 1 | 1 |
S1 | T1 | 40 | 2 | 2 | 1 |
- (Optional) Prepare a metadata file to color the trajectories, with the following format (see input_data/example_metadata_PPMI_clinical_assessment_data.csv):
subject_id | color_column 1 | color_column 2 |
---|---|---|
S0 | Class0 | 10 |
S1 | Class1 | 20 |
S2 | Class0 | 30 |
S3 | Class1 | 40 |
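The long format above has to be split into one feature matrix per time point before alignment, together with a mapping that says which row in one time point corresponds to which row in the next. The following is a minimal sketch of that reshaping using only the standard library. The helper names `build_slices` and `build_relations` and the toy values are hypothetical and are not taken from this repository's notebooks.

```python
import csv
import io
from collections import defaultdict

# Toy long-format data mirroring the example table (illustrative values only).
RAW = """subject_id,time_id,feature 1,feature 2
S0,T0,10,1
S1,T0,15,2
S0,T1,20,1
S1,T1,40,2
"""

def build_slices(rows):
    """Group rows by time_id into per-time subject lists and feature matrices."""
    by_time = defaultdict(list)
    for row in rows:
        features = [float(v) for k, v in row.items()
                    if k not in ("subject_id", "time_id")]
        by_time[row["time_id"]].append((row["subject_id"], features))
    # Assumption: time ids sort chronologically as strings (T10 would sort before T2).
    time_ids = sorted(by_time)
    subjects = [[s for s, _ in by_time[t]] for t in time_ids]
    matrices = [[f for _, f in by_time[t]] for t in time_ids]
    return time_ids, subjects, matrices

def build_relations(subjects):
    """Map row index in slice i to row index in slice i+1 for shared subjects."""
    relations = []
    for current, following in zip(subjects[:-1], subjects[1:]):
        position = {s: j for j, s in enumerate(following)}
        relations.append({i: position[s]
                          for i, s in enumerate(current) if s in position})
    return relations

rows = list(csv.DictReader(io.StringIO(RAW)))
time_ids, subjects, matrices = build_slices(rows)
relations = build_relations(subjects)
print(time_ids, relations)
```

Converted to NumPy arrays, `matrices` and `relations` have the structure that umap-learn's `AlignedUMAP` takes in `fit(matrices, relations=relations)`: a list of datasets plus one dictionary per consecutive pair.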
Update the configs/alignedUMAP_configuration.yaml file with the data paths and required arguments:
variable name | default | description |
---|---|---|
data_dir | "input_data" | Path to directory where input csv files are stored |
result_dir | "results_data" | Path to directory where aligned umap output will be stored |
cache_dir | "cache_data" | Path to directory where embeddings will be stored (useful for rerun) |
dataset_name | "example_PPMI_clinical_assessment_data.csv" | Input csv file name located in data_dir |
metadata_name | "example_metadata_PPMI_clinical_assessment_data.csv" | Input metadata file name located in data_dir (leave "" if no metadata available) |
perform_interpolation | 1 | Perform interpolation longitudinally (in case of missing values) |
num_cores | -1 | Number of cores to use (-1 corresponds to all cores in machine) |
sample_fraction | 1 | Fraction of samples to use for AlignedUMAP (useful when the sample count is very large) |
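metric | ["euclidean", "cosine"] | AlignedUMAP hyperparameter 1 (distance metric) |
alignment_regularisation | ["0.003", "0.030"] | AlignedUMAP hyperparameter 2 (strength of alignment between embeddings) |
alignment_window_size | ["2", "3"] | AlignedUMAP hyperparameter 3 (number of neighboring time slices used for alignment) |
n_neighbors | ["03", "05", "10"] | AlignedUMAP hyperparameter 4 (size of the local neighborhood) |
min_dist | ["0.01", "0.10"] | AlignedUMAP hyperparameter 5 (minimum distance between embedded points) |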
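The `perform_interpolation` option fills missing values along each subject's time axis. The exact method used by the notebook is not described here, so the following is only a sketch of one common choice, linear interpolation over equally spaced time points. The function name `interpolate_linear` is hypothetical.

```python
def interpolate_linear(values):
    """Fill None entries for one subject and one feature across time points.

    Interior gaps are filled linearly between the nearest observed values.
    Leading and trailing gaps copy the nearest observed value.
    Assumption: time points are equally spaced, so list indices act as time.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("at least one observed value is required")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled

# One feature of one subject observed at T0 and T2, missing at T1 and T3.
series = [10.0, None, 20.0, None]
print(interpolate_linear(series))  # [10.0, 15.0, 20.0, 20.0]
```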
Use the Jupyter notebook apply_alignedUMAP.ipynb to run AlignedUMAP with all combinations of the hyperparameters listed in the configuration file.
conda activate AlignedUMAP
jupyter lab
# then run apply_alignedUMAP.ipynb
The notebook saves its output files in the result_dir directory, following the paths given in the configuration file.
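To get a sense of how many runs the default configuration produces, the hyperparameter lists can be expanded into a full grid. This is a small sketch using the default values from the configuration table. The helper `expand_grid` is hypothetical and does not reproduce the notebook's own code.

```python
from itertools import product

# Default hyperparameter lists, copied from the configuration table.
grid = {
    "metric": ["euclidean", "cosine"],
    "alignment_regularisation": ["0.003", "0.030"],
    "alignment_window_size": ["2", "3"],
    "n_neighbors": ["03", "05", "10"],
    "min_dist": ["0.01", "0.10"],
}

def expand_grid(grid):
    """Return one dict per combination of the listed hyperparameter values."""
    names = list(grid)
    return [dict(zip(names, combo))
            for combo in product(*(grid[name] for name in names))]

combos = expand_grid(grid)
print(len(combos))  # 48 = 2 * 2 * 2 * 3 * 2
print(combos[0])
```

The values are kept as strings, as in the configuration file. They would need to be cast to `float` or `int` before being passed to a model.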
To visualize longitudinal trajectory plots, use either the Streamlit dashboard or the Jupyter notebook.
Option 1: Using the Streamlit dashboard
To view an interactive version of the 3D plot, run the following commands in bash.
conda activate AlignedUMAP
streamlit run streamlit_app_local.py -- configs/alignedUMAP_configuration.yaml
The dashboard should look like the following image in the browser (at localhost:8501).
Option 2: Using the Jupyter notebook
Use the Jupyter notebook visualize_trajectories.ipynb to view non-interactive trajectory plots for different hyperparameters. It also saves the plots as PDF files in the parameter_views_pdf directory.
conda activate AlignedUMAP
jupyter lab
# then run visualize_trajectories.ipynb
The following table lists the execution time of AlignedUMAP on different datasets. For more details, refer to "Execution Time" in the "Points to remember" section and Fig. 4 of the paper.
Dataset | # samples | # features | # time points | time (one hyperparameter combination) |
---|---|---|---|---|
PPMI clinical | 476 | 122 | 6 | ~10 seconds |
lung scRNA | 10111 | 21767 | 7 | ~1000 seconds |
MIMIC-III | 36675 | 64 | 6 | ~750 seconds |
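As a rough guide, the per-run times above can be multiplied by the number of hyperparameter combinations to estimate the cost of a full sweep. The sketch below assumes the default configuration lists (2, 2, 2, 3 and 2 values) and that runs execute one after another. Runs parallelized via `num_cores` would finish sooner, so treat the result as an upper bound.

```python
from math import prod

# Approximate per-run times in seconds, taken from the table above.
per_run_seconds = {"PPMI clinical": 10, "lung scRNA": 1000, "MIMIC-III": 750}

# Number of values per hyperparameter in the default configuration (assumed unchanged).
grid_sizes = [2, 2, 2, 3, 2]
n_runs = prod(grid_sizes)

for name, seconds in per_run_seconds.items():
    total_hours = n_runs * seconds / 3600
    print(f"{name}: {n_runs} runs, about {total_hours:.1f} h if run sequentially")
```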