-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
490 lines (364 loc) · 22.4 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
---
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, fig.width = 7, fig.height = 7, cache = FALSE)
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
```
```{r load_libraries, include = FALSE, message = FALSE, warning = FALSE}
library(Seurat)
library(MOFA2)
library(MOFAcellulaR)
library(qs)
library(dplyr)
library(tidyverse)
library(magrittr)
library(edgeR)
library(DESeq2)
library(nichenetr)
library(ggplot2)
library(ggprism)
library(ggpmisc)
library(knitr)
source("utils/mofa_analysis_functions.R")
source("utils/plotting_functions.R")
set.seed(43648)
```
```{r import_data, include = FALSE, message = FALSE, warning = FALSE}
seurat <- qread("data/processed/lai_preprocessed.qs")
mofa_model <- qread("models/mofa_model1.qs")
sample_meta <- read.csv("data/mofa_inputs/sample_metadata.csv")
factor_score_perez <- read.csv("data/processed/factor_score_perez.csv", row.names = 1)
perez_bulk_meta <- read.csv("data/processed/perez_bulk_metadata.csv", row.names = 1)
```
```{r optional_memory_reduction, include = FALSE, message = FALSE, warning = FALSE}
seurat@assays[["RNA"]]@layers[["counts"]] <- NULL
seurat@assays[["RNA"]]@layers[["scale.data"]] <- NULL
gc()
```
```{r parameters, include = FALSE, message = FALSE, warning = FALSE}
factor_cutoff_score <- 0.3
test_variable <- "age_bin"
categorical_test_var <- c("age_bin","tech","sex")
continous_test_var <- c("age_numeric")
sample_id_column <- "sample"
aging_factor <- select_important_factor(mofa_model, sample_meta, sample_id_column, test_variable)
```
```{r subset_seurat, include = FALSE, message = FALSE, warning = FALSE}
cell_types <- unique(mofa_model@features_metadata[["view"]])
seurat_subset <- seurat[, seurat$annotation_level1 %in% cell_types]
```
# Introduction
A common method to identify potential therapeutic targets in transcriptomic data is to use differential expression (DE) to identify highly differentially expressed genes between two conditions. When it comes to single-cell/single-nucleus datasets, DE is great at identifying transcriptomic differences between cell types (a method commonly used to identify marker genes). ScRNAseq is also very useful for identifying differences in cell type quantities between samples/conditions. However, DE is often not ideal for identifying within-cluster, between-sample/condition transcriptomic changes in scRNAseq datasets (e.g., what are the transcriptomic differences between muscle endothelial cells from young and old individuals). This is because batch effects, noise, variation in the number of cells per sample, and variation in the number of samples/donors per condition are often quite large in scRNAseq, leading to in inaccurate results from DE.
Cells can have both cell-type specific and shared responses to the same cue, and well as feedback loops that occur through cell-cell interactions and between-organ signals (e.g. hormones). These multicellular changes are sometimes referred to as "multicellular programs". Between-sample DE looks at one cell-type at a time and does not consider these multicellular programs, which can result in "missing the forest for the trees". This is especially true for aging, which clearly impacts every celltype in the body in numerous ways.
We need more sophisticated methods that are able to extract important signals from noise and identify both celltype-specific and multicellular program changes. One such approach is to use Blind Source Separation techniques, such as PCA, ICA, and NMF. However, when applied directly to scRNAseq data, these methods will mainly identify differences between cell types. This limitation can be overcome by taking our genes x cells matrix, and turn it into a pseudobulk celltype x genes x samples tensor. We can then decompose the tensor with blind source separation to pull out the important biological signals.
Here, we analyze single-nucleus RNAseq data from Lai et al. 2024 (https://doi.org/10.1038/s41586-024-07348-6) using MOFAcell (https://doi.org/10.7554/eLife.93161) -- a factor analysis approach to single cell data. Ideally, this method will identify a factor that is highly correlated with age. That "Aging Factor" should then contain a scored list of how much each gene contributes to aging in each cell type, independent of technical noise or other biological factors.
# Results
## Dataset overview
This dataset contains muscle biopsies from n = 23 donors. Those donors were divided into two age groups: a younger population (15-46 years old, n = 8), and an older population (74-99 years old, n = 15). Of these donors, 6 were sequenced on both scRNAseq & snRNAseq, 16 were sequenced on snRNAseq only, and one was sequenced on scRNAseq only.
After alignment, filtering, batch correction, and annotation, the final dataset consisted of 79649 cells and 212774 nuclei.
```{r umap_overview, echo = FALSE, message = FALSE, warning = FALSE, fig.width = 12, fig.height = 7}
DimPlot(seurat, group.by = c("annotation_level1", "tech" ,"age_group", "sex", "age_bin"), label = FALSE)
```
```{r umap_qc, echo = FALSE, message = FALSE, warning = FALSE, fig.width = 6, fig.height = 6}
FeaturePlot(seurat, min.cutoff = "q1", max.cutoff = "q99",
features = c("nCount_RNA","nFeature_RNA","mitoRatio","riboRatio"))
```
In order to run our approach effectively, we removed celltypes with a small number of cells, leaving only the 10 major cell types in the dataset.
```{r umap_celltypes_subset, echo = FALSE, message = FALSE, warning = FALSE, fig.width = 5, fig.height = 5}
DimPlot(seurat_subset, group.by = c("annotation_level1"), label = TRUE)
```
## Identification of Aging Factor
```{r factor_prep1, echo = FALSE, message = FALSE, warning = FALSE}
factor_loadings <- get_geneweights(model = mofa_model, factor = aging_factor)
assoc_list <- get_associations_list(mofa_model,
sample_meta,
sample_id_column,
categorical_test_var = categorical_test_var,
continuous_test_var = continous_test_var,
categorical_type = "not_parametric")
all_factors <- MOFAcellulaR::get_tidy_factors(
model = mofa_model,
metadata = sample_meta,
factor = "all",
sample_id_column = sample_id_column
)
factor_score_mat <- do.call(rbind,get_factors(mofa_model, factors = "all", groups = "all"))
```
We construct the data tensor, then decompose it into 7 factors using MOFAcell. We see that Factor5 is strongly correlated with aging.
```{r plot_heatmap1, echo = FALSE, message = FALSE, warning = FALSE, fig.width = 9, fig.height = 7}
plot_MOFA_hmap(
model = mofa_model,
group = FALSE,
metadata = sample_meta,
sample_id_column = sample_id_column,
sample_anns = c("age_bin", "age_numeric", "sex", "tech"),
assoc_list = assoc_list
)
```
```{r aging_pval_table, echo = FALSE, message = FALSE, warning = FALSE, fig.width = 9, fig.height = 7}
MOFAcellulaR::get_associations(mofa_model, sample_meta, sample_id_column, "age_bin", categorical_type = "parametric") %>%
kable()
```
This factor is able to separate old from young samples even across different sample preparation methodologies.
```{r group_boxplot_Factor5, echo = FALSE, message = FALSE, warning = FALSE, fig.width = 5, fig.height = 5}
group_boxplot <- create_factor_expr_group_boxplot(
all_factors,
aging_factor,
xvar = "tech",
yvar = "value",
fillvar = "age_bin"
)
plot(group_boxplot)
```
This is particularly impressive because there are large differences in celltypes between the two methodologies, so this approach is able to pull system-wide signals out of complex datasets. Additionally, this approach is completely unsupervised -- at no point was the algorithm informed of sample classifications.
```{r umap_celltype_split, echo = FALSE, message = FALSE, warning = FALSE, fig.width = 9, fig.height = 5}
DimPlot(seurat_subset, group.by = c("annotation_level1"), split.by = "tech", label = TRUE)
```
Other Factors are strongly correlated with technical batch effects or artifacts. For example, Factor1 represents the batch effects between cells and nuceli.
```{r group_boxplot_Factor1, echo = FALSE, message = FALSE, warning = FALSE, fig.width = 5, fig.height = 5}
group_boxplot <- create_factor_expr_group_boxplot(
all_factors,
"Factor1",
xvar = "tech",
yvar = "value",
fillvar = "age_bin"
)
plot(group_boxplot)
```
Factor3 is strongly correlated with the mean counts per cell of each sample.
```{r factor1_cor, echo = FALSE, message = FALSE, warning = FALSE, fig.width = 5, fig.height = 5}
mean_sample_qc_scores <- [email protected][, c("donor_tech", "mitoRatio", "nCount_RNA", "nFeature_RNA", "riboRatio", "log10GenesPerUMI")] %>%
group_by(donor_tech) %>%
summarise_all("mean") %>%
left_join(
.,
rownames_to_column(data.frame(factor_score_mat), var = "donor_tech"),
by = "donor_tech"
) %>%
select(-donor_tech)
mean_sample_qc_scores %>%
ggplot(aes(x = Factor3, y = nCount_RNA)) +
geom_point() +
stat_poly_eq(use_label("R2","P"), size = 5) +
stat_poly_line(se = FALSE) +
theme_prism() +
theme(aspect.ratio = 1) +
ggtitle("Mean nCount_RNA vs Factor 3 Score")
```
## Analysis of the Aging Factor
Now that we have identified the Aging Factor and shown that it is independent of technical noise, we will analyze the factor to extract biological insights.
First, we can investigate how much each celltype contributes to the Aging Factor. We see that the skeletal muscle cells are the strongest contributors, followed by Fibro-adipogenic progenitors. This is not surprising, since the FAPs are known to be regulators of MuSC function and muscle regeneration, and are known to be perturbed by aging.
```{r celltype_contributions, echo = FALSE, message = FALSE, warning = FALSE, fig.width = 5, fig.height = 5}
variance_explained_barplot <- create_variance_explained_barplot(mofa_model, aging_factor)
variance_explained_barplot
```
Next, we look at the the gene contributions to the Aging Factor in each celltype. We see that a large majority of genes only contribute to the Aging Factor in a few cell types. However, there about 14% of genes contribute to the Aging Factor in 3 or more celltypes. Further analysis of these genes could identify high-value targets to reverse age-related muscle wasting across multiple celltypes (but is not included in this analysis here).
```{r gene_contribution_plots, echo = FALSE, message = FALSE, warning = FALSE, fig.width = 5, fig.height = 5}
gene_counts_in_factor <- get_gene_counts_in_factor(factor_loadings, factor_cutoff_score)
shared_genes_table <- get_shared_genes_table(factor_loadings, factor_cutoff_score)
shared_genes_plot <- create_shared_genes_plot(shared_genes_table)
plot(shared_genes_plot)
```
Unsurprisingly, we see a strong concordance in celltype contributions between closely related celltypes (e.g. FastSKM & SlowSKM, or FAPs and Tenocytes).
```{r factor_loading_heatmap, echo = FALSE, message = FALSE, warning = FALSE, fig.width = 5, fig.height = 5}
factor_loadings_mat <- get_factor_loadings_matrix(mofa_model, aging_factor)
Heatmap(
factor_loadings_mat,
show_row_names = FALSE,
heatmap_legend_param = list(title = "Gene Contribution")
)
```
We can input these gene lists into downstream methods, either by taking the top genes or by using the factor score of each gene for each celltype similar to a names logfoldchange vector.
For example, we perform pathway enrichment analysis of the Aging Factor. We see that younger samples are highly enriched in pathways related to myogenesis and fatty acid metabolism. This is consistent with the literature, which shows that fatty acid metabolism is disregulated in muscle during aging; research in mice has demonstrated that restoring fatty acid metabolism to healthy levels in aged mice restores muscle mass and strength (https://doi.org/10.1126/science.abc8059).
We see that the older samples are enriched in pathways related to inflammation and DNA damage. This points to targeting inflammaging as a potential treatment for age-related muscle wasting. In my opinion, this is likely the most important mechanism to target. A [recent study](https://doi.org/10.1038/s41586-024-07701-9) found that blocking the inflammatory cytokine interleukin-11 increased lifespan in both male and female mice by over 20%, and helped them retain healthy body composition for longer.
```{r hallmark_pathway_enrichment, echo = FALSE, message = FALSE, warning = FALSE}
hallmark_pathways <- clusterProfiler::read.gmt("data/processed/h.all.v2023.1.Hs.symbols.gmt")
hallmark_pathways <- split(hallmark_pathways$gene, hallmark_pathways$term)
pval_cutoff <- 0.1
factor_loadings_filt <- factor_loadings %>% dplyr::filter(abs(value) >= factor_cutoff_score)
factor_loadings_split <- factor_loadings_filt %>%
dplyr::rename(
"celltype" = "ctype",
"gene" = "feature"
) %>%
dplyr::filter(value != 0) %>%
dplyr::mutate(dir = ifelse(value > 0, "young", "old")) %>%
dplyr::mutate(celltype = paste0(celltype, "_", dir)) %>%
dplyr::select(-dir) %>%
dplyr::mutate(value = abs(value))
factor_loadings_split_list <- factor_loadings_split %>%
dplyr::select(-value) %>%
dplyr::group_by(celltype) %>%
nest() %>%
dplyr::mutate(data = map(data, ~ .x[[1]])) %>%
deframe()
gseAnalysis <- function(geneList,Annotation_DB){
geneList = geneList[geneList %in% unique(unlist(Annotation_DB))]
ResultsDF = matrix(0,nrow = length(Annotation_DB),ncol = 5)
rownames(ResultsDF) = names(Annotation_DB)
colnames(ResultsDF) = c("GenesInPathway","GenesInList","GeneNames","p_value","corr_p_value")
DB_genecontent = length(unique(unlist(Annotation_DB)))
GenesDB = DB_genecontent
SelectedGenes = length(geneList)
for(gset in rownames(ResultsDF)){
GP = length(Annotation_DB[[gset]])
GL = length(intersect(Annotation_DB[[gset]],geneList))
ResultsDF[gset,"GenesInList"] = GL
ResultsDF[gset,"GenesInPathway"] = GP
ResultsDF[gset,"GeneNames"] = paste(intersect(Annotation_DB[[gset]],geneList),collapse = ",")
ResultsDF[gset,"p_value"] = phyper(q=GL - 1, m=GP, n=GenesDB-GP, k=SelectedGenes, lower.tail = FALSE, log.p = FALSE)
}
ResultsDF[,"corr_p_value"] = p.adjust(ResultsDF[,"p_value"],method = "BH")
ResultsDF = data.frame(ResultsDF,stringsAsFactors = F)
ResultsDF = ResultsDF[order(ResultsDF[,"p_value"]),]
ResultsDF = ResultsDF %>%
rownames_to_column("gset") %>%
mutate_at(c("GenesInPathway","GenesInList",
"p_value","corr_p_value"),
as.numeric) %>%
dplyr::arrange(corr_p_value,GenesInList)
return(ResultsDF)
}
split_path_enrich <- map(factor_loadings_split_list, gseAnalysis, Annotation_DB = hallmark_pathways)
```
```{r hallmark_pathway_plotting, echo = FALSE, message = FALSE, warning = FALSE}
## downregulated pathways (old)
col_fun_fact_neg <- circlize::colorRamp2(seq(0, 5, length = 50), hcl.colors(50, "Blues", rev = T))
neg_sets <- split_path_enrich %>%
enframe() %>%
unnest(cols = c(value)) %>%
dplyr::filter(corr_p_value < pval_cutoff,grepl("old", name)) %>%
dplyr::group_by(name) %>%
pull(gset)
neg_sets <- neg_sets[duplicated(neg_sets)] # only use pathwasy expressed in >1 cell type
neg_enrichment <- split_path_enrich %>%
enframe() %>%
unnest(cols = c(value)) %>%
dplyr::filter(gset %in% neg_sets,grepl("old", name)) %>%
dplyr::mutate(corr_p_value = -log10(corr_p_value)) %>%
dplyr::mutate(corr_p_value = ifelse(corr_p_value > 5, 5, corr_p_value)) %>%
dplyr::mutate(
gset = gsub("HALLMARK_", "", gset) %>% strtrim(width = 40),
name = gsub("_old", "", name)
)
neg_plot <- neg_enrichment %>%
dplyr::select(name, gset, corr_p_value) %>%
pivot_wider(names_from = name, values_from = corr_p_value) %>%
column_to_rownames("gset") %>%
as.matrix() %>%
ComplexHeatmap::Heatmap(.,
name = "old \n-log(pval)",
show_row_dend = F,
show_column_dend = F,
row_names_gp = grid::gpar(fontsize = 8),
column_names_gp = grid::gpar(fontsize = 10),
col = col_fun_fact_neg,
column_title = "Enriched Pathways in Old Muscle"
)
plot(neg_plot)
## upregulated pathways (young)
col_fun_fact_pos <- circlize::colorRamp2(seq(0, 5, length = 50), hcl.colors(50, "Reds", rev = T))
pos_sets <- split_path_enrich %>%
enframe() %>%
unnest(cols = c(value)) %>%
dplyr::filter(
corr_p_value <= pval_cutoff,
grepl("young", name)
) %>%
dplyr::group_by(name) %>%
dplyr::slice(1:6) %>%
pull(gset)
pos_sets <- pos_sets[duplicated(pos_sets)] # only use pathwasy expressed in >1 cell type
pos_enrichment <- split_path_enrich %>%
enframe() %>%
unnest(cols = c(value)) %>%
dplyr::filter(
gset %in% pos_sets,
grepl("young", name)
) %>%
dplyr::mutate(corr_p_value = -log10(corr_p_value)) %>%
dplyr::mutate(corr_p_value = ifelse(corr_p_value > 5, 5, corr_p_value)) %>%
dplyr::mutate(
gset = gsub("HALLMARK_", "", gset) %>% strtrim(width = 35),
name = gsub("_young", "", name)
)
pos_plot <- pos_enrichment %>%
dplyr::select(name, gset, corr_p_value) %>%
pivot_wider(names_from = name, values_from = corr_p_value) %>%
column_to_rownames("gset") %>%
ComplexHeatmap::Heatmap(.,
name = "young \n-log(pval)",
show_row_dend = F,
show_column_dend = F,
row_names_gp = grid::gpar(fontsize = 8),
column_names_gp = grid::gpar(fontsize = 10),
col = col_fun_fact_pos,
column_title = "Enriched Pathways in Young Muscle"
)
plot(pos_plot)
```
## Projection onto bulk datasets
We can extend that analysis by projecting the Aging Factor onto bulk RNAseq datasets. Although scRNAseq is great for its ability to investigate different cell types, its cost often results in a small number of donors per dataset. This is true for the snRNAseq dataset of interest, which had N=17 donors. Although this is on the larger end for a single-cell dataset, N=17 is not enough donors to get strong correlations between transcriptomic signatures and clinical information. Bulk RNAseq is much cheaper in comparison, so bulk RNAseq datasets tend to contain more donors, which enables better correlations with clinical measurements.
Here, we project the Aging Factor onto a large bulkRNAseq dataset (also from the Perez et al. manuscript) of muscle aging and [sarcopenia](https://en.wikipedia.org/wiki/Sarcopenia) (age-related muscle wasting). We show that the Aging Factor projection is able to separate donors by condition, and is a better predictor of clinical measurements of muscle strength than medical diagnosis.
This bulk RNAseq dataset consists of 19 Young Health donors, 33 Old Healthy donors, and 24 old donors diagnosed with sarcopenia. We then project the aging factor from the single-cell dataset onto the bulk RNAseq dataset, and can see that it generally separates the three clinical groups.
```{r project_onto_human, echo = FALSE, message = FALSE, warning = FALSE, fig.width = 7, fig.height = 5}
## organize for plotting
factor_score_perez <- factor_score_perez[order(rownames(factor_score_perez)), sort(colnames(factor_score_perez))]
perez_bulk_meta <- perez_bulk_meta[order(rownames(perez_bulk_meta)), ]
perez_bulk_meta$group %<>% as.character()
perez_bulk_meta$group %<>% factor(levels = c("Young Healthy", "Old Healthy", "Sarcopenia"))
row_ha <- rowAnnotation(
condition = perez_bulk_meta$group,
col = list(condition = c(
"Young Healthy" = "green",
"Old Healthy" = "yellow",
"Sarcopenia" = "darkred"
))
)
Heatmap(
factor_score_perez,
right_annotation = row_ha,
show_row_names = FALSE,
clustering_method_rows = "ward.D",
column_title = "Projection of Aging Factor onto Human Bulk RNAseq",
heatmap_legend_param = list(title = "Factor Projection Score")
)
```
```{r human_proj_celltype_diff, echo = FALSE, message = FALSE, warning = FALSE, fig.width = 7, fig.height = 5}
plot_fact_cell_proj <- data.frame(factor_score_perez, group = perez_bulk_meta$group) %>% reshape2::melt()
plot_fact_cell_proj$group %<>% factor(levels = c("Young Healthy", "Old Healthy", "Sarcopenia"))
ggplot(plot_fact_cell_proj, aes(x = variable, y = value, fill = group)) +
geom_boxplot() +
theme_prism() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
xlab(label = "") +
ylab(label = "Aging Factor Score")
```
<!-- ```{r project_onto_human_oldonly, include = FALSE, message = FALSE, warning = FALSE} -->
<!-- ## redo with only aged for clinical correlations -->
<!-- perez_bulk_meta_old <- perez_bulk_meta[perez_bulk_meta$group != "Young Healthy",] -->
<!-- factor_score_perez_old <- factor_score_perez[rownames(factor_score_perez) %in% perez_bulk_meta_old$sample,] -->
<!-- ## organize for plotting -->
<!-- factor_score_perez_old <- factor_score_perez_old[order(rownames(factor_score_perez_old)), sort(colnames(factor_score_perez_old))] -->
<!-- perez_bulk_meta_old <- perez_bulk_meta_old[order(rownames(perez_bulk_meta_old)), ] -->
<!-- perez_bulk_meta_old$group %<>% as.character() -->
<!-- perez_bulk_meta_old$group %<>% factor(levels = c("Old Healthy", "Sarcopenia")) -->
<!-- ``` -->
<!-- ```{r human_clinical_correlations, fig.width = 12, echo = FALSE, message = FALSE, warning = FALSE} -->
<!-- clin_cor <- data.frame(factor_score = factor_score_perez_old[,5], perez_bulk_meta_old[, c(6, 2:5, 7:9)]) %>% -->
<!-- rownames_to_column("sample") %>% -->
<!-- reshape2::melt(id.vars = c("factor_score", "sample", "group")) %>% -->
<!-- filter(variable != "walkTest") -->
<!-- ggplot(clin_cor, aes(x = factor_score, y = value)) + -->
<!-- geom_point(aes(color = group)) + -->
<!-- theme_classic() + -->
<!-- facet_wrap(~variable, scale = "free_y") + -->
<!-- stat_poly_line(se = FALSE) + -->
<!-- ylab("Clinical Measurement") + -->
<!-- xlab("Aging Factor Expression") -->
<!-- ``` -->
# Conclusion
Here, we have shown that we can extend factor analysis to scRNAseq data, and that this technique is a powerful method to extract important biological signal from technical noise. We additionally show that utilizing this method on a snRNAseq muscle aging dataset can identify the biological signature of aging, and that this signature is aligned with experimental results. Further analysis into the genes of this Aging Factor can reveal potential therapuetic targets for muscle wasting diseases like sarcopenia.
The factor extracted from this approach can be used similar to logFC values and passed to downstream methods, such as drug repuroposing (https://doi.org/10.1038/s41467-023-36637-3) or cell-cell communication analysis (https://doi.org/10.1038/s41592-019-0667-5, https://doi.org/10.1101/2023.08.19.553863 ).
Additionally, we can project these factors onto bulk RNAseq data for a powerful analysis of disease states, and clinical data.