dependentsimr

Generating Correlated Data for Omics Simulation

Omics data is in the “p >> n” regime, where there are far fewer samples than measurements per sample. This creates two challenges for generating realistic simulated data for benchmarking. First, there isn’t enough data to estimate a full dependence structure (e.g., a full-rank correlation matrix). Second, generating omics-scale data with a specified correlation matrix is slow because such algorithms typically cost $O(p^3)$. As a result, simulators often assume that measurements are independent, which does not reflect reality.

Here, we give a simple solution to both problems: a low-rank correlation matrix is used both to approximate the realistic dependencies in a real dataset and to generate simulated data mimicking that dataset. Using a NORTA (Normal to Anything) approach, the marginal (univariate) distributions can take realistic forms, such as the negative binomial, which is appropriate for omics data like RNA-seq read counts. Our implementation supports normal, Poisson, DESeq2-based (negative binomial with sample-specific size factors), and empirical (for ordinal data) marginal distributions. This makes it particularly suited to RNA-seq data but also widely applicable.
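
The idea of combining a low-rank dependence structure with NORTA can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the loadings `W`, the 60% shared-variance split, the means `mu`, and the dispersion `size` are all made-up values, whereas the package estimates its structure from a real dataset.

```r
set.seed(1)
p <- 200  # number of features (hypothetical)
k <- 2    # rank of the shared dependence structure
n <- 5    # number of simulated samples

# Hypothetical loadings: each feature gets 60% of its latent variance
# from k shared factors and the remaining 40% from independent noise.
W <- matrix(rnorm(p * k), p, k)
W <- W / sqrt(rowSums(W^2)) * sqrt(0.6)
resid_sd <- sqrt(1 - rowSums(W^2))

# Latent Gaussians with correlation W W^T + diag(resid_sd^2).
# The p x p matrix is never formed, so the cost is O(p * k * n).
factors <- matrix(rnorm(k * n), k, n)
z <- W %*% factors + resid_sd * matrix(rnorm(p * n), p, n)

# NORTA step: Gaussian -> uniform -> target marginal (negative binomial).
u <- pnorm(z)
mu <- rgamma(p, shape = 2, rate = 0.02)  # hypothetical per-feature means
size <- 5                                # hypothetical dispersion parameter
sim_counts <- matrix(qnbinom(u, size = size, mu = rep(mu, n)), p, n)
```

Each row of `sim_counts` is a feature and each column a simulated sample, matching the orientation the package expects for its input data.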

With this approach, simulating data that matches a given dataset (with samples in columns and measurements/features in rows), where each feature has a negative binomial distribution, is fast and simple:

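```r
library(dependentsimr)
head(read_counts) # An RNA-seq dataset
# Fit a rank-2, PCA-based dependence structure with DESeq2 (negative binomial) marginals.
# The first column of read_counts is dropped before converting to a matrix.
rs <- get_random_structure(list(counts=as.matrix(read_counts[,-1])), method="pca", rank=2, type="DESeq2")
# Draw 5 simulated samples and extract the "counts" mode
simulated_data <- draw_from_multivariate_corr(rs, n_samples=5)$counts
head(simulated_data)
```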

Finally, the package also supports simultaneously generating multiple ‘modes’ of data, as happens in multi-omics, where each mode can have a distinct marginal distribution type. For example, proteomics might have normal margins and RNA-seq the DESeq2 margins. This captures the cross-mode dependencies observed in the data as well as the intra-mode dependencies.
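
To illustrate how cross-mode dependencies can arise, here is a minimal base R sketch in which two modes share the same latent factors but use different marginals. It does not call the package; the mode sizes, loadings, normal parameters, and the Poisson margin (a simple stand-in for DESeq2 margins) are all assumptions made for the example.

```r
set.seed(2)
n <- 200     # simulated samples
k <- 2       # shared latent factors
p_prot <- 3  # hypothetical number of proteomics features
p_rna <- 4   # hypothetical number of RNA-seq features
p <- p_prot + p_rna

# One set of loadings spans both modes, so features in different
# modes are correlated through the same latent factors.
W <- matrix(rnorm(p * k), p, k)
W <- W / sqrt(rowSums(W^2)) * sqrt(0.8)
resid_sd <- sqrt(1 - rowSums(W^2))
z <- W %*% matrix(rnorm(k * n), k, n) + resid_sd * matrix(rnorm(p * n), p, n)
u <- pnorm(z)

prot_rows <- seq_len(p_prot)
rna_rows <- p_prot + seq_len(p_rna)

# Each mode gets its own marginal distribution type.
proteomics <- matrix(qnorm(u[prot_rows, ], mean = 20, sd = 2), p_prot, n)
rnaseq <- matrix(qpois(u[rna_rows, ], lambda = 50), p_rna, n)

# Correlations between proteomics features (rows) and RNA-seq features (columns)
cross_cor <- cor(t(proteomics), t(rnaseq))
```

Inspecting `cross_cor` shows how strongly each proteomics feature tracks each RNA-seq feature in the simulated data.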

Installation

You can install the development version of dependentsimr from GitHub with:

```r
# install.packages("remotes")
remotes::install_github("tgbrooks/dependent_sim")
```

Vignette

For an extended example of using this package to generate RNA-seq data, please see this vignette.