Omics data is in the “p >> n” regime, where there are fewer samples
than measurements per sample. This creates two challenges in generating
realistic simulated data for benchmarking. First, there
isn’t enough data to estimate a full dependence structure (e.g., a
full-rank correlation matrix). Second, generating omics-scale data with
a specified correlation matrix is slow due to the typical
Here, we give a simple solution to both problems: a low-rank correlation matrix is used both to approximate the dependencies in a real dataset and to generate simulated data mimicking that dataset. Using a NORTA (Normal to Anything) approach, the marginal (univariate) distributions can take realistic forms, such as the negative binomial, which is appropriate for omics data like RNA-seq read counts. Our implementation supports normal, Poisson, DESeq2-based (negative binomial with sample-specific size factors), and empirical (for ordinal data) marginal distributions. This makes it particularly suited to RNA-seq data but also widely applicable.
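To make the idea concrete, here is a minimal base-R sketch of low-rank NORTA sampling. It is a generic illustration, not the package's implementation: the loadings `W`, the per-feature means `mu`, the dispersion `size = 5`, and all dimensions are made-up values chosen for the example. The latent correlation matrix has the form "low-rank plus diagonal", so correlated normals can be drawn without forming or factoring a p x p matrix.

```r
# Generic low-rank NORTA sketch in base R (not the dependentsimr internals).
set.seed(1)
p <- 200  # number of features (hypothetical)
n <- 50   # number of samples to simulate (hypothetical)
k <- 2    # rank of the shared dependence structure

# Hypothetical p x k loadings, scaled so each feature has shared variance 0.64
W <- matrix(rnorm(p * k), nrow = p)
W <- 0.8 * W / sqrt(rowSums(W^2))
resid_sd <- sqrt(1 - rowSums(W^2))  # independent part tops each variance up to 1

# Latent normals with correlation W %*% t(W) + diag(resid_sd^2).
# Only k shared factors per sample are drawn; no p x p matrix is built.
z <- W %*% matrix(rnorm(k * n), nrow = k) +
  resid_sd * matrix(rnorm(p * n), nrow = p)

# NORTA step: normal -> uniform -> negative binomial margins
u <- pnorm(z)
mu <- rexp(p, rate = 1 / 100)  # hypothetical per-feature mean counts
sim <- matrix(qnbinom(u, size = 5, mu = rep(mu, times = n)), nrow = p)

dim(sim)  # features in rows, samples in columns
```

The cost of drawing the latent normals here grows linearly in p for a fixed rank k, which is what makes this construction practical at omics scale.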
Using this, simulating data that matches a given dataset (with samples in
columns and measurements/features in rows), with each feature having a
negative binomial distribution, is fast and simple:
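library(dependentsimr)
head(read_counts) # An RNA-seq dataset
# Fit a rank-2 PCA-based dependence structure with DESeq2 margins.
# read_counts[,-1] drops the first column, keeping only the count columns.
rs <- get_random_structure(list(counts=as.matrix(read_counts[,-1])), method="pca", rank=2, type="DESeq2")
# Draw 5 simulated samples and extract the "counts" mode
simulated_data <- draw_from_multivariate_corr(rs, n_samples=5)$counts
head(simulated_data)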
Finally, this also supports simultaneously generating multiple ‘modes’ of data, as happens in multi-omics, where each mode can have a distinct marginal distribution type. For example, proteomics might have normal margins and RNA-seq the DESeq2 margins. This captures the cross-mode dependencies observed in the data as well as the intra-mode dependencies.
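As a rough illustration of how cross-mode dependence can arise, the base-R sketch below drives two modes from the same latent factors and then gives each mode its own margin type. This is a generic sketch under assumed values, not the package's API or implementation: the mode sizes, loadings, and marginal parameters (mean 20 and SD 2 for the normal mode; `size = 5`, `mu = 50` for the count mode) are all hypothetical.

```r
# Generic two-mode sketch in base R (not the dependentsimr internals).
set.seed(2)
n <- 40       # samples to simulate (hypothetical)
k <- 2        # number of shared latent factors
p_prot <- 30  # features in the normal-margin mode (hypothetical)
p_rna <- 60   # features in the count-margin mode (hypothetical)
p <- p_prot + p_rna

# One set of loadings spanning both modes, so the factors are shared
W <- matrix(rnorm(p * k), nrow = p)
W <- 0.8 * W / sqrt(rowSums(W^2))
z <- W %*% matrix(rnorm(k * n), nrow = k) +
  sqrt(1 - rowSums(W^2)) * matrix(rnorm(p * n), nrow = p)

prot_rows <- seq_len(p_prot)

# Mode 1: normal margins, obtained by shifting and scaling the latent normals
proteomics <- 20 + 2 * z[prot_rows, ]

# Mode 2: negative binomial margins via the NORTA transform
rna <- matrix(qnbinom(pnorm(z[-prot_rows, ]), size = 5, mu = 50), nrow = p_rna)

# Features in different modes stay correlated through the shared factors
cor(proteomics[1, ], rna[1, ])
```

Because both modes are transformed from one joint latent draw, correlations between a proteomics feature and an RNA-seq feature are preserved alongside the correlations within each mode.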
You can install the development version of dependentsimr from GitHub with:
# install.packages("remotes")
remotes::install_github("tgbrooks/dependent_sim")
For an extended example of using this package to generate RNA-seq data, please see this vignette.