SparseDOSSA 2 is an upcoming software for the simulation of null and metadata-associated microbial abundance profiles. It has functionalities for a) fitting to user-provided microbial template datasets, b) synthesizing microbial abundances similar to either user-provided or pre-trained templates, and c) additionally simulating associations with sample metadata or among microbial features. We implemented SparseDOSSA 2 as an R package, currently available at its GitHub repo. We plan to submit the package for Bioconductor 3.12, to be released this fall.
Will add once the paper is out
SparseDOSSA2 is available through GitHub as of now and can be installed via the following command.
The most important functionality of
SparseDOSSA2 is the simulation of realistic synthetic microbial observations. To this end,
SparseDOSSA2 provides three pre-trained templates,
"IBD", targeting continuous, discrete, and diseased population structures.
Stool_simulation <- SparseDOSSA2(template = "Stool", n_sample = 100, n_feature = 100, verbose = TRUE) Vaginal_simulation <- SparseDOSSA2(template = "Vaginal", n_sample = 100, n_feature = 100, verbose = TRUE)
SparseDOSSA2 provide two functions,
fitCV_SparseDOSSA2, to fit the SparseDOSSA2 model to microbial count or relative abundance observations. For these functions, as input,
SparseDOSSA2 requires a feature-by-sample table of microbial abundance observations. We provide with SparseDOSSA2 a minimal example of such a dataset: a five-by-five of the HMP1-II stool study.
data("Stool_subset", package = "SparseDOSSA2") # columns are samples. Stool_subset[1:2, 1, drop = FALSE]
fit_SparseDOSSA2fits the SparseDOSSA2 model to estimate the model parameters: per-feature prevalence, mean and standard deviation of non-zero abundances, and feature-feature correlations. It also estimates joint distribution of these parameters and (if input is count) a read count distribution.
fitted <- fit_SparseDOSSA2(data = Stool_subset, control = list(verbose = TRUE)) # fitted mean log non-zero abundance values of the first two features fitted$EM_fit$fit$mu[1:2]
fitCV_SparseDOSSA2. They can either provide a vector of tuning parameter values (
lambdas) to control sparsity in the estimation of the correlation matrix parameter, or a grid will be selected automatically.
fitCV_SparseDOSSA2uses cross validation to select an "optimal" model fit across these tuning parameters via average testing log-likelihood. This is a computationally intensive procedure, and best-suited for users that would like accurate fitting to the input dataset, for best simulated new microbial observations on the same features as the input (i.e. not new features).
set.seed(1) fitted_CV <- fitCV_SparseDOSSA2(data = Stool_subset, lambdas = c(0.1, 1), K = 2, control = list(verbose = TRUE)) # the average log likelihood of different tuning parameters apply(fitted_CV$EM_fit$logLik_CV, 2, mean) # The second lambda (1) had better performance in terms of log likelihood, # and will be selected as the default fit.
sessionInfo() R version 3.6.2 (2019-12-12) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS Mojave 10.14.6 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib locale:  en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages:  stats graphics grDevices utils datasets methods base other attached packages:  SparseDOSSA2_0.99.0 Rmpfr_0.8-2 gmp_0.6-1 igraph_1.2.6  truncnorm_1.0-8 magrittr_2.0.1 future.apply_1.7.0 future_1.21.0  huge_220.127.116.11 mvtnorm_1.1-1 ks_1.11.7 BiocCheck_1.22.0 loaded via a namespace (and not attached):  Rcpp_1.0.5 compiler_3.6.2 BiocManager_1.30.10 bitops_1.0-6  tools_3.6.2 digest_0.6.27 mclust_5.4.7 jsonlite_1.7.2  lattice_0.20-41 pkgconfig_2.0.3 Matrix_1.2-18 graph_1.64.0  curl_4.3 parallel_3.6.2 xfun_0.20 stringr_1.4.0  httr_1.4.2 knitr_1.30 globals_0.14.0 stats4_3.6.2  grid_3.6.2 getopt_1.20.3 optparse_1.6.6 Biobase_2.46.0  listenv_0.8.0 R6_2.5.0 parallelly_1.23.0 XML_3.99-0.3  RBGL_1.62.1 codetools_0.2-18 biocViews_1.54.0 BiocGenerics_0.32.0  MASS_7.3-53 stringdist_0.9.6.3 RUnit_0.4.32 KernSmooth_2.23-18  stringi_1.5.3 RCurl_1.98-1.2