CCREPE

Compositionality Corrected by REnormalization and PErmutation

CCREPE is an R package designed to detect significant correlations in compositional data. Compositional data often contain ‘spurious’ correlations that can lead naive analyses astray. Microbial abundance data are compositional in nature, which makes inferring potential ecological networks from them difficult.

For more information on the technical aspects:

User Manual || User Tutorial || Forum

Citation:
Emma Schwager et al. Detecting statistically significant associations between sparse and high dimensional compositional data. (In progress)

Features

The CCREPE package comes with two functions:

- ccrepe, which provides compositionality-corrected p-values, q-values, and Z-scores for all pairwise correlations within one dataset or between two datasets;
- nc.score, which is an extension of the checkerboard score to ordinal data and thus provides a similarity measure more appropriate for compositional data analysis.

Getting started

Download CCREPE (version 1.7.0)

You can obtain the package from Bioconductor.

You can also obtain the built R package by downloading the tarball in tar.gz format.

Or you can obtain the complete source code using git:

$ git clone https://github.com/biobakery/ccrepe.git

CCREPE is released under the MIT license: it is free to use without restriction, and the authors accept no liability for its use.

Manual/Vignette

The CCREPE manual and vignette are publicly available.

Installing the CCREPE package and examples

1. Setting up CCREPE

A. Installing CCREPE as an R package

 $ R CMD INSTALL ccrepe_1.7.0.tar.gz

B. Accessing CCREPE from R

 > library(ccrepe)

2. Examples of using the ccrepe function (run in R)

A. With one dataset

> data
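A minimal sketch of a one-dataset run, under stated assumptions: the simulated matrix, the names `counts`, `testdata`, and `results`, and the small `iterations` and `min.subj` values are illustrative choices, not values prescribed by this page. The call to ccrepe is only made if the package is installed.

```r
# Simulate 10 samples x 4 features of positive abundances (illustrative data only).
set.seed(1)
counts <- matrix(rlnorm(40, meanlog = 0, sdlog = 1), nrow = 10)

# Normalize each row so it sums to 1, since ccrepe expects relative abundances.
testdata <- counts / rowSums(counts)
dimnames(testdata) <- list(paste0("Sample", 1:10), paste0("Feature", 1:4))

# Run ccrepe only if the package is available; small settings keep the run short.
if (requireNamespace("ccrepe", quietly = TRUE)) {
  results <- ccrepe::ccrepe(x = testdata, iterations = 20, min.subj = 10)
  print(results$p.values)
}
```

Lowering `min.subj` below its default of 20 is needed here only because the toy dataset has 10 samples.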

B. With two datasets

> data1
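A hedged sketch of a two-dataset run. The helper `make_composition`, the simulated data, and the feature names are hypothetical and exist only for illustration. Both matrices share the same row names because, as described in the arguments section, x and y are merged by row names.

```r
# Hypothetical helper: simulate a row-normalized abundance matrix.
make_composition <- function(n_samples, n_features, prefix) {
  counts <- matrix(rlnorm(n_samples * n_features), nrow = n_samples)
  comp <- counts / rowSums(counts)  # each row now sums to 1
  dimnames(comp) <- list(paste0("Sample", seq_len(n_samples)),
                         paste0(prefix, seq_len(n_features)))
  comp
}

set.seed(2)
data1 <- make_composition(10, 4, "SiteA_")
data2 <- make_composition(10, 6, "SiteB_")

# The datasets are matched on row names, so check the overlap before running.
shared <- intersect(rownames(data1), rownames(data2))

if (requireNamespace("ccrepe", quietly = TRUE)) {
  results <- ccrepe::ccrepe(x = data1, y = data2, iterations = 20, min.subj = 10)
  print(results$q.values)
}
```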

C. With an example user-defined similarity measure

See the function parameters for the requirements of user-defined similarity measures.

ccrepeSampleTestFunction
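As an illustration of those requirements, here is a sketch of a user-defined measure written in base R. The name `my_similarity` is hypothetical, and Spearman correlation is used only as a stand-in for whatever score you want to supply. It accepts either two vectors or one matrix/data frame, names its inputs x and y, and returns a single number or a symmetric matrix accordingly.

```r
# Hypothetical user-defined similarity measure (a stand-in, not part of CCREPE).
my_similarity <- function(x, y = NULL) {
  if (is.null(y)) {
    # One input (matrix or data frame): symmetric matrix of column-by-column scores.
    cor(as.matrix(x), method = "spearman")
  } else {
    # Two inputs (vectors): a single number.
    cor(x, y, method = "spearman")
  }
}

# Check both calling conventions on made-up data.
set.seed(3)
m <- matrix(runif(30), nrow = 10)
scores <- my_similarity(m)              # 3 x 3 matrix
single <- my_similarity(m[, 1], m[, 2]) # one number
```

A function of this shape would then be passed to ccrepe through the `sim.score` argument, for example `sim.score = my_similarity`.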

3. Examples of using the nc.score function (run in R)

A. With one matrix

data

B. With two vectors

data

C. Specifying the number of bins

data

D. Specifying the bin cutoffs

 data
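The arguments section notes that a vector of cutoffs is applied with the base R function findInterval. The sketch below shows what that discretization does to one feature; the abundance values and cutoffs are made up for illustration and are not defaults of nc.score.

```r
# Made-up relative abundances for one feature across six samples.
abundance <- c(0.00, 0.02, 0.10, 0.35, 0.60, 0.05)

# Illustrative cutoffs (an assumption, not package defaults).
cutoffs <- c(0.01, 0.1, 0.5)

# findInterval returns, for each value, how many cutoffs are <= that value.
bins <- findInterval(abundance, cutoffs)
print(bins)
```

Values below the first cutoff fall in bin 0, and a value equal to a cutoff is placed in the higher bin.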

ccrepe function usage and parameters

ccrepe usage:
ccrepe(
x = NA,
y = NA,
sim.score = cor,
sim.score.args = list(),
min.subj = 20,
iterations = 1000,
subset.cols.x = NULL,
subset.cols.y = NULL,
errthresh1 = 1e-04,
verbose = FALSE,
iterations.gap = 100,
distributions = NA,
compare.within.x = TRUE,
concurrent.output=NA,
make.output.table=FALSE)

ccrepe arguments:

x First data frame or matrix containing the relative abundances for the first dataset (cavity1): columns are bugs, rows are samples.
(Rows should therefore sum to a constant.)
The subject IDs, if present, are assumed to be the row names and NOT the first column of data.
y Second data frame or matrix (optional) containing the relative abundances for the second dataset (cavity2): columns are bugs, rows are samples.
The subject IDs, if present, are assumed to be the row names.
If both x and y are specified, they will be merged by row names. If row names are missing from either or both datasets,
the default is to use the row numbers as subject IDs.
sim.score A function defining a similarity measure, such as cor or nc.score. This similarity measure can be a pre-defined R function or user-defined. If the latter,
certain properties should be satisfied as detailed below (also see examples). The default similarity measure is Spearman correlation.
A user-defined similarity measure should:
1. Be able to take either two inputs which are vectors or one input which is either a matrix or a data frame.
2. In the case of two inputs, return a single number.
3. In the case of one input, return a matrix in which the (i,j)th entry is the similarity score for column i and column j in the original matrix.
4. Return a symmetric matrix in the case of one input.
5. Name its inputs x and y.
sim.score.args A list of arguments for the measurement function.
For example: In the case of cor, the following would be acceptable:
sim.score.args = list(method='spearman', use='complete.obs').
Note that this is the default behavior.
min.subj Minimum number of subjects that must be measured in a bug/feature/column in order to apply the similarity measure
to that bug/feature/column. This is to ensure that there are sufficient subjects to perform a bootstrap (default: 20)
iterations The number of iterations of bootstrap and permutation (default: 1000).
subset.cols.x Subset of columns from x to work on (default: NULL, meaning use all columns of x).
All the columns of x are used for normalization, but calculations are performed only with the requested subset (by default, all columns).
subset.cols.y Subset of columns from y to work on (default: NULL, meaning use all columns of y).
If applicable (y present), all the columns of y are used for normalization, but calculations are performed only with the requested subset (by default, all columns).
errthresh1 A numeric threshold on the probability of getting all 0’s in a given bootstrapped column of the first dataset.
If a bug/feature/column has enough zeros that the probability of obtaining all zeros when sampling
with replacement is > errthresh1, that bug/feature/column will be excluded from the subsequent analysis. This is
to ensure that the standard deviation of the bootstrap sample is non-zero (default: 0.0001).
verbose Logical: an indicator whether the user requested verbose output, which prints periodic progress of the algorithm through the dataset(s), as well as including more detailed output. (default:FALSE)
iterations.gap If verbose output is requested, the number of iterations between status messages (default: 100; used only if verbose=TRUE).
distributions Output Distribution file (default:NA).
compare.within.x A boolean value indicating whether to do comparisons given by taking all subsets of size 2 from subset.cols.x or to do comparisons given by taking all possible combinations of subset.cols.x or subset.cols.y. If TRUE but subset.cols.y=NA, returns all comparisons involving any features in subset.cols.x (default: TRUE)
concurrent.output Optional output file to which each comparison will be written as it is calculated (default:NA).
make.output.table A boolean value indicating whether to include table-formatted output (default:FALSE).
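The errthresh1 rule described above can be sketched in base R. This illustrates the stated rule and is not the package's internal code; the function name `prob_all_zero` and the toy matrix are hypothetical. For a column with n samples, the probability that n draws with replacement are all zero is the fraction of zeros raised to the power n.

```r
# Probability that a bootstrap resample (n draws with replacement) of a column
# contains only zeros: (fraction of zeros)^n.
prob_all_zero <- function(column) {
  n <- length(column)
  (sum(column == 0) / n)^n
}

# Toy dataset: 20 samples, three features with 0, 10 and 19 zeros respectively.
x <- cbind(
  dense  = rep(0.5, 20),
  half   = c(rep(0, 10), rep(0.3, 10)),
  sparse = c(rep(0, 19), 0.2)
)

errthresh1 <- 1e-04
p_zero <- apply(x, 2, prob_all_zero)

# Features whose all-zero probability exceeds errthresh1 would be excluded.
keep <- p_zero <= errthresh1
print(p_zero)
print(keep)
```

Here the feature with 19 zeros out of 20 has an all-zero probability of about 0.36 and would be dropped, while the half-zero feature (about 9.5e-07) stays.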

Output: Returns a list containing the calculation results and the parameters used
Default parameters shown:

min.subj Description above
errThresh Description above
sim.score A matrix of the similarity scores for all the requested comparisons. The (i,j)th
element of sim.score corresponds to the similarity score of column i (or the ith column of subset.cols.x)
and column j (or the jth column of subset.cols.x) in one dataset, or
to the similarity score of column i (or the ith column of subset.cols.x) in dataset x
and column j (or the jth column of subset.cols.y) in dataset y in the case of two datasets.
p.values A matrix of the p-values for all the requested comparisons. The (i,j)th element of p.values
corresponds to the p-value of the (i,j)th element of sim.score.
q.values A matrix of the Benjamini-Hochberg-Yekutieli FDR corrected p-values. The (i,j)th element
of q.values corresponds to the q-value of the (i,j)th element of sim.score.
z.stat A matrix of the z-statistics used in generating the p-values for all requested comparisons. The (i,j)th element corresponds to the z-statistic generating the (i,j)th element of p.values.

Additional parameters if verbose=TRUE:

iterations Description Above
subset.cols.x Description Above
subset.cols.y Description Above
iterations.gap Description Above
sim.score.parameters Description Above
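To give a feel for how q-values relate to p-values, the sketch below applies the Benjamini-Yekutieli adjustment from base R's p.adjust to a made-up vector of p-values. Whether ccrepe computes its q.values in exactly this way is an assumption; the example only illustrates the kind of correction named above.

```r
# Made-up p-values standing in for the entries of a p.values matrix.
p <- c(0.001, 0.008, 0.039, 0.041, 0.20, 0.74)

# Benjamini-Yekutieli FDR adjustment from base R (stats package).
q <- p.adjust(p, method = "BY")

print(data.frame(p = p, q = q))
```

Each adjusted value is at least as large as the corresponding p-value and is capped at 1.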

nc.score function usage and parameters

nc.score usage:
nc.score(x = NA,
y = NA,
bins = NA,
verbose = FALSE,
min.abundance = 1e-04,
min.samples = 0.1)

nc.score arguments:

x A numeric vector, data frame, or matrix. The first entity to be processed. Columns are bugs, rows are samples.
The subjectIDs, if present, are assumed to be the row names and NOT the first column of data.
y Numeric vector. If supplied, x must be a numeric vector as well.
bins Either a single integer specifying the number of bins to use or a numeric vector specifying the cutoffs.
If a single number is given, this is used as the number of bins in the discretize function of the package infotheo.
If a vector is specified, the function findInterval is used to discretize the data according to the cutoffs given.
The default behavior is to use the defaults for the discretize function.
verbose Logical flag to request verbose output
min.abundance A numeric value specifying the minimum abundance threshold: ensures selection of species with abundance >= min.abundance in more than a proportion min.samples of the samples (default: 1e-04).
min.samples A numeric value specifying the minimum samples threshold, given as a proportion of samples: ensures selection of species with abundance >= min.abundance in more than that proportion of samples (default: 0.1).
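The min.abundance and min.samples filter can be sketched in base R as follows. This illustrates the rule as described, not nc.score's internal code. It assumes min.samples is a proportion (consistent with its default of 0.1), and the toy matrix and the names `frac_present` and `selected` are hypothetical.

```r
min.abundance <- 1e-04
min.samples <- 0.1

# Toy matrix: 10 samples (rows) x 4 features (columns); values are made up.
x <- cbind(
  common     = c(0.50, 0.40, 0.60, 0.30, 0.70, 0.50, 0.45, 0.55, 0.65, 0.35),
  occasional = c(0.20, 0.10, 0.30, 0, 0, 0, 0, 0, 0, 0),
  trace      = rep(0.00005, 10),  # always below min.abundance
  absent     = rep(0, 10)
)

# Proportion of samples in which each feature reaches min.abundance.
frac_present <- colMeans(x >= min.abundance)

# Keep features present in more than min.samples of the samples.
selected <- frac_present > min.samples
print(selected)
```

Only the first two features pass: one is present in every sample and the other in 3 of 10, while the remaining two never reach the abundance threshold.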