User Manual, Version 0.0.1
- Authors: Yo Sup Moon, Curtis Huttenhower
Notes
-
IDEA: Try out some principled clusterings based on mutual information and related measures, and see how they perform.
-
Increase flexibility of the model by performing ENSEMBLE CLUSTERING. Try out all-against-all testing as a proof of concept.
-
Obviously, this is a subcase of HALLA without the hierarchical clustering. Build up, not down.
-
WANT: (1) constant baseline property, (2) metric property, (3) normalization
-
You can characterize the "discretization process" as follows: effectively, what you are doing when you discretize is quantiling the support of the probability distribution, saying that close contiguous regions hold the most "information" when trying to do a density estimate (for whatever metric you are trying to estimate afterward). In that sense, you can think of the discretization as a very specific version of k-means clustering with Euclidean distance on the support supp(X). Parametric clustering can be done with K specified; non-parametric clustering can be done by using e.g. a Dirichlet Process (Pitman-Yor, two-parameter beta, etc.) to estimate the densities.
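A minimal sketch of this view, assuming numpy and scikit-learn are available; quantile_discretize and kmeans_discretize are illustrative names, not part of any existing API:

    # Discretization as 1-D clustering on supp(X): equal-frequency (quantile)
    # binning versus k-means with Euclidean distance. K is fixed here (the
    # parametric case); a Dirichlet-Process mixture would remove that choice.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_mutual_info_score

    def quantile_discretize(x, k):
        """Cut the support at the interior k-quantiles (equal-frequency bins)."""
        edges = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])
        return np.digitize(x, edges)

    def kmeans_discretize(x, k, seed=0):
        """The same idea as a 1-D k-means clustering of the observed support."""
        km = KMeans(n_clusters=k, n_init=10, random_state=seed)
        return km.fit_predict(x.reshape(-1, 1))

    x = np.random.gamma(shape=2.0, scale=1.0, size=500)
    print(adjusted_mutual_info_score(quantile_discretize(x, 4),
                                     kmeans_discretize(x, 4)))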
-
Important to note: measures of divergence defined on the atoms of the distribution are obviously conserved under bijective name-swappings, but not under actual permutations.
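A small illustration of the point, assuming scipy (scipy.stats.entropy(p, q) computes KL(p || q)); the "permutation" case below is one reading of the note, where the masses of only one distribution are rearranged:

    # KL divergence over discrete atoms is unchanged when the SAME bijective
    # relabeling is applied to both distributions, but changes when the masses
    # of only one distribution are actually permuted.
    import numpy as np
    from scipy.stats import entropy   # entropy(p, q) = KL(p || q)

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])
    relabel = [2, 0, 1]                        # bijective name swap

    print(entropy(p, q))                       # original KL(p || q)
    print(entropy(p[relabel], q[relabel]))     # same value: names do not matter
    print(entropy(p, q[relabel]))              # different: masses were moved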
-
Let the number of clusters be K, the number of data examples be N. If N/K is small, then a correction for chance is needed to curb the effects of cluster size biasing the mutual information calculation.
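A quick check of this effect, assuming scikit-learn: two independent random labelings with small N/K still show substantial raw mutual information, while the chance-corrected score stays near zero.

    # When N/K is small, raw mutual information between two independent
    # clusterings is biased upward; the adjusted score corrects for chance.
    import numpy as np
    from sklearn.metrics import mutual_info_score, adjusted_mutual_info_score

    rng = np.random.default_rng(0)
    N, K = 60, 20                               # only ~3 examples per cluster
    u = rng.integers(0, K, size=N)
    v = rng.integers(0, K, size=N)              # independent of u

    print(mutual_info_score(u, v))              # clearly positive by chance alone
    print(adjusted_mutual_info_score(u, v))     # approximately 0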
-
- Notion of consensus index:
-
- Consensus clustering is not just another clustering algorithm: rather, it provides a framework for unifying the knowledge obtained from other algorithms.
- Given a data set, consensus clustering employs one or several clustering algorithms to generate a set of clustering solutions on either the original data set or its perturbed versions.
- Objective: discover high quality cluster structure
- Alternate objective: discover appropriate number of clusters present
- Vinh et al.: empirically, the set of clusterings obtained tends to be less diverse when the specified number of clusters coincides with the true number of clusters.
- To quantify this, define the consensus index
      CI(U_K) = [ ∑_{i<j} AM(U_i, U_j) ] / [ B(B − 1)/2 ]
  where U_K = {U_1, …, U_B} is the set of B clustering solutions, each with K clusters, and AM is a suitable similarity measure between clusterings.
- The optimal number of clusters is K* = argmax_{K = 2, …, K_max} CI(U_K).
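A sketch of this selection rule, assuming scikit-learn, with AMI standing in for AM and bootstrap resampling providing the perturbed data sets; consensus_index and pick_k are illustrative names only:

    # CI(U_K) as the mean pairwise similarity over the B clustering solutions,
    # and K* chosen as the argmax over K = 2, ..., K_max.
    import numpy as np
    from itertools import combinations
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_mutual_info_score

    def consensus_index(clusterings):
        """Average AM(U_i, U_j) over all B(B - 1)/2 pairs of solutions."""
        return np.mean([adjusted_mutual_info_score(u, v)
                        for u, v in combinations(clusterings, 2)])

    def pick_k(X, k_max, B=20, seed=0):
        """Cluster B bootstrap perturbations for each K and return argmax CI."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        scores = {}
        for k in range(2, k_max + 1):
            solutions = []
            for b in range(B):
                idx = rng.choice(n, size=n, replace=True)      # perturbed data
                km = KMeans(n_clusters=k, n_init=5, random_state=b).fit(X[idx])
                solutions.append(km.predict(X))                # label original points
            scores[k] = consensus_index(solutions)
        return max(scores, key=scores.get), scores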
-
Conclusion from the Vinh et al. paper: use NID (normalized information distance) or NVI (normalized variation of information) for general-purpose use, but be cautious about which properties you want your metric to have.
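For reference, a sketch of the two quantities under their Vinh et al. definitions, assuming scikit-learn's mutual_info_score (natural log) and scipy for the entropies; nid and nvi are illustrative names:

    # NID = 1 - I(U,V)/max(H(U), H(V));  NVI = 1 - I(U,V)/H(U,V).
    import numpy as np
    from scipy.stats import entropy
    from sklearn.metrics import mutual_info_score

    def _h(labels):
        """Empirical entropy (nats) of a labeling."""
        _, counts = np.unique(labels, return_counts=True)
        return entropy(counts)

    def nid(u, v):
        """Normalized information distance."""
        return 1.0 - mutual_info_score(u, v) / max(_h(u), _h(v))

    def nvi(u, v):
        """Normalized variation of information."""
        i = mutual_info_score(u, v)
        return 1.0 - i / (_h(u) + _h(v) - i)    # H(U,V) = H(U) + H(V) - I(U,V)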
-
- We can proceed with this consensus-index approach and see some results, and also go for a more complicated weighted-voting (averaging, fuzzy assignment) procedure for a 1-tier solution (sketched below). HALLA can be seen as a generalization.
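One concrete version of the averaging idea is evidence accumulation over a co-association matrix; a sketch assuming numpy and scipy, with coassociation_consensus as an illustrative name:

    # "Weighted voting" by averaging: C[i, j] = fraction of the B solutions in
    # which points i and j share a cluster; cut an average-linkage tree on 1 - C.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def coassociation_consensus(clusterings, k):
        """Consensus labeling with k clusters from a list of label vectors."""
        co = np.mean([np.equal.outer(l, l) for l in clusterings], axis=0)
        dist = 1.0 - co
        np.fill_diagonal(dist, 0.0)
        tree = linkage(squareform(dist, checks=False), method="average")
        return fcluster(tree, t=k, criterion="maxclust")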
-
- Cluster bags using your favorite similarity measure, then perform all-against-all (see the sketch at the end of these notes).
- Idea: for extremely, extremely large data sets, it is better to have a hierarchical scheme for testing, since the full combinatorial divvy-up of all-against-all tests is expensive.
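A rough sketch of the bag-then-test scheme, assuming numpy and scipy and that the two data sets share the same samples (columns); bag_features, permutation_pvalue, and all_against_all are illustrative names, not an existing interface:

    # Cluster the features (rows) of each data set into bags, take one
    # representative per bag, and test all representative pairs with a
    # permutation p-value on |Spearman rho|.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist
    from scipy.stats import spearmanr

    def bag_features(X, k):
        """Cluster rows of X into k bags; return bag labels and the row index
        of each bag's representative (the member closest to the bag mean)."""
        bags = fcluster(linkage(pdist(X, metric="correlation"), method="average"),
                        t=k, criterion="maxclust")
        reps = []
        for b in np.unique(bags):
            idx = np.where(bags == b)[0]
            sub = X[idx]
            reps.append(idx[np.argmin(np.linalg.norm(sub - sub.mean(axis=0), axis=1))])
        return bags, np.array(reps)

    def permutation_pvalue(x, y, n_perm=1000, seed=0):
        """Permutation p-value for |Spearman rho| between two feature vectors."""
        rng = np.random.default_rng(seed)
        obs = abs(spearmanr(x, y)[0])
        null = [abs(spearmanr(x, rng.permutation(y))[0]) for _ in range(n_perm)]
        return (1 + sum(s >= obs for s in null)) / (n_perm + 1)

    def all_against_all(X, Y, kx, ky):
        """Test every pair of bag representatives between the two data sets."""
        _, rx = bag_features(X, kx)
        _, ry = bag_features(Y, ky)
        return {(i, j): permutation_pvalue(X[i], Y[j]) for i in rx for j in ry}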