The Huttenhower Lab > idability build and evaluate hitting-set-based codes



Description is a standalone python program which processes tabular data to build and evaluate hitting-set-based codes. Table rows represent features, and table columns represent "samples" or "subjects". The entry (FS) represents subject S's value for feature F. In their simplest form, values are 1s and 0s, representing the presence or absence (respectively) of a feature within a given subject. In this simple example:

   S1 S2 S3
F1  0  1  1
F2  1  0  1
F3  1  1  0
F4  1  1  1

Subject S1 has features F2F3, and F4 but is missing feature F1.

code is a set of features that uniquely identify a given subject, in the sense that the features collectively distinguish that subject from the background population. Put another way, the subject's code features are never observed together in another subject. Features can be prioritized for inclusion in a code in a number of ways. In the simplest method, subject S's code is built from the rarest features in the population (more specifically, the code is grown iteratively by adding the rarest feature among non-S subjects that are not already excluded by a feature in the code). Under this formulation, the code is of minimal size, and its construction is analogous to the classical Hitting-Set Problem. In the simple example above, {F2F3} is a minimal code for S1. has been additionally optimized to build personalized codes from microbiome data (as derived from metagenomic sequencing experiments). In this context, minimal codes are suboptimal as they tend not to be robust to temporal variation, and so an alternative feature prioritization scheme is employed. Please refer to the Advanced Configuration and Metagenomic Codes Demo sections below for further details.


If you use or the datasets provided here in a publication, please cite:

Franzosa EA, Katherine H, Meadow JF, Gevers D, Lemon KP, Bohannan BJM, Huttenhower C. Identifying personal microbiomes using metagenomic codes. Proceedings of the National Academy of Sciences (2015): 201423854.

Prerequisites requires python 2.7+.


Clone the repository via the following command:

$ git clone

or directly download the repository as an archive and expand its contents.

Minimal Codes Demo

The repository contains a simple demo to illustrate basic use of the software (similar in spirit, but larger, than the minimal example above). To run the demo, execute:

$ ./ demo1.pcl

This produces a codes file called The first few lines of this file look like this:

S01     F15     F02     F13
S02     F14     F10     F05
S03     F08     F13

This indicates, for example, that the set of features {F02F13F15} were a unique code for subject S01. You can verify this by inspecting the input file demo1.pcl. If is passed a set of codes in addition to a table, it will apply the codes to the table and report which codes matched which subjects:

$ ./ demo1.pcl --codes

This produces a file called demo1.demo1.hits.txt. (The general form of the "hits" file is INPUT_TABLE.CODES_FILE.hits.txt; this can be configured using the program's --output flag.) The first few lines of the hits file look like:

# 1|TP: 12
# 2|TP+FP: 0
# 3|FN+FP: 0
# 4|FN: 0
# 5|NA: 3
S01     matches S01
S02     matches S02
S03     matches S03

The first five lines (beginning with #) represent a confusion matrix: they summarize which codes matched the correct subjects (true positives, TPs), which codes failed to match the correct subject (false negatives, FNs), and which codes spuriously matched the wrong subjects (false positives, FPs). Combinations of these categories are allowed. The "NA" category indicates the presence of a "null code": three of the subjects from demo1.pcl did not have a unique code as their features were a subset of some other subject's features.

The demo contains a second file, demo2.pcl, which represents a perturbation of the original table. Running the following command:

$ ./ demo2.pcl --codes

Produces a file called demo2.demo1.hits.txt. The first few lines of this file look like:

# 1|TP: 7
# 2|TP+FP: 3
# 3|FN+FP: 1
# 4|FN: 1
# 5|NA: 3
S01     matches S01
S02     no_matches
S03     matches S03     S05     S06

Note that the results now contain instances of FNs and FPs in addition to TPs. For example, S02 has been perturbed and its original code no longer matches the features detected in S02 in the second dataset (a FN). S03's original code continues to match S03, but now two other subjects (S05 and S06) also match this code (FPs).

Metagenomic Codes Demo was originally developed to explore individual-specific adaptation of human microbiomes. Specifically, we were interested to know if body sites within an individual contain a collection of microbial taxa or genes that uniquely distinguish that individual from the population. This is equivalent to the hitting-set based code construction process described above. However, our investigation revealed that minimal metagenomic codes were unstable over time. Hence, we adapted the classical greedy approach to minimal hitting-set construction to instead prioritize identification of stable metagenomic codes. The repository contains a demo based on microbial marker genes sampled from individuals involved in the Human Microbiome Project as surveyed by the MetaPhlAn software package. (A marker gene is a gene that is consistently found in isolate genomes from a given clade [here, bacterial and archaeal species] and not found outside that clade.)

To begin, unzip the two data files, which contain marker measurements for a set of 50 individuals' gut microbiomes (as represented from stool samples) sampled ~6 months apart.

$ gunzip markers-stool-visit1.pcl.gz 
$ gunzip markers-stool-visit2.pcl.gz 

Try running the default code construction process used above on the visit1 file, and then applying the visit1 codes to the visit2 table:

$ ./ markers-stool-visit1.pcl
$ ./ markers-stool-visit2.pcl --codes

This yields:

# 1|TP: 4
# 2|TP+FP: 1
# 3|FN+FP: 22
# 4|FN: 23
# 5|NA: 0

The results are less than stellar due to the prioritization of minimal (unstable) codes. Repeat this process by running the program in "meta_mode", which fine-tunes the code construction process to identify sets of features that are more likely to be stable over time:

$ ./ markers-stool-visit1.pcl --meta_mode rpkm
$ ./ markers-stool-visit2.pcl --codes --meta_mode rpkm

This yields:

# 1|TP: 43
# 2|TP+FP: 0
# 3|FN+FP: 1
# 4|FN: 6
# 5|NA: 0

The results are much better: the majority of individuals' visit2 samples still match their visit1 codes, and spurious matches are rare.

Advanced Configuration

The --meta_mode parameter used in the Microbial Community Demo above encapsulates a variety of advanced parameter settings to the program. These serve to increase code robustness and specificity in metagenomics applications. Settings include:

  • A novel feature prioritization scheme (abundance gap sorting)
  • More flexible definitions of feature presence and absence
  • A minimum code size threshold
  • Methods to exclude redundant features

These settings can be individually fine-tuned for user-specific applications. Consult the program's help menu to learn more about them:

$ ./ -h

Preprocessed Metagenomics Datasets

The following datasets were used in the publication and are available for download here. There is one gzipped tarball for each type of metagenomic features (OTUs, species, marker genes [markers], and kilobase windows [kbwindows]). Each tarball contains pairs of files for each body site: one file is a table reflecting subjects' feature measurements at their first visit (visit1), and a second file contains values from the follow-up visit (visit2). For example, otus-tables.tar.gz contains:

  • otus-anterior_nares-visit1.pcl
  • otus-anterior_nares-visit2.pcl

(Along with similar pairings for 17 other body sites.) The table formats are as described above (tab-delimitted with feature rows and subject columns). All data are derived from raw sequencing reads and analysis conducted during the Human Microbiome Project (HMP). Additional details of the features and datasets are available in the publication. Subjects are identified by HMP-issued RANDSID identifiers (column headers). Tables have been pre-processed to ensure that each subject appears in both tables with the same identifier. The OTU and species tables contain values in relative abundance units, while the MetaPhlAn markers and kilobase windows tables contain values in reads mapped per kilobase of genomic feature per million total reads (RPKM).