microPITA

microbiomes: Picking Interesting Taxonomic Abundance

microPITA is a computational tool enabling sample selection in two-stage (tiered) studies. Two-stage designs can allocate resources more efficiently, reducing study costs and maximizing the use of samples. From a survey study, samples can be selected to target various microbial communities, including:

  • Samples with the most diverse community (maximum diversity);
  • Samples dominated by specific microbes (targeted feature);
  • Samples with microbial communities representative of the survey (representative dissimilarity);
  • Samples with the most extreme microbial communities in the survey (most dissimilar);
  • Given a phenotype (like disease state), samples at the border of phenotypes (discriminant) or samples typical of each phenotype (distinct).

Additionally, methods can leverage clinical metadata by stratifying samples into groups. This enables the use of microPITA in cohort studies.

For more information on the technical aspects:

User Manual || User Tutorial || Forum

Citation:
Tickle T, Segata N, Waldron L, Weingart U, Huttenhower C. Two-stage microbial community experimental design. ISMEJ 2013

PCL files associated with this publication are available through the "Click here for files" link.

Use MicroPITA in Galaxy

To use microPITA as a Galaxy module, visit huttenhower.sph.harvard.edu/galaxy/

Download microPITA (version 1.1.0)

microPITA is covered under the MIT copyright license and is free to use without restriction in use or liability to the authors.

You can obtain the complete analysis package using git:

$ git clone https://github.com/biobakery/micropita

 

Getting started

  • MicroPITA unsupervised method selection in the HMP 16S Gut Microbiome. Selection of 10 samples using targeted feature targeting Bacteroides (blue), maximum diversity (orange), representative dissimilarity (purple), and most dissimilar (pink), using Principal Coordinates Analysis (PCoA) for ordination. Targeted feature selects samples dominated by Bacteroides (upper left), while maximum diversity selects more diverse samples away from Bacteroides-dominant samples. Representative selection selects samples covering the range of samples in the PCoA plot, focusing on the higher-density central region, while most dissimilar selects samples at the periphery of the plot.

    Serial unsupervised method selection in the HMP 16S Gut Microbiome. Selection progresses in groups of 10, from 10 to 228 selected HMP Gut Microbiome samples, using maximum diversity (orange), targeted feature targeting Bacteroides (blue), representative dissimilarity (purple), and most dissimilar (pink). Sample space is visualized using PCoA for ordination. All trends seen in the selection of 10 samples continue throughout selection: maximum diversity and targeted feature select communities at opposite ends of the sample space, representative selection covers the full sample space, and most dissimilar selects from the periphery toward the core of the sample space.



    Common commands

    These common commands can be used on the default data set obtained when downloading microPITA. Simply copy and paste them into a command line in the downloaded microPITA directory.

    Expected input file.

    I. PCL file or BIOM file

    BIOM file definition:
    For BIOM file definition please see http://biom-format.org/

    PCL file definition:
    Although some defaults can be changed, microPITA expects a PCL file as an input file. Several PCL files are supplied by default in the input directory. A PCL file is a delimited text file, similar to an Excel spreadsheet, with the following characteristics.

    1. Rows represent metadata and features (bugs); columns represent samples.
    2. By default, the first row should contain the sample IDs.
    3. Metadata rows should come next.
    4. Rows containing feature (bug) measurements (like abundance) should come after the metadata rows.
    5. The first column should contain the ID describing the row. For metadata this may be, for example, “Age” for a row containing the age of the patients donating the samples. For measurements, this should be the feature name (bug name).
    6. By default the file is expected to be TAB delimited.
    7. If a consensus lineage or hierarchy of taxonomy is contained in the feature name, the default delimiter between clades is the pipe (“|”).
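
The layout described above can be illustrated with a minimal Python 3 sketch that splits a small in-memory PCL table into sample IDs, metadata rows, and feature rows. The function name `parse_pcl`, the example table, and the use of the standard `csv` module are assumptions made for illustration; this is not microPITA's own reader.

```python
import csv
import io

# Hypothetical example table: the first row holds sample IDs, "Label" is the
# last metadata row, and the remaining rows are features (bugs).
EXAMPLE_PCL = (
    "ID\tSample1\tSample2\tSample3\n"
    "Age\t34\t51\t27\n"
    "Label\tcase\tcontrol\tcase\n"
    "Bacteria|Firmicutes\t10\t0\t4\n"
    "Bacteria|Bacteroidetes\t2\t9\t6\n"
)


def parse_pcl(text, last_metadata, delimiter="\t"):
    """Split PCL text into (sample_ids, metadata, features)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    sample_ids = rows[0][1:]
    metadata, features = {}, {}
    in_metadata = True
    for row in rows[1:]:
        name, values = row[0], row[1:]
        if in_metadata:
            metadata[name] = values
            # Everything after the last metadata row is treated as a feature row.
            if name == last_metadata:
                in_metadata = False
        else:
            features[name] = [float(v) for v in values]
    return sample_ids, metadata, features


sample_ids, metadata, features = parse_pcl(EXAMPLE_PCL, last_metadata="Label")
print(sample_ids)
print(sorted(metadata))
print(sorted(features))
```

The `last_metadata` parameter plays the same role as the keyword passed to the last-metadata argument in the commands on this page: it marks where metadata rows end and feature rows begin.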

    II. Targeted feature file
    If using the targeted feature methodology, you will need to provide a text file listing the feature(s) of interest. Each feature should be on its own line and should be written as found in the input PCL file.
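
Because target names must be written as they appear in the input file, it can help to check them before running a selection. The following Python 3 sketch assumes exact string matching and uses hypothetical feature names and a hypothetical helper `check_targets`; it is a pre-flight check, not part of microPITA.

```python
# Hypothetical feature names, as they might appear in the first column of a PCL file.
pcl_features = {
    "Bacteria|Bacteroidetes|Bacteroides",
    "Bacteria|Firmicutes|Roseburia",
}

# Hypothetical contents of a targets file: one feature per line.
targets_text = "Bacteria|Bacteroidetes|Bacteroides\nBacteroides\n"


def check_targets(targets_text, pcl_features):
    """Return (found, missing) target names; blank lines are ignored."""
    targets = [line.strip() for line in targets_text.splitlines() if line.strip()]
    found = [t for t in targets if t in pcl_features]
    missing = [t for t in targets if t not in pcl_features]
    return found, missing


found, missing = check_targets(targets_text, pcl_features)
print("found:", found)
print("missing:", missing)
```

In this example the short name `Bacteroides` is reported as missing because the table stores the full lineage.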

    Basic unsupervised methods

    Please note, all calls to microPITA should work interchangeably with PCL or BIOM files. BIOM files do not require the --lastmeta or --id arguments.

    There are four unsupervised methods which can be performed:
    diverse (maximum diversity), extreme (most dissimilar), representative (representative dissimilarity) and features (targeted feature).

    The first three methods are performed as follows (selecting a default 10 samples):

    > python MicroPITA.py --lastmeta Label -m representative input/Test.pcl output.txt
    > python MicroPITA.py -m representative input/Test.biom output.txt

    > python MicroPITA.py --lastmeta Label -m diverse input/Test.pcl output.txt
    > python MicroPITA.py -m diverse input/Test.biom output.txt

    > python MicroPITA.py --lastmeta Label -m extreme input/Test.pcl output.txt
    > python MicroPITA.py -m extreme input/Test.biom output.txt

    Each of the previous commands is made up of the following pieces:
    1. python MicroPITA.py to call the microPITA script.
    2. --lastmeta which indicates the keyword (first column value) of the last row that contains metadata (PCL input only).
    3. -m which indicates the method to use in selection.
    4. input/Test.pcl or input/Test.biom which is the first positional argument indicating an input file.
    5. output.txt which is the second positional argument indicating where to write the output file.
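
When running several methods on the same input, the pieces listed above can be assembled programmatically. The Python 3 sketch below only builds the argument lists; `build_command` is a hypothetical helper, and the flags used (`--lastmeta`, `-m`) are the ones shown in the help output on this page.

```python
def build_command(method, input_path, output_path, last_metadata=None):
    """Assemble a microPITA call as an argument list (nothing is executed)."""
    command = ["python", "MicroPITA.py"]
    # --lastmeta is only needed for PCL input; BIOM input can omit it.
    if last_metadata is not None:
        command += ["--lastmeta", last_metadata]
    command += ["-m", method, input_path, output_path]
    return command


for method in ("representative", "diverse", "extreme"):
    command = build_command(
        method, "input/Test.pcl", method + "-output.txt", last_metadata="Label"
    )
    print(" ".join(command))
```

Each list could be passed to `subprocess.run` from the microPITA directory; the output file names here are illustrative only.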

    Selecting specific features has additional arguments to consider: --targets (required) and --feature_method (optional).

    > python MicroPITA.py --lastmeta Label -m features --targets input/TestFeatures.taxa input/Test.pcl output.txt
    > python MicroPITA.py -m features --targets input/TestFeatures.taxa input/Test.biom output.txt

    > python MicroPITA.py --lastmeta Label -m features --feature_method abundance --targets input/TestFeatures.taxa input/Test.pcl output.txt
    > python MicroPITA.py -m features --feature_method abundance --targets input/TestFeatures.taxa input/Test.biom output.txt

    These additional arguments are described as:
    1. --targets is the path to the file that lists the features (bugs or clades) of interest. Make sure they are written as they appear in your input file!
    2. --feature_method is the method of selection used and can be based on ranked abundance (“rank”) or abundance (“abundance”). The default value is rank.
    To differentiate the methods, rank tends to select samples in which the feature dominates the sample regardless of its abundance.
    Abundance tends to select samples in which the feature is most abundant, without a guarantee that the feature is the most abundant feature in the sample.
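
The difference between the two feature methods can be seen on a toy example. The Python 3 sketch below is an illustration of the idea only, under the assumption that "rank" orders samples by the target's rank within each sample and "abundance" orders them by the target's abundance; the data and the `select` helper are hypothetical, and microPITA's actual implementation may differ.

```python
# Hypothetical relative abundances for three samples.
samples = {
    "S1": {"Bacteroides": 0.40, "Prevotella": 0.60, "Roseburia": 0.00},
    "S2": {"Bacteroides": 0.35, "Prevotella": 0.33, "Roseburia": 0.32},
    "S3": {"Bacteroides": 0.10, "Prevotella": 0.50, "Roseburia": 0.40},
}


def rank_in_sample(profile, target):
    """Rank of the target within one sample (1 = most abundant feature)."""
    ordered = sorted(profile, key=profile.get, reverse=True)
    return ordered.index(target) + 1


def select(samples, target, method, n):
    """Pick n samples for a target feature by within-sample rank or by abundance."""
    if method == "rank":
        return sorted(samples, key=lambda s: rank_in_sample(samples[s], target))[:n]
    if method == "abundance":
        return sorted(samples, key=lambda s: samples[s][target], reverse=True)[:n]
    raise ValueError("method must be 'rank' or 'abundance'")


by_rank = select(samples, "Bacteroides", "rank", n=1)
by_abundance = select(samples, "Bacteroides", "abundance", n=1)
print("rank:", by_rank)
print("abundance:", by_abundance)
```

Here the rank criterion picks S2, where Bacteroides is the top feature despite a lower abundance, while the abundance criterion picks S1, where Bacteroides is more abundant but Prevotella dominates.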

    Basic supervised methods

    Two supervised methods are also available:
    distinct and discriminant

    These methods require an additional argument, --label, which is the first column keyword of the row used to classify samples for the supervised methods.
    These methods can be performed as follows:

    > python MicroPITA.py --lastmeta Label --label Label -m distinct input/Test.pcl output.txt
    > python MicroPITA.py --label Label -m distinct input/Test.biom output.txt

    > python MicroPITA.py --lastmeta Label --label Label -m discriminant input/Test.pcl output.txt
    > python MicroPITA.py --label Label -m discriminant input/Test.biom output.txt

    Custom alpha- and beta-diversities

    The default alpha diversity for the maximum diversity sampling method is inverse Simpson; the default beta-diversity for representative and most dissimilar
    selection is Bray-Curtis dissimilarity. There are several mechanisms that allow one to change this. You may:

    1. Choose from a selection of alpha-diversity metrics.
    Note: supplying an alpha-diversity metric affects the maximum diversity sampling method only. Please make sure to use a diversity metric where a larger number indicates higher diversity. If this is not the case, use the -f or --invertDiversity flag to invert the metric. The inversion is multiplicative (1/alpha-metric).

    > python MicroPITA.py --lastmeta Label -m diverse -a simpson input/Test.pcl output.txt
    > python MicroPITA.py -m diverse -a simpson input/Test.biom output.txt

    A case where inverting the metric is needed.

    > python MicroPITA.py --lastmeta Label -m diverse -a dominance -f input/Test.pcl output.txt
    > python MicroPITA.py -m diverse -a dominance -f input/Test.biom output.txt
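
To see why a metric such as dominance needs inverting, consider the small Python 3 sketch below. It assumes the usual definition of dominance as the sum of squared proportions; the function `dominance` and the two count vectors are written here for illustration and are not taken from microPITA or PyCogent.

```python
def dominance(counts):
    """Sum of squared proportions: close to 1 when one feature dominates."""
    total = float(sum(counts))
    return sum((c / total) ** 2 for c in counts)


even = [25, 25, 25, 25]   # hypothetical evenly distributed community
skewed = [97, 1, 1, 1]    # hypothetical community dominated by one feature

for name, counts in (("even", even), ("skewed", skewed)):
    d = dominance(counts)
    # The multiplicative inverse turns "smaller is more diverse" into
    # "larger is more diverse", which is what maximum diversity selection expects.
    print(name, "dominance:", round(d, 4), "inverted:", round(1.0 / d, 4))
```

The even community has the lower dominance but the higher inverted value, so ranking by the inverted metric places the more diverse sample first.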

    2. Choose from a selection of beta-diversity metrics.
    Note: supplying a beta-diversity metric affects both the representative and most dissimilar sampling methods. The metric as given is used for the representative method, while 1-beta-metric is used for the most dissimilar method.

    > python MicroPITA.py --lastmeta Label -m representative -b euclidean input/Test.pcl output.txt
    > python MicroPITA.py -m representative -b euclidean input/Test.biom output.txt

    > python MicroPITA.py --lastmeta Label -m extreme -b euclidean input/Test.pcl output.txt
    > python MicroPITA.py -m extreme -b euclidean input/Test.biom output.txt
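
For reference, the default Bray-Curtis dissimilarity and the 1-beta-metric transformation mentioned above can be sketched in a few lines of Python 3. The profiles and the `bray_curtis` function are hypothetical illustrations; how microPITA uses these values internally for selection is not shown here.

```python
from itertools import combinations


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for no shared features."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / float(denominator) if denominator else 0.0


# Hypothetical abundance profiles over the same three features.
profiles = {
    "S1": [10, 0, 5],
    "S2": [5, 0, 10],
    "S3": [0, 8, 0],
}

for left, right in combinations(sorted(profiles), 2):
    d = bray_curtis(profiles[left], profiles[right])
    # d is the metric as given; 1 - d is the transformed value.
    print(left, right, "d =", round(d, 3), "1 - d =", round(1 - d, 3))
```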

    Note on using UniFrac: both weighted and unweighted UniFrac are available for use. Make sure to supply the associated tree (-o, --tree) and environment
    (-i, --envr) files, and to select UniFrac with (-b, --beta).

    > python MicroPITA.py --lastmeta Label -m extreme -b unifrac_weighted -o input/Test.tree -i input/Test-env.txt input/Test.pcl output.txt
    > python MicroPITA.py -m extreme -b unifrac_weighted -o input/Test.tree -i input/Test-env.txt input/Test.biom output.txt
    > python MicroPITA.py --lastmeta Label -m extreme -b unifrac_unweighted -o input/Test.tree -i input/Test-env.txt input/Test.pcl output.txt
    > python MicroPITA.py -m extreme -b unifrac_unweighted -o input/Test.tree -i input/Test-env.txt input/Test.biom output.txt
    > python MicroPITA.py --lastmeta Label -m representative -b unifrac_weighted -o input/Test.tree -i input/Test-env.txt input/Test.pcl output.txt
    > python MicroPITA.py -m representative -b unifrac_weighted -o input/Test.tree -i input/Test-env.txt input/Test.biom output.txt
    > python MicroPITA.py --lastmeta Label -m representative -b unifrac_unweighted -o input/Test.tree -i input/Test-env.txt input/Test.pcl output.txt
    > python MicroPITA.py -m representative -b unifrac_unweighted -o input/Test.tree -i input/Test-env.txt input/Test.biom output.txt

    3. Supply your own custom alpha-diversity per sample as a metadata row in your PCL file.

    > python MicroPITA.py --lastmeta Label -m diverse -q alpha_custom input/Test.pcl output.txt
    > python MicroPITA.py -m diverse -q alpha_custom input/Test2.biom output.txt

    4. Supply your own custom beta-diversity as a matrix (provided in a separate file).

    > python MicroPITA.py --lastmeta Label -m representative -x input/Test_Matrix.txt input/Test.pcl output.txt
    > python MicroPITA.py -m representative -x input/Test_Matrix.txt input/Test.biom output.txt
    > python MicroPITA.py --lastmeta Label -m extreme -x input/Test_Matrix.txt input/Test.pcl output.txt
    > python MicroPITA.py -m extreme -x input/Test_Matrix.txt input/Test.biom output.txt

    Changing defaults

    Sample Selection:
    To change the number of selected samples for any method use the -n argument. This example selects 6 representative samples instead of the default 10.

    > python MicroPITA.py --lastmeta Label -m representative -n 6 input/Test.pcl output.txt
    > python MicroPITA.py -m representative -n 6 input/Test.biom output.txt

    When using a supervised method, this indicates how many samples will be selected per class of sample. For example, if you are performing supervised selection of 6 samples (-n 6) on a dataset with 2 classes (values) in its label row, you will get 6 x 2 = 12 samples. If a class does not have 6 samples in it, you will get the maximum possible for that class. In a scenario where you are selecting 6 samples (-n 6) and have two classes, but one class has only 3 samples, you will get 6 + 3 = 9 selected samples.
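
The arithmetic above can be written as a short Python 3 sketch: each class contributes the requested number of samples, capped at the class size. The helper `expected_selection_count` and the label lists are hypothetical and only reproduce the two cases described in the text.

```python
from collections import Counter


def expected_selection_count(labels, n):
    """Total samples selected when up to n are taken from each class."""
    class_sizes = Counter(labels)
    return sum(min(n, size) for size in class_sizes.values())


balanced = ["case"] * 8 + ["control"] * 8        # both classes have at least 6 samples
small_class = ["case"] * 8 + ["control"] * 3     # one class has only 3 samples

print(expected_selection_count(balanced, 6))     # 6 x 2
print(expected_selection_count(small_class, 6))  # 6 + 3
```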

    Stratification:
    To stratify any method, use the --stratify argument, which is the first column keyword of the metadata row used to stratify samples before selection occurs (selection occurs independently within each stratum). The first example stratifies representative selection by the “Label” row; the second stratifies distinct selection by “StratifyLabel”.

    > python MicroPITA.py --lastmeta Label --stratify Label -m representative input/Test.pcl output.txt
    > python MicroPITA.py --stratify Label -m representative input/Test.biom output.txt

    > python MicroPITA.py --lastmeta Label --label Label --stratify StratifyLabel -m distinct input/Test.pcl output.txt
    > python MicroPITA.py --label Label --stratify StratifyLabel -m distinct input/Test2.biom output.txt
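
Conceptually, stratification groups samples by a metadata value and then runs the chosen selection separately inside each group. The Python 3 sketch below shows that pattern with a placeholder selector; `stratified_select`, `first_n`, and the sample data are hypothetical and do not reproduce microPITA's selection methods.

```python
from collections import defaultdict


def stratified_select(sample_ids, strata, select, n):
    """Apply `select` independently within each stratum."""
    groups = defaultdict(list)
    for sample, stratum in zip(sample_ids, strata):
        groups[stratum].append(sample)
    return {stratum: select(members, n) for stratum, members in groups.items()}


def first_n(members, n):
    """Placeholder selector: the first n sample IDs in sorted order."""
    return sorted(members)[:n]


sample_ids = ["S1", "S2", "S3", "S4", "S5"]
strata = ["gut", "oral", "gut", "gut", "oral"]  # hypothetical stratification row

selected = stratified_select(sample_ids, strata, first_n, n=2)
print(selected)
```

Any selection function with the same `(members, n)` signature could replace `first_n` in this sketch.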

    Changing PCL file defaults:
    Some PCL files have feature metadata. These are columns of data that describe bug features (rows) in the file. An example could be a taxonomic clade for each bug feature. If this type of data exists, please use -w or --lastFeatureMetadata to indicate the last column of feature metadata before the first sample column. For an example, please see PCL-Description.txt in docs. This only applies to PCL files.

    > python MicroPITA.py --lastmeta Label -m representative -w taxonomy_5 input/FeatureMetadata.pcl output.txt

    MicroPITA assumes the first row of the input file contains the sample IDs. If it does not, you may use --id to indicate the row.
    --id expects the entry in the first column of your input file that identifies the row used as sample IDs. See the input file and the following command as an example.
    This only applies to PCL files.

    > python MicroPITA.py --id Sample --lastmeta Label -m representative input/Test.pcl output.txt

    MicroPITA assumes the input file is TAB delimited, and we strongly recommend you use this convention. If not, you can use --delim to change the delimiter used to read in the file.
    Here is an example of reading the comma-delimited file micropita/input/CommaDelim.pcl.
    This only applies to PCL files.

    > python MicroPITA.py --delim , --lastmeta Label -m representative input/CommaDelim.pcl output.txt

    MicroPITA assumes that feature names containing a consensus lineage or full taxonomic hierarchy are delimited with a pipe (“|”). We strongly recommend you use this default. The delimiter of the feature name can be changed using --featdelim. Here is an example of reading in a file with periods as the delimiter.
    This only applies to PCL files.

    > python MicroPITA.py --featdelim . --lastmeta Label -m representative input/PeriodDelim.pcl output.txt

    Dependencies

    Please note that the following dependencies need to be installed for microPITA to run.
    1. Python 2.x http://www.python.org/download/
    2. blist http://pypi.python.org/pypi/blist/
    3. NumPy http://numpy.scipy.org/
    4. SciPy http://www.scipy.org/
    5. PyCogent http://pycogent.sourceforge.net/install.html
    6. mlpy http://mlpy.sourceforge.net/
    7. mpi4py http://mpi4py.scipy.org/
    8. BIOM support http://biom-format.org/

    This covers how to use microPITA. Thank you for using this software and good luck with all your endeavors!

    All command line options and parameters

    $ python MicroPITA.py --help
    usage: MicroPITA.py [-h] [-n samples] [-m method] [-a AlphaDiversity]
                        [-b BetaDiversity] [-q AlphaDiversityMetadata]
                        [-x BetaDiversityMatrix] [-o PhylogeneticTree]
                        [-i EnvironmentFile] [-f] [-d sample_id] [-l metadata_id]
                        [-r targeting_method] [-t feature_file]
                        [-w Last_Feature_Metadata] [-e supervised_id]
                        [-s stratify_id] [-j column_delimiter]
                        [-k taxonomy_delimiter] [-v log_level] [-c output_qc]
                        [-g output_log] [-u output_scaled] [-p output_labels]
                        input.pcl/biome output.txt
    
    Selects samples from abundance tables based on various selection schemes.
    
    positional arguments:
      input.pcl/biome       Input file as either a PCL or Biome file.
      output.txt            The generated output data file.
    
    optional arguments:
      -h, --help            show this help message and exit
    
    Common:
      Commonly modified options
    
      -n samples, --num samples
                            The number of samples to select with unsupervised
                            methodology. (An integer greater than 0.).
      -m method, --method method
                            Select techniques listed one after another.
    
    Custom:
      Selecting and inputing custom metrics
    
      -a AlphaDiversity, --alpha AlphaDiversity
                            A key word for any PyCogent supplied alpha diveristy
                            metric (Richness, evenness, or diversity). Please
                            supply an unnormalized (counts) abundance table for
                            these metrics. Metrics include heip_e fisher_alpha
                            equitability menhinick simpson robbins
                            reciprocal_simpson chao1 simpson_e margalef
                            berger_parker_d observed_species brillouin_d
                            mcintosh_d mcintosh_e ACE strong dominance shannon
                            michaelis_menten_fit.
      -b BetaDiversity, --beta BetaDiversity
                            A key word for any PyCogent supplied beta diversity
                            metric. Metrics include chebyshev canberra sqeuclidean
                            braycurtis euclidean cosine hamming correlation
                            cityblock unifrac_unweighted unifrac_weighted.
      -q AlphaDiversityMetadata, --alphameta AlphaDiversityMetadata
                            Metric in the pcl file which has custom alpha
                            diversity measurements to use with the highest
                            diversity sampling criteria. Should be a number
                            between 0.0 and 1.0 with 1.0 meaning most diverse.
      -x BetaDiversityMatrix, --betamatrix BetaDiversityMatrix
                            Precalculated beta-diversity matrix to be used in the
                            representative sampling criteria. Should be a number
                            between 0.0 and 1.0 with 1.0 meaning most dissimilar.
      -o PhylogeneticTree, --tree PhylogeneticTree
                            Tree for phylogenetic when selecting custom beta-
                            diversities in the representative sampling criteria.
      -i EnvironmentFile, --envr EnvironmentFile
                            File describing the smaple environments; for use with
                            Unifrac distance metrics.
      -f, --invertDiversity
                            When using this flag, the diversity will be inverted
                            (multiplicative inverse) before ranking in the highest
                            diversity method. Recommended to use with dominance,
                            menhinick, reciprocal_simpson, berger_parker_d,
                            mcintosh_e, simpson_e, strong and any metric where 0
                            indicates most diverse.
    
    Miscellaneous:
      Row/column identifiers and feature targeting options
    
      -d sample_id, --id sample_id
                            The row in the abundance file that is the sample
                            name/id row. Should be the sample name/Id in first
                            column of the row.
      -l metadata_id, --lastmeta metadata_id
                            The row in the abundance file that is the sample
                            name/id row. Should be the metadata name/Id in first
                            column of the metadta row.
      -r targeting_method, --feature_method targeting_method
                            The ranking method used to select targeted features.
      -t feature_file, --targets feature_file
                            A file containing taxa/OTUs/clades to be used in
                            targeted feature sampling criteria.
      -w Last_Feature_Metadata, --lastFeatureMetadata Last_Feature_Metadata
                            The last metadata describing a (bug) feature (not
                            sample). Not all studies have feature metadata, if so
                            this can be ignored and not used. See doc for PCL-
                            Description.txt
    
    Data labeling:
      Metadata IDs for strata and supervised label values
    
      -e supervised_id, --label supervised_id
                            The name of the metadata on which to perform
                            supervised methods
      -s stratify_id, --stratify stratify_id
                            The metatdata to stratify unsupervised analysis.
    
    File formatting:
      Rarely modified file formatting options
    
      -j column_delimiter, --delim column_delimiter
                            The delimiter for the abundance table (default = TAB)
      -k taxonomy_delimiter, --featdelim taxonomy_delimiter
                            The delimiter for a feature name if it contains a
                            consensus sequence.
    
    Debugging:
      Debugging options - modify at your own risk!
    
      -v log_level, --logging log_level
                            Logging level which will be logged to a .log file with
                            the same name as the strOutFile (but with a .log
                            extension). Valid values are DEBUG, INFO, WARNING,
                            ERROR, or CRITICAL.
      -c output_qc, --checked output_qc
                            Before analysis abundance files are checked and a new
                            file results which analysis is perfromed on. The name
                            of the checked file can be specified of the default
                            will will be used (appending a -Checked to the end of
                            the file name).
      -g output_log, --logfile output_log
                            File path to save the logging file.
      -u output_scaled, --supinputfile output_scaled
                            The file path for the input file for supervised
                            methods.
      -p output_labels, --suppredfile output_labels
                            The file path for the predict file.
    

    Latest Versions

    1.1, 10-8-2013


    * BIOM format is now supported

    This revision is from 2013-10-8