MetaPhlAn: Metagenomic Phylogenetic Analysis

MetaPhlAn: Metagenomic Phylogenetic Analysis

MetaPhlAn is a computational tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data. MetaPhlAn relies on unique clade-specific marker genes identified from 3,000 reference genomes, allowing:

  • up to 25,000 reads-per-second (on one CPU) analysis speed (orders of magnitude faster compared to existing methods);
  • unambiguous taxonomic assignments as the MetaPhlAn markers are clade-specific;
  • accurate estimation of organismal relative abundance (in terms of number of cells rather than fraction of reads);
  • species-level resolution for bacterial and archaeal organisms;
  • extensive validation of the profiling accuracy on several synthetic datasets and on thousands of real metagenomes.

Please refer to the MetaPhlAn paper for additional information, validations, and examples. Also the main paper of the Human Microbiome Project uses MetaPhlAn (version 1.1) for species-level metagenomic profiling.

The MetaPhlAn practical tutorial can also be a useful starting point to familiarize with the this and related software for metagenomic data analysis.


News (version 1.7.1 through 1.7.8)

new.jpg
MetaPhlAn can now generate directly biom output using the --biom_output_file. ("--biom_output_file analysis_file.biom")

new.jpg
MetaPhlAn output tables can now be converted to BIOM format for downstream additional analyses of the taxonomic abundance profiles (thanks to Sami Pietilä for the script implementation)

new.jpg
MetaPhlAn now allows to export marker-based presence/absence and abundance ("-t marker_pres_table" and "-t marker_ab_table")

new.jpg
The script metaphlan_hclust_heatmap.py in https://bitbucket.org/nsegata/metaphlan/src/ecf86f2c2a19d0f9c129963e2fb422e3ec33f965/plotting_scripts/ allows you to generate hierarchical clustering and heatmap visualization of multiple MetaPhlAn profiles.

new.jpg
MetaPhlAn now supports imput metagenomes from the standard input (when Using BowTie2)

new.jpg
Multiple MetaPhlAn output obtained on distinct samples can now be easily merged into a unique "microbial clades vs samples" abundance table. Many thanks to Tim Tickle for implementing the merge_metaphlan_tables.py script included in the current (version 1.7.2) repository.

Additionally, the MetaPhlAn output can now be visualized with Krona and with other software supporting the PhyloXML and BIOM format. This is possible using the conversion scripts included now in the MetaPhlAn package, kindly provided by Daniel Brami, James Taylor, and Sami Pietilä. A Krona visualization of the LC1 synthetic metagenome (see below) is available here.

new.jpg
The last version of MetaPhlAn (1.7) supports the (multi)fastq input format (when using BowTie2). This will allow more sensitive and precise mapping and avoids the need of converting the fastq file to fasta. With this modification, you can basically apply MetaPhlAn directly to the first sequencing output as low quality reads will not succesfully mapped to any marker and the same for non-microbial reads without affecting the abundances of the predicted microbial clades. No changes in the command line are required for using fastq files, but a non-local hit policy for BowTie2 (i.e. '--bt2_ps very-sensitive') is recommended for avoiding overly-sensitive hits when using non-preprocessed fastq files).


Obtaining MetaPhlAn (version 1.7.8)

The MetaPhlAn classifier can be obtained downloading the main MetaPhlAn script (metaphlan.py) here (2.5MB) together with the Blast database of the MetaPhlAn markers (3 files, 95MBs) or/and the BowTie2 database of the MetaPhlAn markers (6 files, 500MBs).

You can also obtain the whole package using hg:

$ hg clone https://bitbucket.org/nsegata/metaphlan

or downloading the compressed archive in zip, gz, or bz2 format.


Updates, tutorials and mailing list

Software updates are regularly posted on the bitbucket repository. You are more than welcome to use the Issue Tracking system on Bitbucket (or email us) to provide feedback, report bugs, and suggest/request new features.

If you would like to be notified about new version, new features, or any other news related to MetaPhlAn please join our mailing list:

MetaPhlAn google group

We also started a FAQ page. For having new questions addressed in the FAQ page, and for any comments or questions feel free to contact us.

A tutorial guiding the user into the application of MetaPhlAn and into some downstream additional analyses is available here.


Contacts and citation

MetaPhlAn has been successfully applied on hundreds of metagenomes for more than 10 billion reads as described in the following paper that we kindly ask to cite if you find MetaPhlAn useful in your research:

"Metagenomic microbial community profiling using unique clade-specific marker genes"
Nicola Segata, Levi Waldron, Annalisa Ballarini, Vagheesh Narasimhan, Olivier Jousson, Curtis Huttenhower.
Nature Methods, 8, 811–814, 2012

If you found the MetaPhlAn practical tutorial useful in your research, please cite our paper describing the pipeline:

"Computational meta’omics for microbial community studies"
Nicola Segata, Daniela Boernigen, Timothy L Tickle, Xochitl C Morgan1, Wendy S Garrett, Curtis Huttenhower.

Molecular Systems Biology, 9, article 666, pp1-15, 2013


Here is an infographic of the application of the Human Microbiome Project results obtained applying MetaPhlAn on the 690 shotgun sequencing samples. Email me for a high-resolution version. This infographic also appears in a slightly modified version as the main illustration of a New York Times article by Carl Zimmer available here (NY Times subscription needed) and here (NY Times copyrighted version), and as Figure 2 in our Trends in Genomics review paper. The image has been produced with GraPhlAn (input data is available in the software package).

HMP



Galaxy interface for MetaPhlAn

From version 1.6, MetaPhlAn is also available inside Galaxy at http://huttenhower.sph.harvard.edu/galaxy

Please note that the Galaxy module is providing fewer options than the command
line version because it is meant to be as easy-to-use as possible and, for
computational reasons, only the BowTie2 internal reads-to-markers matching
engine is supported.

If you like, feel free to add MetaPhlAn into your local Galaxy installation. The source file for the galaxy module is here together with detailed information about the operation. The module is also available in the Galaxy Tool Shed

Many thanks to George Weingart, Dannon Baker, and James Taylor for integrating MetaPhlAn into Galaxy!


Supplementary material

We report here the supplementary material for the MetaPhlAn article, which consists of the following files:

  • The MetaPhlAn 1.0 unique marker database
  • The MetaPhlAn 1.0 profiling of the 51 vaginal metagenomic samples from the Human Microbiome Project. All taxa with an abundance of 0.5% in at least one sample are reported.
  • The MetaPhlAn 1.0 profiling of the 224 gut metagenomic samples from asymptomatic subjects seuqenced by the HMP and MetaHIT projects. All taxa with an abundance of 0.5% in at least one sample are reported.

MetaPhlAn profiling of large metagenomic cohorts

We profiled nearly 1,000 metagenomes so far from diverse environments including several human bodysites, mouse guts, several marine sites, soil samples, and food-associated communities.

The 690 Human Microbiome Project metagenomic samples from 5 major human body areas are also available as HMP final results and a graphical summary is reported in the heatmap below.

We also profiled the 124 MetaHIT gut metagenomes and we provide the obtained table of abundances.

If you need to profile a large metagenomic cohort, we are happy to share our experience in doing that.

The following heatmap represents the species-level taxonomic composition of the 690 Human Microbiome Project shotgun sequencing samples. For space reasons, only the 64 most abundant species are reported.
HMP


Synthetic metagenomes for validation

We used 10 synthetic metagenomes to evaluate MetaPhlAn (in addition to other real-word benchmarking metagenomes). We generated two high complexity (100 different organisms) evenly distributed (all organisms are at 1% relative abundance) called HC1 and HC2 with 1,000,000 noisy short reads each. Eight low-complecity (25 organisms) non-evenly distributed (log-normal model) metagenomes comprising 250,000 reads each have also been developed and are available for doownload: LC1, LC2, LC3, LC4, LC5, LC6, LC7, LC8.


Common commands

Profiling a metagenome from raw reads using Blast (requires BLAST 2.2.25+ and the blast marker DB provided with MetaPhlAn):

$ metaphlan.py metagenome.fasta --blastdb blastdb/mpa

You can save a lot of computational time if you perform the blasting using BowTie2 instead of Blast (requires BowTie2 in the system path with execution and read permissions, Perl installed, and the BowTie2 marker DB provided with MetaPhlAn):

$ metaphlan.py metagenome.fasta --bowtie2db bowtie2db/mpa

When possible, it is recommended to use fastq files that will increase the mapping accuracy with BowTie2. No changes in the command line are required for using fastq files, but a non-local hit policy for BowTie2 (i.e. '--bt2_ps sensitive') is recommended for avoiding overly-sensitive hits:

$ metaphlan.py metagenome.fastq --bowtie2db bowtie2db/mpa --bt2_ps sensitive

You can take advantage of multiple CPUs and you can save the blast output for re-running MetaPhlAn extremely quickly:

$ metaphlan.py metagenome.fasta --blastdb blastdb/mpa --nproc 5 --blastout metagenome.outfmt6.txt

and the same for BowTie2:

$ metaphlan.py metagenome.fastq --bowtie2db bowtie2db/mpa --nproc 5 --bowtie2out metagenome.bt2out.txt

If you already blasted your metagenome against the marker DB (using MetaPhlAn or blastn/BowTie2 alone) you can obtain the results in few seconds:

$ metaphlan.py --input_type blastout metagenome.outfmt6.txt

(notice that 'blastout' define a file format independent from NCBI Blast and common to BowTie2 as well)

When using BowTie2 the metagenome can also be passed from the standard input but it is necessary to specify the input format explicitly:

$ tar xjz metagenome.tar.bz2 --to-stdout | metaphlan.py --input_type multifastq --blastdb blastdb/mpa

Also the pre-computed blast/BowTie2 output can be provided with a pipe (again specifying the input type):

$ metaphlan.py --input_type blastout < metagenome.outfmt6.txt > profiling_output.txt

You can also set advanced options for the BowTie2 step selecting the preset option among 'sensitive','very-sensitive','sensitive-local','very-sensitive-local' (valid for metagenome as input only):

$ metaphlan.py --bt2_ps very-sensitive-local metagenome.fasta

Gor for Blast the main configurable option is the threshold on the evalue (default 1e-5, we strongly recommend not lowering it too much):

$ metaphlan.py --evalue 1e-7 < metagenome.outfmt6.txt > profiling_output.txt

If you suspect that the metagenome contains unknown clades, you may obtain more accurare result with a more sensible blast search lowering the blast word_size (the blasting will be slower):

$ metaphlan.py metagenome.fna --word_size 12 > profiling_output.txt

All command line options and parameters

$ metaphlan.py -h

usage: metaphlan.py [-h] [-v] [-t ANALYSIS TYPE] [--tax_lev TAXONOMIC_LEVEL]
                    [--nreads NUMBER_OF_READS] [--pres_th PRESENCE_THRESHOLD]
                    [--blastdb METAPHLAN_BLAST_DB]
                    [--bowtie2db METAPHLAN_BOWTIE2_DB] [--evalue]
                    [--word_size] [--bt2_ps BowTie2 presets] [--tmp_dir]
                    [--min_cu_len]
                    [--input_type {automatic,multifasta,multifastq,blastout}]
                    [--stat_q] [--blastn_exe BLASTN_EXE]
                    [--bowtie2_exe BOWTIE2_EXE] [--blastout FILE_NAME]
                    [--bowtie2out FILE_NAME] [--no_map] [-o output file]
                    [--nproc N]
                    [INPUT_FILE] [OUTPUT_FILE]

DESCRIPTION
 MetaPhlAn version 1.7.7 (28 February 2013): METAgenomic PHyLogenetic ANalysis for
 taxonomic classification of metagenomic reads.

AUTHORS: Nicola Segata (nsegata@hsph.harvard.edu)

COMMON COMMANDS

* Profiling a metagenome from raw reads using Blast (requires BLAST 2.2.25+ and the
  blast marker DB provided with MetaPhlAn):
metaphlan.py metagenome.fasta --blastdb blastdb/mpa

* You can save a lot of computational time if you perform the blasting using BowTie2 
  instead of Blast (requires BowTie2 in the system path with execution and read 
  permissions, Perl installed, and the BowTie2 marker DB provided with MetaPhlAn):
metaphlan.py metagenome.fasta --bowtie2db bowtie2db/mpa

* When possible, it is recommended to use fastq files that will increase the mapping 
  accuracy with BowTie2. No changes in the command line are required for using fastq 
  files, but a non-local hit policy for BowTie2 (i.e. '--bt2_ps sensitive') is 
  recommended for avoiding overly-sensitive hits:
metaphlan.py metagenome.fastq --bowtie2db bowtie2db/mpa --bt2_ps sensitive

* you can take advantage of multiple CPUs and you can save the blast output
   for re-running MetaPhlAn extremely quickly:
metaphlan.py metagenome.fasta --blastdb blastdb/mpa --nproc 5 --blastout metagenome.outfmt6.txt
  and the same for BowTie2:
metaphlan.py metagenome.fastq --bowtie2db bowtie2db/mpa --nproc 5 --bowtie2out metagenome.bt2out.txt

* if you already blasted your metagenome against the marker DB (using MetaPhlAn
   or blastn/BowTie2 alone) you can obtain the results in few seconds:
metaphlan.py --input_type blastout metagenome.outfmt6.txt
  (notice that 'blastout' define a file format independent from NCBI Blast and 
  common to BowTie2 as well)

* When using BowTie2 the metagenome can also be passed from the standard input but 
  it is necessary to specify the input format explicitly:
tar xjz metagenome.tar.bz2 --to-stdout | metaphlan.py --input_type multifastq --blastdb blastdb/mpa

* Also the pre-computed blast/BowTie2 output can be provided with a pipe (again 
  specifying the input type): 
metaphlan.py --input_type blastout < metagenome.outfmt6.txt > profiling_output.txt

* you can also set advanced options for the BowTie2 step selecting the preset option 
  among 'sensitive','very-sensitive','sensitive-local','very-sensitive-local' 
  (valid for metagenome as input only):
metaphlan.py --bt2_ps very-sensitive-local metagenome.fasta

* for for Blast the main configurable option is the threshold on the evalue
   (default 1e-5, we strongly recommend not lowering it too much):
metaphlan.py --evalue 1e-7 < metagenome.outfmt6.txt > profiling_output.txt

* if you suspect that the metagenome contains unknown clades, you may obtain
   more accurare result with a more sensible blast search lowering the blast
   word_size (the blasting will be slower):
metaphlan.py metagenome.fna --word_size 12 > profiling_output.txt

positional arguments:
  INPUT_FILE            the input file can be:
                        * a multi-fasta file containing metagenomic reads
                        OR
                        * a NCBI BLAST output file (-outfmt 6 format) of the metagenome against the MetaPhlAn database. 
                        OR
                        * a BowTie2 output file of the metagenome generated by a previous MetaPhlAn run 
                        The software will recognize the format automatically.
                        If the input file is missing, the script assumes that the input is provided using the standard 
                        input, and the input format has to be specified with --input_type
  OUTPUT_FILE           the tab-separated output file of the predicted taxon relative abundances 
                        [stdout if not present]

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         Prints the current MetaPhlAn version and exit
  -t ANALYSIS TYPE      Type of analysis to perform: 
                         * rel_ab: profiling a metagenomes in terms of relative abundances
                         * reads_map: mapping from reads to clades (only reads hitting a marker)
                         * clade_profiles: normalized marker counts for clades with at least a non-null marker
                         * marker_ab_table: normalized marker counts (only when > 0.0 and normalized by metagenome size if --nreads is specified)
                         * marker_pres_table: list of markers present in the sample (threshold at 1.0 if not differently specified with --pres_th
                        [default 'rel_ab']
  --tax_lev TAXONOMIC_LEVEL
                        The taxonomic level for the relative abundance output:
                        'a' : all taxonomic levels
                        'k' : kingdoms (Bacteria and Archaea) only
                        'p' : phyla only
                        'c' : classes only
                        'o' : orders only
                        'f' : families only
                        'g' : genera only
                        's' : species only
                        [default 'a']
  --nreads NUMBER_OF_READS
                        The total number of reads in the original metagenome. It is used only when 
                        -t marker_table is specified for normalizing the length-normalized counts 
                        with the metagenome size as well. No normalization applied if --nreads is not 
                        specified
  --pres_th PRESENCE_THRESHOLD
                        Threshold for calling a marker present by the -t marker_pres_table option
  --blastdb METAPHLAN_BLAST_DB
                        The blast database file of the MetaPhlAn database 
  --bowtie2db METAPHLAN_BOWTIE2_DB
                        The BowTie2 database file of the MetaPhlAn database 
  --evalue              evalue threshold for the blasting
                        [default 1e-6]
  --word_size           word_size value for the blasting
                        [default NCBI BlastN default]
  --bt2_ps BowTie2 presets
                        presets options for BowTie2 (applied only when a multifasta file is provided)
                        The choices enabled in MetaPhlAn are:
                         * sensitive
                         * very-sensitive
                         * sensitive-local
                         * very-sensitive-local
                        [default very-sensitive]
  --tmp_dir             the folder used to store temporary files 
                        [default is the OS dependent tmp dir]
  --min_cu_len          minimum total nucleotide length for the markers in a clade for
                        estimating the abundance without considering sub-clade abundances
                        [default 10000]
  --input_type {automatic,multifasta,multifastq,blastout}
                        set wheter the input is the multifasta file of metagenomic reads or 
                        the blast output (outfmt 6 format) of the reads against the MetaPhlAn db.
                        [default 'automatic', i.e. the script will try to guess the input format]
  --stat_q              Quantile value for the robust average
                        [default 0.1]
  --blastn_exe BLASTN_EXE
                        Full path and name of the blastn executable. This option allows 
                        MetaPhlAn to reach the executable even when it is not in the system 
                        PATH or the system PATH is unreachable
  --bowtie2_exe BOWTIE2_EXE
                        Full path and name of the BowTie2 executable. This option allows 
                        MetaPhlAn to reach the executable even when it is not in the system 
                        PATH or the system PATH is unreachable
  --blastout FILE_NAME  The file for saving the output of the blasting (in outfmt 6 format)
  --bowtie2out FILE_NAME
                        The file for saving the output of BowTie2
  --no_map              Avoid storing the --blastout (or --bowtie2out) map file
  -o output file, --output_file output file
                        The output file (if not specified as positional argument)
  --nproc N             The number of CPUs to use for parallelizing the blasting
                        [default 1, i.e. no parallelism]

  --biom_output_file    If requesting biom file output: The name of the output file in biom format

  --metadata_delimiter_char   
                        If requesting biom file output: Delimiter for bug metadata: - defaults to pipe
                        e.g. the pipe in k__Bacteria|p__Proteobacteria

Acknowledgements

The authors of MetaPhlAn would like to thank Frank Stewart and Ed DeLong for their helpful input during this study, Dirk Gevers, Katherine Huang, and Sean Sykes for feedback on the methodology, Joshua Reyes for his help with validation and testing, George Weingart, Dannon Baker, and James Taylor for their work for integrating MetaPhlAn into Galaxy, Tim Tickle, Daniel Brami, and Sami Pietilä for providing conversion scripts.


Change log

Changes in Version 1.7.8 (27 January 2014)

New supporting feature
- Generates directly biom output if requested, using the --biom_output_file parameter 

Changes in Version 1.7.7 (28 February 2013)

New supporting feature
- the script conversion_scripts/metaphlan2biom.py in conversion_scripts/ kindly implemented by Sami Pietila allows you to export MetaPhlAn output tables (merged with utils/merge_metaphlan_tables.py) into the BIOM format. The BIOM format is the standard format for several post-processing programs and analyses such as Qiime.

Changes in Version 1.7.6 (27 January 2013)

Bug fix
- now supporting --input_type bowtie2out that need to be specified as --input_type blastout before

Changes in Version 1.7.5 (14 January 2013)

New supporting features:
- "-t marker_pres_table" outputs the list of markers with non-null abundance (i.e. present in the sample). The threshold for presence/absence call is specified via "--pres_th"
- "-t marker_ab_table" outputs the normalize marker counts (multiplied by 1,000). The value is normalized by the size of the metagenome (to compare samples with different sequencing depth) if --nreads is specified

Changes in Version 1.7.4 (31 December 2012)

New supporting feature:
- the script metaphlan_hclust_heatmap.py in plotting_scripts/ allows you to generate hierarchical clustering and heatmap visualization of multiple MetaPhlAn profiles. See plotting_scripts/readme.txt for details

Changes in Version 1.7.3 (10 December 2012)

New feature:
- MetaPhlAn now accepts metagenomes from the standard input (when using BowTie2). When passing the input from the standard input, it is now required to specify the input type ('multifasta', 'multifastq', or 'blastout'). When using a file as input, the input type is still guessed by MetaPhlAn automatically.

Changes in Version 1.7.2 (23 October 2012)

New supporting features
- multiple MetaPhlAn output obtained on distinct samples can now be easily merged into a unique "microbial clades vs samples" abundance table using the merge_metaphlan_tables.py script in the "utils" folder (thanks to Tim Tickle)
- the MetaPhlAn output can now be visualized with Krona and with other software supporting the PhyloXML format, using the scritps in the "conversion_scripts" folder (thanks to Daniel Brami and James Taylor)

Changes in Version 1.7.1 (15 October 2012)

Bug Fix
- two aspecific genes removed from the database (thanks to Nick Loman)

Changes in version 1.7.0 (26 July 2012)

New features
- added support for fastq files. Option available for analysis with BowTie2 only as blastn does not support fastq format. When possible, it is recommended to use fastq rather than fasta because BowTie2 will take into account the single nucleotide quality score contained in fastq format in the alignment evaluation. MetaPhlAn will automatically detect the fastq format so no changes in the command line options are required when using fastq files.

Changes in version 1.6.0 (28 May 2012)

New features:
- added a Galaxy module that integrates MetaPhlAn into the galaxy environment. Source code and instructions for the module are provided, and the module currently runs at http://huttenhower.sph.harvard.edu/galaxy
- added --blastn_exe and --bowtie2_exe command line options for specifying the absolute path of the blastn and bowtie2 executables. This is useful when MetaPhlAn is used in a system without root priviledges and thus blastn and and bowtie2 have to be installed locally and not included into the system path.
- added the --no_map command line option for avoiding the blastn or bowtie2 output file to be saved on disk

Changes in version 1.5.2 (13 April 2012)

Bug fix:
- division by zero bug fixed: the bug was occurring when no hits were detected for any of the reads

Changes in version 1.5.1 (5 April 2012)

New options:
- option -o added for specifying the output file with a non-positional command line argument

Changes in version 1.5.0 (April 2012)

New Features:
- added support for BowTie2 in addition to Blast. Bowtie2 runs at 10,000 reads-per-second (20,000 with 'sensitive' instead of 'very-sensitive-local') with accuracy performances at least comparable to blastn
-- '--bowtie2out' and '--bowtie2db' are the BowTie2 options corresponding to the '--blastout' and '--blastdb' blast options
-- support for the BowTie2 metaphlan output which is is simply a two-column file of reads/marker assignements. If called outside MetaPhlAn, BowTie2 should be called as
   bowtie2 --quiet --sam-no-hd --sam-no-sq --very-sensitive-local --local -x bowtie2db/mpa -f -U metagenome.fna | cut -f 1,3 | grep -P -v "\*$"
-- the BowTie2 MetaPhlAn database in in bowtie2db, is called mpa, and consists of 6 files
-- four preset BowTie2 options are available (--bt2_ps option): --very-sensitive, --very-sensitive, --sensitive-local, --very-sensitive-local
- help message 'metaphlan.py -h' expanded
- '--version' option added

Bug fixes:
- Fixed the error message for missing blastdb file (thanks to Joshua Reyes)

Changes in version 1.1.0

- no need to provide the MetaPhlAn data file anymore (thus --lib_dir has been removed)
- a robust average is introduced to more accurately estimate taxonomic relative abundances and dramatically lower false positives (--stat_q specifies the quantile for the truncated mean)
- the "word_size" option for blastn has been added in order to handle more accurately metagenomes without closely related reference genomes
- a new analysis type (-t clade_profiles) has been introduced in order to estimate normalized read counts for markers in each clade
- when using MetaPhlAn on blasting outputs, the input file can be provided using a pipe
- a more comprehensive help message has been added (see metaphlan.py -h)
- the code has been re-factored and made more robust for future developments
- a slightly modified blastdb has been developed. As a consequence, the database (both multifasta and blastdb) of version 1.0 is not compatible with version 1.1

Updated