MetaPhlAn: Metagenomic Phylogenetic Analysis
MetaPhlAn is a computational tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data. MetaPhlAn relies on unique clade-specific marker genes identified from 3,000 reference genomes, allowing:
- up to 25,000 reads-per-second (on one CPU) analysis speed (orders of magnitude faster compared to existing methods);
- unambiguous taxonomic assignments as the MetaPhlAn markers are clade-specific;
- accurate estimation of organismal relative abundance (in terms of number of cells rather than fraction of reads);
- species-level resolution for bacterial and archaeal organisms;
- extensive validation of the profiling accuracy on several synthetic datasets and on thousands of real metagenomes.
Please refer to the MetaPhlAn paper for additional information, validations, and examples. Also the main paper of the Human Microbiome Project uses MetaPhlAn (version 1.1) for species-level metagenomic profiling.
The MetaPhlAn practical tutorial can also be a useful starting point to familiarize with the this and related software for metagenomic data analysis.
News (version 1.7.1 through 1.7.7)

MetaPhlAn output tables can now be converted to BIOM format for downstream additional analyses of the taxonomic abundance profiles (thanks to Sami Pietilä for the script implementation)

MetaPhlAn now allows to export marker-based presence/absence and abundance ("-t marker_pres_table" and "-t marker_ab_table")

The script metaphlan_hclust_heatmap.py in https://bitbucket.org/nsegata/metaphlan/src/ecf86f2c2a19d0f9c129963e2fb422e3ec33f965/plotting_scripts/ allows you to generate hierarchical clustering and heatmap visualization of multiple MetaPhlAn profiles.

MetaPhlAn now supports imput metagenomes from the standard input (when Using BowTie2)

Multiple MetaPhlAn output obtained on distinct samples can now be easily merged into a unique "microbial clades vs samples" abundance table. Many thanks to Tim Tickle for implementing the merge_metaphlan_tables.py script included in the current (version 1.7.2) repository.
Additionally, the MetaPhlAn output can now be visualized with Krona and with other software supporting the PhyloXML and BIOM format. This is possible using the conversion scripts included now in the MetaPhlAn package, kindly provided by Daniel Brami, James Taylor, and Sami Pietilä. A Krona visualization of the LC1 synthetic metagenome (see below) is available here.

The last version of MetaPhlAn (1.7) supports the (multi)fastq input format (when using BowTie2). This will allow more sensitive and precise mapping and avoids the need of converting the fastq file to fasta. With this modification, you can basically apply MetaPhlAn directly to the first sequencing output as low quality reads will not succesfully mapped to any marker and the same for non-microbial reads without affecting the abundances of the predicted microbial clades. No changes in the command line are required for using fastq files, but a non-local hit policy for BowTie2 (i.e. '--bt2_ps very-sensitive') is recommended for avoiding overly-sensitive hits when using non-preprocessed fastq files).
Obtaining MetaPhlAn (version 1.7.7)
The MetaPhlAn classifier can be obtained downloading the main MetaPhlAn script (metaphlan.py) here (2.5MB) together with the Blast database of the MetaPhlAn markers (3 files, 95MBs) or/and the BowTie2 database of the MetaPhlAn markers (6 files, 500MBs).
You can also obtain the whole package using hg:
$ hg clone https://bitbucket.org/nsegata/metaphlan
or downloading the compressed archive in zip, gz, or bz2 format.
Updates, tutorials and mailing list
Software updates are regularly posted on the bitbucket repository. You are more than welcome to use the Issue Tracking system on Bitbucket (or email us) to provide feedback, report bugs, and suggest/request new features.
If you would like to be notified about new version, new features, or any other news related to MetaPhlAn please join our mailing list:
We also started a FAQ page. For having new questions addressed in the FAQ page, and for any comments or questions feel free to contact us.
A tutorial guiding the user into the application of MetaPhlAn and into some downstream additional analyses is available here.
Contacts and citation
MetaPhlAn has been successfully applied on hundreds of metagenomes for more than 10 billion reads as described in the following paper that we kindly ask to cite if you find MetaPhlAn useful in your research:
"Metagenomic microbial community profiling using unique clade-specific marker genes"
Nicola Segata, Levi Waldron, Annalisa Ballarini, Vagheesh Narasimhan, Olivier Jousson, Curtis Huttenhower.
Nature Methods, 8, 811–814, 2012
Here is an infographic of the application of the Human Microbiome Project results obtained applying MetaPhlAn on the 690 shotgun sequencing samples. Email me for a high-resolution version. This infographic also appears in a slightly modified version as the main illustration of a New York Times article by Carl Zimmer available here (NY Times subscription needed) and here (NY Times copyrighted version), and as Figure 2 in our Trends in Genomics review paper.

Galaxy interface for MetaPhlAn
From version 1.6, MetaPhlAn is also available inside Galaxy at http://huttenhower.sph.harvard.edu/galaxy
Please note that the Galaxy module is providing fewer options than the command
line version because it is meant to be as easy-to-use as possible and, for
computational reasons, only the BowTie2 internal reads-to-markers matching
engine is supported.
If you like, feel free to add MetaPhlAn into your local Galaxy installation. The source file for the galaxy module is here together with detailed information about the operation. The module is also available in the Galaxy Tool Shed
Many thanks to George Weingart, Dannon Baker, and James Taylor for integrating MetaPhlAn into Galaxy!
Supplementary material
We report here the supplementary material for the MetaPhlAn article, which consists of the following files:
- The MetaPhlAn 1.0 unique marker database
- The MetaPhlAn 1.0 profiling of the 51 vaginal metagenomic samples from the Human Microbiome Project. All taxa with an abundance of 0.5% in at least one sample are reported.
- The MetaPhlAn 1.0 profiling of the 224 gut metagenomic samples from asymptomatic subjects seuqenced by the HMP and MetaHIT projects. All taxa with an abundance of 0.5% in at least one sample are reported.
MetaPhlAn profiling of large metagenomic cohorts
We profiled nearly 1,000 metagenomes so far from diverse environments including several human bodysites, mouse guts, several marine sites, soil samples, and food-associated communities.
The 690 Human Microbiome Project metagenomic samples from 5 major human body areas are also available as HMP final results and a graphical summary is reported in the heatmap below.
We also profiled the 124 MetaHIT gut metagenomes and we provide the obtained table of abundances.
If you need to profile a large metagenomic cohort, we are happy to share our experience in doing that.
The following heatmap represents the species-level taxonomic composition of the 690 Human Microbiome Project shotgun sequencing samples. For space reasons, only the 64 most abundant species are reported.

Synthetic metagenomes for validation
We used 10 synthetic metagenomes to evaluate MetaPhlAn (in addition to other real-word benchmarking metagenomes). We generated two high complexity (100 different organisms) evenly distributed (all organisms are at 1% relative abundance) called HC1 and HC2 with 1,000,000 noisy short reads each. Eight low-complecity (25 organisms) non-evenly distributed (log-normal model) metagenomes comprising 250,000 reads each have also been developed and are available for doownload: LC1, LC2, LC3, LC4, LC5, LC6, LC7, LC8.
Common commands
Profiling a metagenome from raw reads using Blast (requires BLAST 2.2.25+ and the blast marker DB provided with MetaPhlAn):
$ metaphlan.py metagenome.fasta --blastdb blastdb/mpa
You can save a lot of computational time if you perform the blasting using BowTie2 instead of Blast (requires BowTie2 in the system path with execution and read permissions, Perl installed, and the BowTie2 marker DB provided with MetaPhlAn):
$ metaphlan.py metagenome.fasta --bowtie2db bowtie2db/mpa
When possible, it is recommended to use fastq files that will increase the mapping accuracy with BowTie2. No changes in the command line are required for using fastq files, but a non-local hit policy for BowTie2 (i.e. '--bt2_ps sensitive') is recommended for avoiding overly-sensitive hits:
$ metaphlan.py metagenome.fastq --bowtie2db bowtie2db/mpa --bt2_ps sensitive
You can take advantage of multiple CPUs and you can save the blast output for re-running MetaPhlAn extremely quickly:
$ metaphlan.py metagenome.fasta --blastdb blastdb/mpa --nproc 5 --blastout metagenome.outfmt6.txt
and the same for BowTie2:
$ metaphlan.py metagenome.fastq --bowtie2db bowtie2db/mpa --nproc 5 --bowtie2out metagenome.bt2out.txt
If you already blasted your metagenome against the marker DB (using MetaPhlAn or blastn/BowTie2 alone) you can obtain the results in few seconds:
$ metaphlan.py --input_type blastout metagenome.outfmt6.txt
(notice that 'blastout' define a file format independent from NCBI Blast and common to BowTie2 as well)
When using BowTie2 the metagenome can also be passed from the standard input but it is necessary to specify the input format explicitly:
$ tar xjz metagenome.tar.bz2 --to-stdout | metaphlan.py --input_type multifastq --blastdb blastdb/mpa
Also the pre-computed blast/BowTie2 output can be provided with a pipe (again specifying the input type):
$ metaphlan.py --input_type blastout < metagenome.outfmt6.txt > profiling_output.txt
You can also set advanced options for the BowTie2 step selecting the preset option among 'sensitive','very-sensitive','sensitive-local','very-sensitive-local' (valid for metagenome as input only):
$ metaphlan.py --bt2_ps very-sensitive-local metagenome.fasta
Gor for Blast the main configurable option is the threshold on the evalue (default 1e-5, we strongly recommend not lowering it too much):
$ metaphlan.py --evalue 1e-7 < metagenome.outfmt6.txt > profiling_output.txt
If you suspect that the metagenome contains unknown clades, you may obtain more accurare result with a more sensible blast search lowering the blast word_size (the blasting will be slower):
$ metaphlan.py metagenome.fna --word_size 12 > profiling_output.txt
All command line options and parameters
$ metaphlan.py -h
usage: metaphlan.py [-h] [-v] [-t ANALYSIS TYPE] [--tax_lev TAXONOMIC_LEVEL]
[--nreads NUMBER_OF_READS] [--pres_th PRESENCE_THRESHOLD]
[--blastdb METAPHLAN_BLAST_DB]
[--bowtie2db METAPHLAN_BOWTIE2_DB] [--evalue]
[--word_size] [--bt2_ps BowTie2 presets] [--tmp_dir]
[--min_cu_len]
[--input_type {automatic,multifasta,multifastq,blastout}]
[--stat_q] [--blastn_exe BLASTN_EXE]
[--bowtie2_exe BOWTIE2_EXE] [--blastout FILE_NAME]
[--bowtie2out FILE_NAME] [--no_map] [-o output file]
[--nproc N]
[INPUT_FILE] [OUTPUT_FILE]
DESCRIPTION
MetaPhlAn version 1.7.7 (28 February 2013): METAgenomic PHyLogenetic ANalysis for
taxonomic classification of metagenomic reads.
AUTHORS: Nicola Segata (nsegata@hsph.harvard.edu)
COMMON COMMANDS
* Profiling a metagenome from raw reads using Blast (requires BLAST 2.2.25+ and the
blast marker DB provided with MetaPhlAn):
metaphlan.py metagenome.fasta --blastdb blastdb/mpa
* You can save a lot of computational time if you perform the blasting using BowTie2
instead of Blast (requires BowTie2 in the system path with execution and read
permissions, Perl installed, and the BowTie2 marker DB provided with MetaPhlAn):
metaphlan.py metagenome.fasta --bowtie2db bowtie2db/mpa
* When possible, it is recommended to use fastq files that will increase the mapping
accuracy with BowTie2. No changes in the command line are required for using fastq
files, but a non-local hit policy for BowTie2 (i.e. '--bt2_ps sensitive') is
recommended for avoiding overly-sensitive hits:
metaphlan.py metagenome.fastq --bowtie2db bowtie2db/mpa --bt2_ps sensitive
* you can take advantage of multiple CPUs and you can save the blast output
for re-running MetaPhlAn extremely quickly:
metaphlan.py metagenome.fasta --blastdb blastdb/mpa --nproc 5 --blastout metagenome.outfmt6.txt
and the same for BowTie2:
metaphlan.py metagenome.fastq --bowtie2db bowtie2db/mpa --nproc 5 --bowtie2out metagenome.bt2out.txt
* if you already blasted your metagenome against the marker DB (using MetaPhlAn
or blastn/BowTie2 alone) you can obtain the results in few seconds:
metaphlan.py --input_type blastout metagenome.outfmt6.txt
(notice that 'blastout' define a file format independent from NCBI Blast and
common to BowTie2 as well)
* When using BowTie2 the metagenome can also be passed from the standard input but
it is necessary to specify the input format explicitly:
tar xjz metagenome.tar.bz2 --to-stdout | metaphlan.py --input_type multifastq --blastdb blastdb/mpa
* Also the pre-computed blast/BowTie2 output can be provided with a pipe (again
specifying the input type):
metaphlan.py --input_type blastout < metagenome.outfmt6.txt > profiling_output.txt
* you can also set advanced options for the BowTie2 step selecting the preset option
among 'sensitive','very-sensitive','sensitive-local','very-sensitive-local'
(valid for metagenome as input only):
metaphlan.py --bt2_ps very-sensitive-local metagenome.fasta
* for for Blast the main configurable option is the threshold on the evalue
(default 1e-5, we strongly recommend not lowering it too much):
metaphlan.py --evalue 1e-7 < metagenome.outfmt6.txt > profiling_output.txt
* if you suspect that the metagenome contains unknown clades, you may obtain
more accurare result with a more sensible blast search lowering the blast
word_size (the blasting will be slower):
metaphlan.py metagenome.fna --word_size 12 > profiling_output.txt
positional arguments:
INPUT_FILE the input file can be:
* a multi-fasta file containing metagenomic reads
OR
* a NCBI BLAST output file (-outfmt 6 format) of the metagenome against the MetaPhlAn database.
OR
* a BowTie2 output file of the metagenome generated by a previous MetaPhlAn run
The software will recognize the format automatically.
If the input file is missing, the script assumes that the input is provided using the standard
input, and the input format has to be specified with --input_type
OUTPUT_FILE the tab-separated output file of the predicted taxon relative abundances
[stdout if not present]
optional arguments:
-h, --help show this help message and exit
-v, --version Prints the current MetaPhlAn version and exit
-t ANALYSIS TYPE Type of analysis to perform:
* rel_ab: profiling a metagenomes in terms of relative abundances
* reads_map: mapping from reads to clades (only reads hitting a marker)
* clade_profiles: normalized marker counts for clades with at least a non-null marker
* marker_ab_table: normalized marker counts (only when > 0.0 and normalized by metagenome size if --nreads is specified)
* marker_pres_table: list of markers present in the sample (threshold at 1.0 if not differently specified with --pres_th
[default 'rel_ab']
--tax_lev TAXONOMIC_LEVEL
The taxonomic level for the relative abundance output:
'a' : all taxonomic levels
'k' : kingdoms (Bacteria and Archaea) only
'p' : phyla only
'c' : classes only
'o' : orders only
'f' : families only
'g' : genera only
's' : species only
[default 'a']
--nreads NUMBER_OF_READS
The total number of reads in the original metagenome. It is used only when
-t marker_table is specified for normalizing the length-normalized counts
with the metagenome size as well. No normalization applied if --nreads is not
specified
--pres_th PRESENCE_THRESHOLD
Threshold for calling a marker present by the -t marker_pres_table option
--blastdb METAPHLAN_BLAST_DB
The blast database file of the MetaPhlAn database
--bowtie2db METAPHLAN_BOWTIE2_DB
The BowTie2 database file of the MetaPhlAn database
--evalue evalue threshold for the blasting
[default 1e-6]
--word_size word_size value for the blasting
[default NCBI BlastN default]
--bt2_ps BowTie2 presets
presets options for BowTie2 (applied only when a multifasta file is provided)
The choices enabled in MetaPhlAn are:
* sensitive
* very-sensitive
* sensitive-local
* very-sensitive-local
[default very-sensitive-local]
--tmp_dir the folder used to store temporary files
[default is the OS dependent tmp dir]
--min_cu_len minimum total nucleotide length for the markers in a clade for
estimating the abundance without considering sub-clade abundances
[default 10000]
--input_type {automatic,multifasta,multifastq,blastout}
set wheter the input is the multifasta file of metagenomic reads or
the blast output (outfmt 6 format) of the reads against the MetaPhlAn db.
[default 'automatic', i.e. the script will try to guess the input format]
--stat_q Quantile value for the robust average
[default 0.1]
--blastn_exe BLASTN_EXE
Full path and name of the blastn executable. This option allows
MetaPhlAn to reach the executable even when it is not in the system
PATH or the system PATH is unreachable
--bowtie2_exe BOWTIE2_EXE
Full path and name of the BowTie2 executable. This option allows
MetaPhlAn to reach the executable even when it is not in the system
PATH or the system PATH is unreachable
--blastout FILE_NAME The file for saving the output of the blasting (in outfmt 6 format)
--bowtie2out FILE_NAME
The file for saving the output of BowTie2
--no_map Avoid storing the --blastout (or --bowtie2out) map file
-o output file, --output_file output file
The output file (if not specified as positional argument)
--nproc N The number of CPUs to use for parallelizing the blasting
[default 1, i.e. no parallelism]
Acknowledgements
The authors of MetaPhlAn would like to thank Frank Stewart and Ed DeLong for their helpful input during this study, Dirk Gevers, Katherine Huang, and Sean Sykes for feedback on the methodology, Joshua Reyes for his help with validation and testing, George Weingart, Dannon Baker, and James Taylor for their work for integrating MetaPhlAn into Galaxy, Tim Tickle, Daniel Brami, and Sami Pietilä for providing conversion scripts.
Change log
Changes in Version 1.7.7 (28 February 2013)
New supporting feature - the script conversion_scripts/metaphlan2biom.py in conversion_scripts/ kindly implemented by Sami Pietila allows you to export MetaPhlAn output tables (merged with utils/merge_metaphlan_tables.py) into the BIOM format. The BIOM format is the standard format for several post-processing programs and analyses such as Qiime.
Changes in Version 1.7.6 (27 January 2013)
Bug fix - now supporting --input_type bowtie2out that need to be specified as --input_type blastout before
Changes in Version 1.7.5 (14 January 2013)
New supporting features: - "-t marker_pres_table" outputs the list of markers with non-null abundance (i.e. present in the sample). The threshold for presence/absence call is specified via "--pres_th" - "-t marker_ab_table" outputs the normalize marker counts (multiplied by 1,000). The value is normalized by the size of the metagenome (to compare samples with different sequencing depth) if --nreads is specified
Changes in Version 1.7.4 (31 December 2012)
New supporting feature: - the script metaphlan_hclust_heatmap.py in plotting_scripts/ allows you to generate hierarchical clustering and heatmap visualization of multiple MetaPhlAn profiles. See plotting_scripts/readme.txt for details
Changes in Version 1.7.3 (10 December 2012)
New feature:
- MetaPhlAn now accepts metagenomes from the standard input (when using BowTie2). When passing the input from the standard input, it is now required to specify the input type ('multifasta', 'multifastq', or 'blastout'). When using a file as input, the input type is still guessed by MetaPhlAn automatically.
Changes in Version 1.7.2 (23 October 2012)
New supporting features - multiple MetaPhlAn output obtained on distinct samples can now be easily merged into a unique "microbial clades vs samples" abundance table using the merge_metaphlan_tables.py script in the "utils" folder (thanks to Tim Tickle) - the MetaPhlAn output can now be visualized with Krona and with other software supporting the PhyloXML format, using the scritps in the "conversion_scripts" folder (thanks to Daniel Brami and James Taylor)
Changes in Version 1.7.1 (15 October 2012)
Bug Fix - two aspecific genes removed from the database (thanks to Nick Loman)
Changes in version 1.7.0 (26 July 2012)
New features - added support for fastq files. Option available for analysis with BowTie2 only as blastn does not support fastq format. When possible, it is recommended to use fastq rather than fasta because BowTie2 will take into account the single nucleotide quality score contained in fastq format in the alignment evaluation. MetaPhlAn will automatically detect the fastq format so no changes in the command line options are required when using fastq files.
Changes in version 1.6.0 (28 May 2012)
New features: - added a Galaxy module that integrates MetaPhlAn into the galaxy environment. Source code and instructions for the module are provided, and the module currently runs at http://huttenhower.sph.harvard.edu/galaxy - added --blastn_exe and --bowtie2_exe command line options for specifying the absolute path of the blastn and bowtie2 executables. This is useful when MetaPhlAn is used in a system without root priviledges and thus blastn and and bowtie2 have to be installed locally and not included into the system path. - added the --no_map command line option for avoiding the blastn or bowtie2 output file to be saved on disk
Changes in version 1.5.2 (13 April 2012)
Bug fix: - division by zero bug fixed: the bug was occurring when no hits were detected for any of the reads
Changes in version 1.5.1 (5 April 2012)
New options: - option -o added for specifying the output file with a non-positional command line argument
Changes in version 1.5.0 (April 2012)
New Features: - added support for BowTie2 in addition to Blast. Bowtie2 runs at 10,000 reads-per-second (20,000 with 'sensitive' instead of 'very-sensitive-local') with accuracy performances at least comparable to blastn -- '--bowtie2out' and '--bowtie2db' are the BowTie2 options corresponding to the '--blastout' and '--blastdb' blast options -- support for the BowTie2 metaphlan output which is is simply a two-column file of reads/marker assignements. If called outside MetaPhlAn, BowTie2 should be called as bowtie2 --quiet --sam-no-hd --sam-no-sq --very-sensitive-local --local -x bowtie2db/mpa -f -U metagenome.fna | cut -f 1,3 | grep -P -v "\*$" -- the BowTie2 MetaPhlAn database in in bowtie2db, is called mpa, and consists of 6 files -- four preset BowTie2 options are available (--bt2_ps option): --very-sensitive, --very-sensitive, --sensitive-local, --very-sensitive-local - help message 'metaphlan.py -h' expanded - '--version' option added Bug fixes: - Fixed the error message for missing blastdb file (thanks to Joshua Reyes)
Changes in version 1.1.0
- no need to provide the MetaPhlAn data file anymore (thus --lib_dir has been removed) - a robust average is introduced to more accurately estimate taxonomic relative abundances and dramatically lower false positives (--stat_q specifies the quantile for the truncated mean) - the "word_size" option for blastn has been added in order to handle more accurately metagenomes without closely related reference genomes - a new analysis type (-t clade_profiles) has been introduced in order to estimate normalized read counts for markers in each clade - when using MetaPhlAn on blasting outputs, the input file can be provided using a pipe - a more comprehensive help message has been added (see metaphlan.py -h) - the code has been re-factored and made more robust for future developments - a slightly modified blastdb has been developed. As a consequence, the database (both multifasta and blastdb) of version 1.0 is not compatible with version 1.1
