MetaPhlAn Tutorial

** The tutorial slides are now available here **

This page contains information about the tutorials on MetaPhlAn and LEfSe that will take place at the Symposium and Workshop on New Methods for Phylogenomics and Metagenomics at The University of Texas at Austin, February 16 and 17, 2013. MetaPhlAn and LEfSe were created by Nicola Segata and collaborators in the Huttenhower Lab at the Harvard School of Public Health.

For more information about this tutorial, please contact the presenter, Eric Franzosa (franzosa@hsph.harvard.edu).

Tutorial Part I: MetaPhlAn


MetaPhlAn (Metagenomic Phylogenetic Analysis) is a computational tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data. MetaPhlAn relies on unique clade-specific marker genes identified from 3,000 reference genomes, allowing:

  • analysis speeds of up to 25,000 reads per second on a single CPU, orders of magnitude faster than existing methods;
  • unambiguous taxonomic assignments as the MetaPhlAn markers are clade-specific;
  • accurate estimation of organismal relative abundance (in terms of number of cells rather than fraction of reads);
  • species-level resolution for bacterial and archaeal organisms;
  • extensive validation of the profiling accuracy on several synthetic datasets and on thousands of real metagenomes;
  • Coming soon: Simultaneous profiling of the bacterial, archaeal, fungal, and viral domains of life.
  • Coming soon: Automatic visualization and analysis of the profiled metagenomes.
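
The difference between a cell-based abundance estimate and a read-based one can be illustrated with a simplified sketch. This is not MetaPhlAn's actual implementation: the function name, clade names, read counts, and marker lengths are all made up for illustration. The only idea shown is that the reads mapped to a clade's markers are divided by the total marker length before the values are rescaled to sum to 1.

```python
def relative_abundance(marker_hits, marker_lengths):
    """Length-normalize per-clade read counts, then rescale so they sum to 1.

    marker_hits:    clade -> number of reads mapped to that clade's markers
    marker_lengths: clade -> total length (in bases) of that clade's markers
    Both dictionaries and their contents are hypothetical.
    """
    coverage = {}
    for clade in marker_hits:
        # float() keeps the division exact under Python 2.7 as well as Python 3.
        coverage[clade] = float(marker_hits[clade]) / marker_lengths[clade]
    total = sum(coverage.values())
    if total == 0:
        return dict((clade, 0.0) for clade in coverage)
    return dict((clade, value / total) for clade, value in coverage.items())


# Hypothetical example: both clades attract the same number of reads,
# but clade_B has three times as much marker sequence.
hits = {"clade_A": 300, "clade_B": 300}
lengths = {"clade_A": 10000, "clade_B": 30000}
abundance = relative_abundance(hits, lengths)
for clade in sorted(abundance):
    print(clade, round(abundance[clade], 3))
```

In this toy case a read-fraction estimate would call the two clades equally abundant, while the length-normalized estimate assigns roughly three quarters of the community to clade_A.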

Getting started with MetaPhlAn

You can try out MetaPhlAn using the Huttenhower Lab Galaxy webserver. To run MetaPhlAn on your own computer, download the source code from the MetaPhlAn website or from the MetaPhlAn Bitbucket repository. MetaPhlAn requires Python 2.7 or higher with the argparse, tempfile, and numpy libraries installed, and either BLAST+ or Bowtie2 available on your PATH for mapping reads to the marker data.
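
Before running MetaPhlAn locally, it can help to confirm that the requirements listed above are in place. The following is a minimal sketch, not part of MetaPhlAn: the function name is hypothetical, and it assumes the aligner executables are named bowtie2 (Bowtie2) and blastn (BLAST+). It uses shutil.which, which is only available in Python 3.3 and later.

```python
import importlib
import shutil


def check_requirements(modules=("argparse", "tempfile", "numpy"),
                       aligners=("bowtie2", "blastn")):
    """Return (missing Python modules, aligner executables found on PATH)."""
    missing_modules = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing_modules.append(name)
    # shutil.which returns None when the executable is not on PATH.
    found_aligners = [exe for exe in aligners if shutil.which(exe)]
    return missing_modules, found_aligners


missing, found = check_requirements()
print("Missing Python modules:", ", ".join(missing) or "none")
print("Aligners found on PATH:", ", ".join(found) or "none (install BLAST+ or Bowtie2)")
```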

Sample Data

The sample data include multifasta files containing down-sampled metagenomic reads derived from the body sites of a participant in the Human Microbiome Project. You can analyze these data with MetaPhlAn to profile the relative taxonomic abundances of the microbial communities present at these sites.
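
As a quick sanity check before profiling, you may want to see how many reads a multifasta file contains. The sketch below is an illustration only: the function name and the example reads are made up, and it assumes standard FASTA formatting in which each record begins with a ">" header line.

```python
def fasta_summary(lines):
    """Count records and total sequence length in an iterable of FASTA lines."""
    n_reads = 0
    n_bases = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_reads += 1  # each header line starts a new read
        else:
            n_bases += len(line)  # sequence may span several lines
    return n_reads, n_bases


# Hypothetical reads; the second record is wrapped across two lines.
example = [">read1", "ACGTACGT", ">read2", "GGCC", "TTAA"]
print(fasta_summary(example))  # prints (2, 16)
```

Because the function accepts any iterable of lines, an open file handle for one of the sample files can be passed to it directly.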

Additional Resources

You can learn more about MetaPhlAn from the main MetaPhlAn website, the original MetaPhlAn paper, and the MetaPhlAn usergroup.

Tutorial Part II: LEfSe


In the second part of the tutorial, we will explore the output of MetaPhlAn analyses using LEfSe. LEfSe (LDA Effect Size) is an algorithm for high-dimensional biomarker discovery and explanation. LEfSe identifies genomic features characterizing the differences between two or more biological conditions or classes. LEfSe allows researchers to identify differentially abundant features that are also consistent with biologically meaningful categories (subclasses) by:

  • first robustly identifying features that are statistically different among biological classes;
  • then performing additional tests to assess whether these differences are consistent with respect to expected biological behavior;
  • graphically reporting the discovered biomarkers and their effect sizes;
  • displaying reports in cladogram format for hierarchically organized features (e.g. MetaPhlAn output);
  • plotting additional histograms illustrating feature abundance information with emphasis on class and subclass structure.
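
To give a feel for the first step in the list above, the published description of LEfSe uses the non-parametric Kruskal-Wallis test to detect features that differ among classes. The sketch below is not LEfSe's code: it computes only the Kruskal-Wallis H statistic for a single feature, without tie correction or a p-value, and the function name and abundance values are hypothetical. It assumes every class contains at least one sample.

```python
def kruskal_wallis_h(groups):
    """Return the Kruskal-Wallis H statistic (no tie correction).

    groups: list of non-empty lists, one list of abundances per class.
    """
    # Pool all values, remembering which group each one came from.
    pooled = sorted((value, index)
                    for index, values in enumerate(groups)
                    for value in values)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n_total:
        # Find the run of tied values starting at position i.
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and all receive the average rank.
        average_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            rank_sums[pooled[k][1]] += average_rank
        i = j
    weighted = sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Hypothetical abundances of one taxon in three classes (e.g. body sites).
feature = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(round(kruskal_wallis_h(feature), 2))  # perfectly separated classes give H of about 7.2
```

A larger H indicates stronger separation among the classes; in practice H is compared against a chi-squared distribution to obtain a p-value, which this sketch omits.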

While we will focus today on analyzing taxonomic profiles inferred from metagenomic sequencing, LEfSe is suitable for analyzing a range of meta'omics profile data, including 16S sequencing, metatranscriptomics, metaproteomics, and metametabolomics.

Getting started with LEfSe

You can try out LEfSe using the Huttenhower Lab Galaxy webserver. To run LEfSe on your own computer, download the source code from the LEfSe Bitbucket repository. LEfSe requires R with the splines, stats4, survival, mvtnorm, modeltools, coin, and MASS libraries, and Python 2.7 or higher with the rpy2, numpy, matplotlib, and argparse libraries.

Sample Data

The sample data include a table summarizing the results of running MetaPhlAn on metagenomic data from the body sites of multiple participants in the Human Microbiome Project. Each column represents the taxonomic profile of a single body site in a single participant. You can use LEfSe to identify taxa that are differentially abundant between the body sites across individuals.
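
To get oriented in a table of this shape before running LEfSe, you can group the columns by body site and compare mean abundances. The sketch below is only an illustration and does not reproduce LEfSe's input format or statistics: it assumes a tab-delimited layout with taxa in rows and sample IDs in the header row, and the taxon names, sample IDs, values, and the sample-to-site mapping are all hypothetical. It is written for Python 3.

```python
import csv
import io
from collections import defaultdict

# Hypothetical table: rows are taxa, columns are samples
# (one body site in one participant per column).
TABLE = (
    "taxon\tsubj1_stool\tsubj1_tongue\tsubj2_stool\tsubj2_tongue\n"
    "taxon_X\t60.0\t5.0\t40.0\t15.0\n"
    "taxon_Y\t10.0\t70.0\t20.0\t50.0\n"
)

# Hypothetical mapping from sample ID to body site (the class label).
SITES = {
    "subj1_stool": "stool",
    "subj2_stool": "stool",
    "subj1_tongue": "tongue",
    "subj2_tongue": "tongue",
}


def mean_by_site(table_text, sample_to_site):
    """Return taxon -> {body site -> mean abundance across participants}."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    samples = next(reader)[1:]  # header row, minus the taxon column
    means = {}
    for row in reader:
        by_site = defaultdict(list)
        for sample, value in zip(samples, row[1:]):
            by_site[sample_to_site[sample]].append(float(value))
        means[row[0]] = {site: sum(v) / len(v) for site, v in by_site.items()}
    return means


site_means = mean_by_site(TABLE, SITES)
for taxon in sorted(site_means):
    print(taxon, sorted(site_means[taxon].items()))
```

Taxa whose means differ sharply between sites, such as the two made-up taxa here, are the kind of candidates that LEfSe then evaluates with its statistical tests.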

Additional Resources

You can learn more about LEfSe from the original paper.