Lateral gene transfer (LGT) is an important mechanism for genome diversification in microbial communities, including the human microbiome. While methods exist to identify LGTs from sequenced isolate genomes, identifying LGTs from community metagenomes remains an open problem. To address this, we developed WAAFLE: a Workflow to Annotate Assemblies and Find LGT Events.



The WAAFLE manuscript has been submitted!

Tiffany Y. Hsu*, Etienne Nzabarushimana*, Dennis Wong, Chengwei Luo, Robert G. Beiko, Morgan Langille, Curtis Huttenhower, Long H. Nguyen**, Eric A. Franzosa**.

Profiling novel lateral gene transfer events in the human microbiome.

(Submitted.) [* = co-lead; ** = co-supervised]

In the meantime, if you use WAAFLE in your work, please cite the WAAFLE repository on GitHub: https://github.com/biobakery/waafle.

Install WAAFLE and its databases
    • Download the WAAFLE software
      • $ pip install waafle
    • Download the WAAFLE blastn database and taxonomy file
    • Unpack the blastn database
      • $ tar xzf waafledb.tar.gz
  • Python 3+ or 2.7+
  • Python numpy (tested with v1.13.3)
  • NCBI BLAST+ (tested with v2.6.0)
  • bowtie2 (for performing read-level QC; tested with v2.2.3)
Screen metagenomic contigs for LGT
  • You will need a multi-FASTA file containing metagenomic contigs (called contigs.fna in the commands below).
  • Search your contigs against the WAAFLE database:
    • $ waafle_search contigs.fna waafledb/waafledb
    • This creates contigs.blastout (BLAST hits)
  • Identify ORFs from your contigs and BLAST results:
    • $ waafle_genecaller contigs.blastout
    • This creates contigs.gff (gene calls)
  • Taxonomically classify contigs and find LGT events:
    • $ waafle_orgscorer contigs.fna contigs.blastout contigs.gff waafledb_taxonomy.tsv
    • This creates contigs.no_lgt.tsv (single-clade contigs)
    • This creates contigs.lgt.tsv (putative LGT events)
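
The three steps above can be chained from a small script. The following is a minimal sketch, not part of WAAFLE: the helper name `waafle_commands` is hypothetical, and it assumes that each step writes its output to the working directory under the input's base name (contigs.fna giving contigs.blastout and contigs.gff, as listed above). It only assembles the command lines; nothing is executed unless you pass them to `subprocess.run` yourself.

```python
from pathlib import Path


def waafle_commands(contigs, blast_db, taxonomy):
    """Return the three WAAFLE steps, in order, as argument lists.

    Assumption: outputs are named after the input's base name and
    written to the working directory (contigs.fna -> contigs.blastout).
    """
    stem = Path(contigs).stem            # "contigs.fna" -> "contigs"
    blastout = f"{stem}.blastout"        # produced by waafle_search
    gff = f"{stem}.gff"                  # produced by waafle_genecaller
    return [
        ["waafle_search", str(contigs), blast_db],
        ["waafle_genecaller", blastout],
        ["waafle_orgscorer", str(contigs), blastout, gff, taxonomy],
    ]


for cmd in waafle_commands("contigs.fna", "waafledb/waafledb", "waafledb_taxonomy.tsv"):
    print("$ " + " ".join(cmd))
    # To actually run a step (requires WAAFLE and BLAST+ on PATH):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes it easy to inspect the commands for a dry run before launching a long BLAST search.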
Getting started with WAAFLE

WAAFLE integrates gene sequence homology and taxonomic provenance to identify metagenomic contigs explained by pairs of microbial clades but not by single clades (i.e. putative LGTs). More specifically, for each locus in a contig, WAAFLE identifies the best hit to each species in a pangenome database. WAAFLE then looks for a species whose minimum per-locus score exceeds a lenient homology threshold (k1). If one or more species meet this criterion, then the contig is assigned to the species with the best average score. Otherwise, the process is repeated for pairs of species. If all per-locus scores for a pair of species exceed a stringent homology threshold (k2), then the contig is considered a putative LGT between those species.
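
The decision procedure described above can be sketched in a few lines of Python. This is an illustration of the logic as stated in this section, not WAAFLE's implementation. The function name, the score values, and the thresholds (k1 = 0.5, k2 = 0.8) are hypothetical. Two points are assumptions of the sketch: a pair's score at a locus is taken to be the better of its two species' scores, and ties among several qualifying pairs are broken by the best average score.

```python
from itertools import combinations


def classify_contig(scores, k1, k2):
    """Classify a contig from per-locus best-hit scores.

    scores: dict mapping species -> list of per-locus scores
            (same length for every species; 0.0 where there is no hit).
    Returns (label, species_tuple).
    """
    # One-species explanation: every locus scores above the lenient k1.
    singles = [sp for sp, vals in scores.items() if min(vals) > k1]
    if singles:
        best = max(singles, key=lambda sp: sum(scores[sp]) / len(scores[sp]))
        return ("one_species", (best,))

    # Two-species explanation: at every locus, the better of the two
    # species must score above the stringent k2 (assumed interpretation).
    best_pair, best_avg = None, -1.0
    for a, b in combinations(sorted(scores), 2):
        per_locus = [max(x, y) for x, y in zip(scores[a], scores[b])]
        if min(per_locus) > k2:
            avg = sum(per_locus) / len(per_locus)
            if avg > best_avg:               # tie-break: assumed
                best_pair, best_avg = (a, b), avg
    if best_pair is not None:
        return ("putative_lgt", best_pair)
    return ("unexplained", ())


# Illustrative scores for a six-locus contig (made up for this sketch).
single_clade = {
    "A": [0.9, 0.9, 0.1, 0.1, 0.9, 0.9],
    "B": [0.1, 0.1, 0.95, 0.95, 0.1, 0.1],
    "C": [0.7, 0.6, 0.7, 0.7, 0.6, 0.7],
}
two_clade = dict(single_clade, C=[0.7, 0.3, 0.7, 0.7, 0.3, 0.7])

print(classify_contig(single_clade, k1=0.5, k2=0.8))  # ('one_species', ('C',))
print(classify_contig(two_clade, k1=0.5, k2=0.8))     # ('putative_lgt', ('A', 'B'))
```

Because the one-species test runs first, a weaker single-clade explanation always wins over a stronger two-clade one, which is what keeps the procedure conservative.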

Consider the following pair of examples:

Both cases consider contigs with six protein-coding loci (determined by WAAFLE itself or by an independent ORF-calling program such as Prodigal). In Example 1, genes from species C are able to explain all of the loci reasonably well (with scores exceeding k1). Hence, WAAFLE will report this contig as a one-species contig explained by species C.

In Example 2, no single species can explain all of the loci (the minimum score for each species is below k1). However, the pair of species A and B have strong hits (>k2) to all loci, and so WAAFLE concludes that this contig may represent an A+B LGT. Given the AABBAA synteny pattern, a B-to-A transfer would appear to be the more likely mechanism.

Note that in Example 2, if species C had hits to the 2nd and 5th loci that exceeded k1 (as in Example 1), WAAFLE’s algorithm would conservatively favor the weaker one-species explanation for the contig rather than invoking a two-species (LGT-based) explanation.
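
The synteny reasoning used in Example 2 (AABBAA suggesting a B-to-A transfer) can be expressed as a simple heuristic: if the loci of one species flank a single run of loci from the other, treat the flanking species as the recipient and the interior species as the donor. The sketch below is an assumption-laden illustration of that idea only; the function name is hypothetical and WAAFLE's actual directionality rules may differ.

```python
from itertools import groupby


def infer_direction(synteny):
    """Guess (donor, recipient) from a per-locus species pattern.

    synteny: string with one species label per locus, e.g. "AABBAA".
    Returns None when the pattern does not have the simple
    recipient-donor-recipient shape (direction left unresolved).
    """
    runs = [label for label, _ in groupby(synteny)]  # "AABBAA" -> ["A", "B", "A"]
    if len(runs) == 3 and runs[0] == runs[2]:
        return (runs[1], runs[0])  # interior run = donor, flanks = recipient
    return None


print(infer_direction("AABBAA"))  # ('B', 'A')
print(infer_direction("AAABBB"))  # None
```

A pattern such as AAABBB has no flanking species, so the heuristic leaves its direction unresolved rather than guessing.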

WAAFLE is highly sensitive and specific

We evaluated WAAFLE on synthetic contigs with prespecified synteny patterns. Each synthetic contig was assembled from individual genes drawn from a pair of genomes, A and B. When A and B represent two different species, the contig counts as a positive (a true LGT) when calculating the true positive rate (TPR); the “level” of the LGT is the taxonomic level of the lowest common ancestor (LCA) of A and B, with intra-genus LGTs being the lowest level. When A and B represent two strains of the same species, the contig counts as a negative when calculating the false positive rate (FPR).
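
For concreteness, the two rates used in this evaluation can be computed as follows. This is a generic sketch of the standard definitions, not WAAFLE's evaluation code; the function name and the toy labels are made up for illustration.

```python
def tpr_fpr(truth, called):
    """Compute (TPR, FPR) from parallel lists of booleans.

    truth[i]  is True if synthetic contig i was built from two species.
    called[i] is True if the contig was reported as a putative LGT.
    """
    tp = sum(t and c for t, c in zip(truth, called))
    fn = sum(t and not c for t, c in zip(truth, called))
    fp = sum((not t) and c for t, c in zip(truth, called))
    tn = sum((not t) and (not c) for t, c in zip(truth, called))
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")  # sensitivity
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")  # 1 - specificity
    return tpr, fpr


# Toy data: four two-species contigs, four same-species contigs.
truth = [True, True, True, True, False, False, False, False]
called = [True, True, True, False, False, False, False, True]
print(tpr_fpr(truth, called))  # (0.75, 0.25)
```

Sensitivity is the TPR, and specificity is one minus the FPR.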

Even as fractions of the underlying species database were held out (from 0 to 20%), WAAFLE tended to remain >60% sensitive for LGTs at the family level or higher and >99% specific at all levels of taxonomic resolution.

LGT in the human microbiome

We are applying WAAFLE to quantify rates of LGT in the human microbiome using the HMP1-II dataset. Here, we report the LGT rates and assembly sizes at eight human body sites, each sampled from at least 20 healthy adults. This analysis conservatively counts only LGTs with i) known directionality, ii) an LCA above the genus level, and iii) genus-level resolution or better.



WAAFLE databases (publication versions)

Synthetic validation data

  • Synthetic data (contigs and annotations) used in the evaluation of WAAFLE:

HMP1-II contigs and LGT profiles

HMP2 contigs and LGT profiles

    • MEGAHIT assemblies of first-visit stool metagenomes from 26 HMP2 control subjects (contig sets individually compressed within tar file):
    • WAAFLE profiles of above HMP2 assemblies, post-quality control: