MetaWIBELE

MetaWIBELE (Workflow to Identify novel Bioactive Elements in the microbiome) is a workflow to efficiently and systematically identify and prioritize potentially bioactive (and often uncharacterized) gene products in microbial communities. It prioritizes candidate gene products from assembled metagenomes using a combination of sequence homology, secondary-structure-based functional annotations, phylogenetic binning, ecological distribution, and association with environmental parameters or phenotypes to target candidate bioactives.

User manual || Tutorial || Forum

Citation:
Yancong Zhang, Amrisha Bhosle, Sena Bae, Lauren McIver, Emma Accorsi, Kelsey N. Thompson, Cesar Arze, Lea Wang, Damian R. Plichta, Gholamali Rahnavard, Afrah Shafquat, Ayshwarya Subramanian, Ramnik J. Xavier, Hera Vlamakis, Wendy S. Garrett, Andy Krueger, Curtis Huttenhower*, Eric A. Franzosa*. "Identifying Novel Bioactive Microbial Gene Products in Inflammatory Bowel Disease." [in preparation]

And feel free to link to MetaWIBELE in your Methods: http://huttenhower.sph.harvard.edu/metawibele

Overview

Installation

Download MetaWIBELE

You can download the latest MetaWIBELE release or the development version. The source contains example files.

Option 1: Latest Release (Recommended)
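
  • For example, a release archive can be downloaded and unpacked from the project's GitHub releases page (the URL pattern and $VERSION tag below are illustrative; check the releases page for the current version):
    • $ wget https://github.com/biobakery/metawibele/archive/refs/tags/$VERSION.tar.gz
    • $ tar -xzf $VERSION.tar.gz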

Option 2: Development Version

  • Create a clone of the Git repository:
    • $ git clone https://github.com/biobakery/metawibele.git
  • You can always update to the latest version of the repository with:
    • $ git pull

Install MetaWIBELE

You only need to use one of the following options to install the MetaWIBELE package.

Option 1: Installing with pip

  • $ pip install metawibele
  • If you do not have write permissions to "/usr/lib/", then add the option "--user" to the install command. This will install the Python package into subdirectories of "~/.local/". Please note that when using the "--user" install option on some platforms, you might need to add "~/.local/bin/" to your $PATH, as it might not be included by default. You will know it needs to be added if you see the message "metawibele: command not found" when trying to run MetaWIBELE after installing with the "--user" option.
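  • For example, on a Linux system using the bash shell (the shell and startup file here are assumptions about your environment), the per-user script directory can be added to $PATH with:
    • $ echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
    • $ source ~/.bashrc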

Option 2: Installing with docker

  • $ docker pull biobakery/metawibele
  • This Docker image includes most of the dependent software packages.
  • Large software packages and those requiring licenses are not included in this image:
    • Software requiring a license: MSPminer, SignalP, TMHMM, Phobius, PSORTb
    • Large software: InterProScan
    • Users should review the license terms and install these packages manually.
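  • For example (the mount path below is illustrative, and it is assumed the image provides a bash shell), you can start an interactive session with your project directory mounted into the container:
    • $ docker run -it --rm -v $PWD:/work -w /work biobakery/metawibele bash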

Option 3: Installing with conda

  • $ conda install -c biobakery metawibele
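  • For example, you may prefer to install into a dedicated environment; the extra channels below (conda-forge, bioconda) are assumptions that commonly help resolve dependencies:
    • $ conda create -n metawibele -c biobakery -c conda-forge -c bioconda metawibele
    • $ conda activate metawibele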

Option 4: Installing from source

  • Move to the MetaWIBELE directory:
    • $ cd $MetaWIBELE_PATH
  • Install MetaWIBELE package:
    • $ python setup.py install
    • If you do not have write permissions to "/usr/lib/", then add the option "--user" to the install command. This will install the Python package into subdirectories of "~/.local/". Please note that when using the "--user" install option on some platforms, you might need to add "~/.local/bin/" to your $PATH, as it might not be included by default. You will know it needs to be added if you see the message "metawibele: command not found" when trying to run MetaWIBELE after installing with the "--user" option.
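  • Alternatively (not the documented command, but a common equivalent with recent versions of pip), the package can be installed from the source directory with:
    • $ pip install .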

Install databases

To run MetaWIBELE, you are required to install the dependent UniRef databases (both sequences and annotations). We have built these databases based on UniProt/UniRef90 2019_01 sequences and annotations. You can use any one of the following options to install them.

Option 1: Download and uncompress the databases, providing $UNIREF_LOCATION as the location where they are installed:

Option 2: Run the following commands to install the databases into the location $UNIREF_LOCATION:

  • download UniRef90 sequences indexed by Diamond v0.9.24 into $UNIREF_LOCATION:
    $ metawibele_download_database --database uniref --build uniref90_diamond --install-location $UNIREF_LOCATION
  • download UniRef90 annotation files into $UNIREF_LOCATION:
    $ metawibele_download_database --database uniref --build uniref90_annotation --install-location $UNIREF_LOCATION
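  • For example (the path below is illustrative), define the install location once and reuse it for both commands; the same location is then typically referenced when customizing the global configuration file:
    • $ export UNIREF_LOCATION=/opt/metawibele/uniref_databases
    • $ metawibele_download_database --database uniref --build uniref90_diamond --install-location $UNIREF_LOCATION
    • $ metawibele_download_database --database uniref --build uniref90_annotation --install-location $UNIREF_LOCATION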

Download configuration file

To run MetaWIBELE, you are required to customize the global configuration file "metawibele.cfg" and make sure that it's in your working directory. You can use one of the following options to get the configuration template.

Option 1: Obtain copies by right-clicking the link and selecting "save link as": metawibele.cfg

Option 2: Run this command to download the global configuration file:

  • $ metawibele_download_config --config-type global
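  • For example (the project directory below is illustrative), download the template into your working directory and confirm it is present before running any workflow:
    • $ cd /path/to/my_project
    • $ metawibele_download_config --config-type global
    • $ ls metawibele.cfg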

How to run

A typical analysis runs the MetaWIBELE-characterize workflow and then the MetaWIBELE-prioritize workflow for each dataset.

  • For a list of all available workflows, run:

    $ metawibele --help

    This command yields:

    usage: metawibele [-h] [--version] {characterize,prioritize,preprocess}
    
    MetaWIBELE workflows: A collection of AnADAMA2 workflows
    
    positional arguments:
    {characterize,prioritize,preprocess}    workflow to run
    
    optional arguments:
    -h, --help            show this help message and exit
    --version             show program's version number and exit
  • All workflows follow the general command format:

    $ metawibele $WORKFLOW

  • For the options specific to a workflow, run:

    $ metawibele $WORKFLOW --help

    For example: $ metawibele characterize --help

    This command yields:

    usage: characterize.py [-h] [--version] [--threads THREADS]
                       [--characterization-config CHARACTERIZATION_CONFIG]
                       [--mspminer-config MSPMINER_CONFIG]
                       [--bypass-clustering] [--bypass-global-homology]
                       [--bypass-domain-motif] [--bypass-interproscan]
                       [--bypass-pfamtogo] [--bypass-domine] [--bypass-sifts]
                       [--bypass-expatlas] [--bypass-psortb]
                       [--bypass-abundance] [--split-number SPLIT_NUMBER]
                       [--bypass-integration] [--study STUDY]
                       [--basename BASENAME] --input-sequence INPUT_SEQUENCE
                       --input-count INPUT_COUNT
                       [--input-metadata INPUT_METADATA] [--output OUTPUT]
                       [-i INPUT] [--config CONFIG] [--local-jobs JOBS]
                       [--grid-jobs GRID_JOBS] [--grid GRID]
                       [--grid-partition GRID_PARTITION]
                       [--grid-benchmark {on,off}]
                       [--grid-options GRID_OPTIONS]
                       [--grid-environment GRID_ENVIRONMENT]
                       [--grid-scratch GRID_SCRATCH] [--dry-run]
                       [--skip-nothing] [--quit-early]
                       [--until-task UNTIL_TASK] [--exclude-task EXCLUDE_TASK]
                       [--target TARGET] [--exclude-target EXCLUDE_TARGET]
                       [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
    
    A workflow for MetaWIBELE characterization
  • Run MetaWIBELE-characterize workflow, which uses gene families (non-redundant gene catalogs) to build protein families and annotate them functionally and taxonomically.
    • Input files for characterization
    • Demo run of MetaWIBELE-characterize

      $ metawibele characterize --input-sequence $INPUT_SEQUENCE --input-count $INPUT_COUNT --input-metadata $INPUT_METADATA --output $OUTPUT_DIR

      • Make sure the global configuration file "metawibele.cfg" is in your working directory.
      • $INPUT_SEQUENCE = the protein sequences file for gene families (non-redundant gene catalogs)
      • $INPUT_COUNT = the count file for gene families (non-redundant gene catalog)
      • $INPUT_METADATA = the metadata file
      • $OUTPUT_DIR = the output folder
      • Four main output files will be created where $BASENAME is the basename of output files provided in "metawibele.cfg":
        1. $OUTPUT_DIR/$BASENAME_proteinfamilies_annotation.tsv
        2. $OUTPUT_DIR/$BASENAME_proteinfamilies_annotation.attribute.tsv
        3. $OUTPUT_DIR/$BASENAME_proteinfamilies_annotation.taxonomy.tsv
        4. $OUTPUT_DIR/$BASENAME_proteinfamilies_nrm.tsv
      • Intermediate temp files will also be created:
        1. $OUTPUT_DIR/clustering/
          • a subfolder including the full outputs from protein family building
        2. $OUTPUT_DIR/global_homology_annotation/
          • a subfolder including the full outputs from global-homology-based annotation
        3. $OUTPUT_DIR/domain_motif_annotation/
          • a subfolder including the full outputs from domain-motif-based annotation
        4. $OUTPUT_DIR/abundance_annotation/
          • a subfolder including the full outputs from abundance-based annotation
  • Run MetaWIBELE-prioritize workflow, which ranks protein families (with a focus on uncharacterized, and thus particularly novel, families) by combining evidence from sample-specific feature ecology and host disease phenotypes or environmental parameters.
    • Input files for prioritization
    • Demo run of MetaWIBELE-prioritize

      $ metawibele prioritize --input-annotation $INPUT_ANNOTATION --input-attribute $INPUT_ATTRIBUTE --output $OUTPUT_DIR

      • Make sure the global configuration file "metawibele.cfg" is in your working directory.
      • $INPUT_ANNOTATION = the final annotation file produced by MetaWIBELE-characterize workflow
      • $INPUT_ATTRIBUTE = the final attribute file produced by MetaWIBELE-characterize workflow
      • $OUTPUT_DIR = the output folder
      • Three main output files will be created where $BASENAME is the basename of output files provided in "metawibele.cfg":
        1. $OUTPUT_DIR/$BASENAME_unsupervised_prioritization.rank.table.tsv
        2. $OUTPUT_DIR/$BASENAME_supervised_prioritization.rank.table.tsv
        3. $OUTPUT_DIR/$BASENAME_supervised_prioritization.rank.selected.table.tsv
  • Parallelization Options
    When running any workflow, you can add the following command-line options to make use of existing computing resources (a combined end-to-end example is given after this list):

    • --local-jobs <1> : Run multiple tasks locally in parallel. Provide the max number of tasks to run at once. The default is one task running at a time.
    • --grid-jobs <0> : Run multiple tasks on a grid in parallel. Provide the max number of grid jobs to run at once. The default is zero, meaning no tasks are submitted to a grid and all tasks run locally.
    • --grid <slurm> : Set the grid available on your machine (options: slurm and sge). This defaults to the grid detected on the machine.
    • --grid-partition <serial_requeue> : Jobs will be submitted to the selected partition. The default partition depends on the selected grid.
    • For additional workflow options, see the AnADAMA2 user manual.
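
A minimal end-to-end sketch combining the two demo runs above with local parallelization (the working directory and input file names are illustrative, and "metawibele.cfg" must already be customized and present in the working directory):

  • $ cd /path/to/my_project
  • $ metawibele characterize --input-sequence genecatalogs.centroid.faa --input-count genecatalogs_counts.all.tsv --input-metadata metadata.tsv --output characterization --local-jobs 4
  • $ metawibele prioritize --input-annotation characterization/$BASENAME_proteinfamilies_annotation.tsv --input-attribute characterization/$BASENAME_proteinfamilies_annotation.attribute.tsv --output prioritization --local-jobs 4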

Download MetaWIBELE Results

These are pre-computed MetaWIBELE prioritizations and characterization annotations for assemblies and gene families from the Integrative Human Microbiome Project (HMP2), Inflammatory Bowel Disease Multi'omics Database (IBDMDB). Prioritization indicates predicted bioactivity in the human gut during inflammatory bowel disease. For more information, see our paper "Identifying Novel Bioactive Microbial Gene Products in Inflammatory Bowel Disease." [in preparation].

Gene families assembled from 1595 metagenomes in HMP2

Characterization of protein families

Prioritization of protein families