FUGAsseM


FUGAsseM (Function predictor of Uncharacterized Gene products by Assessing high-dimensional community data in Microbiome) is a computational tool based on a “guilt by association” approach to predict functions of novel gene products in the context of microbial communities. It uses machine learning methods to predict functions of microbial proteins by integrating multiple types of community-wide data.

User manual || Tutorial || Forum

Citing FUGAsseM:

A manuscript describing FUGAsseM is currently in preparation:

Yancong Zhang, Amrisha Bhosle, Sena Bae, Kelly Eckenrode, Xueying (Sonia) Huang, Jingjing Tang, Danylo Lavrentovich, Lana Awad, Ji Hua, Xochitl C. Morgan, Andy Krueger, Wendy S. Garrett, Eric A. Franzosa*, Curtis Huttenhower*. "Predicting functions of uncharacterized gene products in microbiomes" [In preparation].

In the meantime, if you use FUGAsseM in your work, please include the software link in your Methods section:

http://199.94.60.28/fugassem

For more detailed information about the software, see the FUGAsseM User Manual and the FUGAsseM Tutorial.

Overview

FUGAsseM uses a “guilt-by-association” approach organized in two layers. In the first layer, it builds an individual classifier for each data type, which lets the method weight each type of evidence separately. In the second layer, it builds an ensemble classifier that integrates the weighted learning results from the first layer. This layered learning and prediction process integrates different sources of functional information while simultaneously assigning a weight to each data type for the final predictions.
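
The two-layer idea described above can be illustrated with a toy sketch. This is not FUGAsseM's implementation: the nearest-centroid scorer, the accuracy-based weights, the weighted average used as the second layer, and all function names and data values are assumptions made up for illustration.

```python
from statistics import mean


def fit_centroid_scorer(features, labels):
    """First layer (toy): score one evidence type in [0, 1].

    A value scores high when it lies closer to the mean of the annotated
    positives than to the mean of the annotated negatives.
    """
    pos = mean(x for x, y in zip(features, labels) if y == 1)
    neg = mean(x for x, y in zip(features, labels) if y == 0)

    def score(x):
        d_pos, d_neg = abs(x - pos), abs(x - neg)
        if d_pos + d_neg == 0:
            return 0.5  # centroids coincide, no information
        return d_neg / (d_pos + d_neg)

    return score


def accuracy(scorer, features, labels):
    """Fraction of training examples the scorer classifies correctly at 0.5."""
    hits = sum((scorer(x) >= 0.5) == (y == 1) for x, y in zip(features, labels))
    return hits / len(labels)


def fit_two_layer(evidence, labels):
    """Second layer (toy): weight each evidence type by first-layer accuracy.

    evidence maps an evidence name to a list of values aligned with labels.
    Assumes at least one evidence type has non-zero accuracy.
    """
    scorers = {n: fit_centroid_scorer(v, labels) for n, v in evidence.items()}
    raw = {n: accuracy(scorers[n], v, labels) for n, v in evidence.items()}
    total = sum(raw.values())
    weights = {n: w / total for n, w in raw.items()}

    def predict(sample):
        # Weighted average of the per-evidence scores.
        return sum(weights[n] * scorers[n](sample[n]) for n in scorers)

    return weights, predict


# Hypothetical data: four annotated proteins, two evidence types.
labels = [1, 1, 0, 0]
evidence = {
    "coexpression": [0.9, 0.8, 0.2, 0.1],  # informative evidence
    "noise": [0.5, 0.1, 0.6, 0.2],         # uninformative evidence
}
weights, predict = fit_two_layer(evidence, labels)
print(weights)
print(predict({"coexpression": 0.9, "noise": 0.5}))
```

In this toy setup the informative evidence type receives the larger weight, which mirrors the evidence-weighting behavior described above without claiming anything about the actual classifiers FUGAsseM trains.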

Installation

Requirements

  1. Python (version >= 3.7; tested with 3.7), with the numpy, pandas, multiprocessing, sklearn, matplotlib, scipy, goatools, and statistics Python packages
  2. AnADAMA2 (version >= 0.8.0; tested 0.8.0)
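
As a quick sanity check before installing, the requirements above can be probed from Python. This is a minimal sketch: the import names are taken from the list above, "anadama2" is assumed to be the import name of AnADAMA2, and the helper name is hypothetical. It only checks that modules can be found; it does not verify package versions.

```python
import importlib.util
import sys

# Import names assumed from the requirements list; "anadama2" is an
# assumption for the AnADAMA2 package.
REQUIRED_MODULES = [
    "numpy", "pandas", "multiprocessing", "sklearn", "matplotlib",
    "scipy", "goatools", "statistics", "anadama2",
]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if sys.version_info < (3, 7):
    print("Python >= 3.7 is required, found", sys.version.split()[0])

missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", ", ".join(missing) if missing else "none")
```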

Install FUGAsseM

Use any one of the following options to install the FUGAsseM package.

Option 1: Installing with conda

  • $ conda install -c biobakery fugassem

Option 2: Installing with pip

  • $ pip install fugassem
  • If you do not have write permissions to /usr/lib/, add the --user option to the install command. This installs the Python package into subdirectories of ~/.local/. On some platforms, you might also need to add ~/.local/bin/ to your $PATH, as it might not be included by default. You will know it needs to be added if you see the message fugassem: command not found when trying to run FUGAsseM after installing with the --user option.
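
The PATH situation described above can be diagnosed with a short script. This is a sketch using only the Python standard library; the helper name user_bin_on_path is hypothetical, and the script only reports what it finds without changing your environment.

```python
import os
import shutil
from pathlib import Path


def user_bin_on_path(path_value):
    """Return True if ~/.local/bin is one of the entries of a PATH string."""
    user_bin = Path.home() / ".local" / "bin"
    entries = [Path(p).expanduser() for p in path_value.split(os.pathsep) if p]
    return user_bin in entries


location = shutil.which("fugassem")
if location is not None:
    print("fugassem found at", location)
elif user_bin_on_path(os.environ.get("PATH", "")):
    print("fugassem not found, although ~/.local/bin is on PATH")
else:
    print("fugassem not found; consider adding ~/.local/bin to PATH")
```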

How to run

FUGAsseM is typically run once per dataset.

  • For a list of command line options, run:

    $ fugassem --help

    This command yields:

    usage: fugassem_workflow.py [-h] [--version]
                        [--taxon-level {MSP,Terminal,Species,Genus,Family,
                        Order,Class,Phylum}]
            ...
  • Run the canonical function prediction of FUGAsseM, which requires MTX abundance data and annotation of protein families for function prediction.

    • Input files:

    • Demo run of FUGAsseM-MTX model

      $ fugassem --basename $BASENAME \
      --input $INPUT_MTX \
      --input-annotation $INPUT_annotation \
      --output $OUTPUT_DIR
      • $INPUT_MTX = the protein families MTX abundances file (TSV format)
      • $INPUT_annotation = raw GO annotations for some of these protein families (TSV format)
      • $OUTPUT_DIR = the output folder
      • Output files will be created named with $BASENAME:
        1. $OUTPUT_DIR/merged/$BASENAME.finalized_ML.prediction.tsv: this file combines the finalized predictions from all taxa by using machine learning approaches based on MTX coexpression patterns (TSV format).
        2. Predictions files of each taxon will also be created:
          • The finalized prediction results using MTX-coexpression evidence per taxon are in the file: $OUTPUT_DIR/main/$TAXON_NAME/prediction/finalized/$BASENAME.$TAXON_NAME.finalized_ML.prediction.tsv.
  • Run the integrated function prediction workflow of FUGAsseM. When other community-wide data are available, FUGAsseM can predict functions by integrating multiple pieces of evidence. The additional steps in this workflow are 1) building individual machine learning classifiers for each type of evidence, including coexpression as discussed above, and 2) integrating them to generate an ensemble classifier for final function prediction.

    • Input files

    • Demo run of FUGAsseM-full model

      $ fugassem --basename $BASENAME \
      --input $INPUT_MTX \
      --input-annotation $INPUT_annotation \
      --vector-list $VECTOR_list --matrix-list $MATRIX_list \
      --output $OUTPUT_DIR
      • $INPUT_MTX = the protein families MTX abundances file (TSV format)
      • $INPUT_annotation = raw GO annotations for some of these protein families (TSV format)
      • $VECTOR_list = file names of vector-based evidence data, provided as a string of 'file1,file2', semi-colon delimited for multiple files.
      • $MATRIX_list = file names of matrix-based evidence data, provided as a string of 'file1,file2', semi-colon delimited for multiple files.
      • $OUTPUT_DIR = the output folder
      • Output files will be created named with $BASENAME:
        1. $OUTPUT_DIR/merged/$BASENAME.finalized_ML.prediction.tsv: this file combines the finalized predictions from all taxa by using machine learning approaches based on the integrated evidence (TSV format).
        2. $OUTPUT_DIR/merged/$BASENAME.$EVIDENCE_TYPE_ML.prediction.tsv (where $EVIDENCE_TYPE = the basename of each piece of evidence): this file contains the combined predictions based on each individual type of evidence (TSV format).
        3. Predictions files of each taxon will also be created:
          • FUGAsseM predicts functions based on input evidence data.
          • The finalized prediction results using integrated evidence per taxon are in the file: $OUTPUT_DIR/main/$TAXON_NAME/prediction/finalized/$BASENAME.$TAXON_NAME.finalized_ML.prediction.tsv.
          • The prediction results based on each individual type of evidence per taxon are in the file: $OUTPUT_DIR/$TAXON_NAME/prediction/$EVIDENCE_TYPE/$BASENAME.$TAXON_NAME.$EVIDENCE_TYPE_ML.prediction.tsv (where $EVIDENCE_TYPE = the basename of each piece of evidence).
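
When scripting runs over several datasets, the commands and output locations described above can be assembled programmatically. The following is a sketch under stated assumptions: it uses only the command line options shown above (--basename, --input, --input-annotation, --vector-list, --matrix-list, --output), passes the evidence lists through as already-formatted strings, and follows the documented path patterns for the finalized predictions. The function names, file names, and the taxon name are hypothetical.

```python
import shlex
from pathlib import Path


def build_fugassem_command(basename, input_mtx, input_annotation, output_dir,
                           vector_list=None, matrix_list=None):
    """Assemble the argument list for an MTX-only or full-model run.

    vector_list and matrix_list are passed through unchanged, so format
    them as the documentation for those options requires.
    """
    cmd = [
        "fugassem",
        "--basename", basename,
        "--input", str(input_mtx),
        "--input-annotation", str(input_annotation),
    ]
    if vector_list:
        cmd += ["--vector-list", vector_list]
    if matrix_list:
        cmd += ["--matrix-list", matrix_list]
    cmd += ["--output", str(output_dir)]
    return cmd


def merged_prediction_path(output_dir, basename):
    """Path of the merged finalized predictions across all taxa."""
    return Path(output_dir) / "merged" / f"{basename}.finalized_ML.prediction.tsv"


def taxon_prediction_path(output_dir, basename, taxon):
    """Path of the finalized predictions for a single taxon."""
    return (Path(output_dir) / "main" / taxon / "prediction" / "finalized"
            / f"{basename}.{taxon}.finalized_ML.prediction.tsv")


# Hypothetical file names, for illustration only.
cmd = build_fugassem_command("demo", "mtx.tsv", "go.tsv", "out")
print(" ".join(shlex.quote(arg) for arg in cmd))
print(merged_prediction_path("out", "demo"))
print(taxon_prediction_path("out", "demo", "taxonA"))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once FUGAsseM is installed. Building the command as a list avoids shell quoting problems with file names that contain spaces.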