ShortBRED: Short, Better Representative Extract Dataset
ShortBRED is a pipeline to take a set of protein sequences, group them into families, extract a set of distinctive strings ("markers"), and then search for these markers in metagenomic data and determine the presence and abundance of the protein families of interest.
For more information on the technical aspects of this program, or to cite ShortBRED, please use the following citation:
James Kaminski, Molly K. Gibson, Eric A. Franzosa, Nicola Segata, Gautam Dantas, and Curtis Huttenhower. Fast and Accurate Metagenomic Search with ShortBRED. (In progress)
ShortBRED can be downloaded from this link: Current Version of ShortBRED
You may also install ShortBRED using Mercurial:
$ hg clone https://bitbucket.org/biobakery/shortbred
ShortBRED utilizes a number of other programs that must be installed as prerequisites. These include:
We have created a set of ShortBRED markers to identify antibiotic resistance in the human microbiome. To use these markers to profile metagenomic data, please download the version that is appropriate for the average read size of your metagenome.
Centroids for Antibiotic Resistance Protein Sequences
(These are used for profiling long-read (450 bp), shallow data, such as that from a 454 machine.)
ShortBRED Reference Databases
ShortBRED requires a large set of protein sequences to use as a reference database. The program ShortBRED-Identify compares the input set of protein sequences to these and looks for regions of homology.
The BLAST database we created for the paper can be downloaded here.
We now recommend that users download one of the UniRef FASTA files of proteins and let ShortBRED-Identify construct a database from it. To do this, download one of the UniRef files and pass it to ShortBRED-Identify with the "--ref" option, for example "--ref uniref90.fasta".
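As a minimal sketch of how such a run could be set up, the Python snippet below assembles a ShortBRED-Identify command line with "--ref" pointing at a downloaded UniRef file. It uses only the flags that appear in the example commands on this page (--goi, --ref, --markers, --tmp). The helper name build_identify_command and every file name except uniref90.fasta are hypothetical and only serve as illustration.

```python
import shlex


def build_identify_command(goi, ref, markers, tmp,
                           script="./shortbred_identify.py"):
    """Return the argument list for a ShortBRED-Identify run.

    Only flags shown on this page are used; other settings are
    described in the ShortBRED documentation.
    """
    return [script, "--goi", goi, "--ref", ref,
            "--markers", markers, "--tmp", tmp]


cmd = build_identify_command(
    goi="my_proteins.faa",      # hypothetical proteins of interest
    ref="uniref90.fasta",       # downloaded UniRef FASTA file
    markers="my_markers.faa",   # hypothetical output file name
    tmp="tmp_identify",         # hypothetical temporary folder
)

# Print a shell-ready version of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
```

To actually launch the run from Python, the list can be passed to subprocess.run(cmd, check=True). Expect a run against a full UniRef file to take far longer than the small example data.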
Tutorial, Documentation, and Users' Group
The tutorial for ShortBRED is stored on Bitbucket.
The Bitbucket repository also contains a short overview of the software.
A users' group is also available at ShortBRED-users.
ShortBRED consists of two main scripts:
ShortBRED-Identify - This takes a FASTA file of amino acid sequences, clusters them into families, searches for overlap among the family consensus sequences and against a separate reference file of amino acid sequences, and then produces a FASTA file of markers from the unique sequence material.
ShortBRED-Quantify - This takes the FASTA file of markers and quantifies their relative abundance in a FASTA file of nucleotide metagenomic reads.
Proteins of Interest - These are the sequences one wishes to find in the metagenomic data, saved as a FASTA file of amino acid sequences. ShortBRED will cluster them into families, and then create short markers for the families.
Note: Using a database of more than 2,500 AA sequences will require several threads to run in a reasonable amount of time.
Reference Set of Proteins - ShortBRED compares your proteins of interest to this set of proteins, and eliminates short regions of overlap. ShortBRED removes these sections so that what remains is a distinctive sequence to represent the family.
Short Nucleotide Reads - This is the metagenomic dataset you wish to analyze, typically one or more FASTA files containing millions of short nucleotide reads.
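Because ShortBRED expects amino acid FASTA files for the proteins of interest and the reference set, but nucleotide FASTA files for the reads, a quick sanity check of each input can catch mix-ups early. The following is a rough sketch, not part of ShortBRED: the function names are hypothetical, the nucleotide test is a simple character-frequency heuristic with an arbitrarily chosen 0.9 threshold, and the 2,500-sequence flag merely mirrors the note above about thread usage.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def looks_like_nucleotides(seq, threshold=0.9):
    """Heuristic: True if at least `threshold` of the characters are A/C/G/T/U/N."""
    if not seq:
        return False
    hits = sum(1 for char in seq.upper() if char in "ACGTUN")
    return hits / len(seq) >= threshold


def summarize(lines, many=2500):
    """Count records, nucleotide-like records, and flag large inputs."""
    records = list(read_fasta(lines))
    nucleotide_like = sum(looks_like_nucleotides(seq) for _, seq in records)
    return {
        "records": len(records),
        "nucleotide_like": nucleotide_like,
        "many_sequences": len(records) > many,
    }


# Tiny made-up input: one protein and one nucleotide read.
example = [">prot1", "MKTLLVAGAV", "LSPQW", ">read1", "ACGTACGTTTGA"]
print(summarize(example))
```

For a real file, pass an open file handle instead of the list, for example summarize(open("example/input_prots.faa")). A protein file in which most records look like nucleotides, or the reverse for a reads file, suggests the wrong file was supplied.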
To create markers for the sample data included with ShortBRED, set your current working directory to the folder where you unpacked ShortBRED and type:
$ ./shortbred_identify.py --goi example/input_prots.faa --ref example/ref_prots.faa --markers mytestmarkers.faa --tmp example_identify
The sample data included with ShortBRED is quite small, so this command should run in less than a minute on a typical machine. It will create a set of markers ("mytestmarkers.faa") that you can open up and explore to get a sense of what typical ShortBRED-Identify output looks like.
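One way to explore the marker file is to count the markers and look at their lengths. The sketch below assumes only that the marker file is ordinary FASTA, as described above; the function names are hypothetical and nothing here depends on how ShortBRED names its markers.

```python
import os
import statistics


def marker_lengths(lines):
    """Map each FASTA header to the total length of its sequence."""
    lengths, current = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            current = line[1:]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def describe(lengths):
    """Return the marker count and the min / median / max marker length."""
    values = list(lengths.values())
    if not values:
        return {"markers": 0}
    return {
        "markers": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


# Only runs if the marker file from the command above is present.
path = "mytestmarkers.faa"
if os.path.exists(path):
    with open(path) as handle:
        print(describe(marker_lengths(handle)))
```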
There are many settings available. Please see the documentation for more details.
If you would like to test ShortBRED-Quantify using your new markers, enter the following command:
$ ./shortbred_quantify.py --markers mytestmarkers.faa --wgs example/wgs.fna --results exampleresults.txt --tmp example_quantify
This command should also run quickly, as there are only 100 nucleotide reads in example/wgs.fna. You can then open exampleresults.txt to see the ShortBRED counts for each protein family, which give the relative abundance of the protein families in the wgs data.
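To rank the families by abundance, the results file can be loaded with a few lines of Python. This sketch rests on assumptions that should be checked against your own output and the documentation: it treats the results as a tab-delimited table with a header row, and it assumes columns named "Family" and "Count". Both column names are parameters so they can be changed, and the sample values below are made up.

```python
import csv
import io


def read_counts(handle, family_col="Family", count_col="Count"):
    """Read a tab-delimited results table into {family: count}.

    The column names are assumptions; adjust them to match the header
    line of your results file.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[family_col]: float(row[count_col]) for row in reader}


def top_families(counts, n=5):
    """Return the n families with the highest counts, largest first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


# Made-up stand-in for a results file, for illustration only.
sample = io.StringIO("Family\tCount\nfamA\t0.0\nfamB\t12.5\nfamC\t3.0\n")
counts = read_counts(sample)
print(top_families(counts, n=2))
```

For real output, replace the in-memory sample with open("exampleresults.txt", newline="").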
As with ShortBRED-Identify, there are many settings available, which are described in more detail in the documentation.