Automated Repository Acquisition
ARepA (Automated Repository Acquisition) is a command-line tool that fetches ’omics data from multiple heterogeneous repositories and processes them in a standardized way.
For more information on the technical aspects, see:
Daniela Börnigen*, Yo Sup Moon*, Gholamali Rahnavard, Levi Waldron, Lauren McIver, Afrah Shafquat, Eric Franzosa, Larissa Miropolsky, Christopher Sweeney, Xochitl Morgan, Wendy S. Garrett, and Curtis Huttenhower. “A reproducible approach to high-throughput biological data acquisition and integration.” PeerJ. 2015; 3: e791. (* contributed equally)
Its main features include, but are not limited to:
- Gene ID standardization: all output gene identifiers can be translated to a single naming convention. Supported identifiers include Gene Symbols, UniProt, UniRef, Entrez Gene, KEGG Orthologs, and more.
- File updates on an as-needed basis: ARepA only reruns processes that are necessary to the building of a file. It keeps track of where you left off, so you don’t have to waste valuable computational resources!
- File standardization: data are saved in tab-delimited text format, and metadata are saved as Python pickle objects. For some modules, automatically generated R packages are produced as a final output.
- Modular design: other repositories can be built on top of the existing architecture. ARepA can be used as an all-purpose data mining tool!
- Currently, ARepA fetches data from seven repositories: Bacteriome, RegulonDB, STRING, BioGRID, MPIDB, GEO, and IntAct.
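Because the output is tab-delimited text plus pickled metadata, it can be read with the Python standard library alone. The following is a minimal sketch written for Python 3, not ARepA's own code: the file names, column names, and metadata fields are hypothetical stand-ins, and the sketch writes its own tiny example files so that it runs without an ARepA installation.

```python
import csv
import os
import pickle
import tempfile


def read_table(path):
    """Read a tab-delimited file into a header row and a list of data rows."""
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    return rows[0], rows[1:]


def read_metadata(path):
    """Load a pickled metadata object (only unpickle files you trust)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Build a tiny stand-in dataset so the sketch runs on its own.
# File names and fields are hypothetical, not ARepA's real output names.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "example_data.txt")
meta_path = os.path.join(workdir, "example_metadata.pkl")

with open(data_path, "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["gene", "sample1", "sample2"])
    writer.writerow(["geneA", "1.5", "2.0"])
    writer.writerow(["geneB", "0.3", "0.7"])

with open(meta_path, "wb") as handle:
    pickle.dump({"title": "example dataset", "platform": "unknown"}, handle)

header, rows = read_table(data_path)
metadata = read_metadata(meta_path)
print(header)
print(len(rows), "data rows; metadata keys:", sorted(metadata))
```

Pickle files can execute arbitrary code when loaded, so metadata files should only be unpickled when they come from a run you trust.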
Supported Operating Systems
ARepA was fully tested on and is supported by the following platform(s):
- Mac OS X (>= 10.7.4)
The following platform(s) is(are) NOT supported:
- Windows (>= XP)
The following platform(s) has(have) NOT been tested:
NB: It is highly recommended that ARepA be run on a server rather than on a local machine when it is used for large batch jobs. Certain processes (such as missing value imputation and functional network construction for GEO) can be CPU-intensive and are not suitable for laptop set-ups. Other processes, such as running custom pipelines to analyze the data once it has been downloaded, are both suitable and convenient to run on smaller-scale local environments.
Before downloading ARepA, you should have the following software on your machine:
- Python (ver 2.7.x)
- SCons (ver >= 2.1)
- R (ver >= 2.13) with the GEOquery package v3.0, arrayQualityMetrics, and the affy package (all available through Bioconductor)
- Java SE 6 (ver >= 1.6): Java is needed for the gene identifier conversion service
- Apache Ant (ver >= 1.8.0)
- Subversion Source Control Management (ver >= 1.7): for automated acquisition of BridgeDB
- Sleipnir Library for Computational Functional Genomics (Optional, but necessary for some normalization steps)
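A quick way to see which of the command-line prerequisites above are already on your PATH is a short Python 3 check such as the sketch below. It is not part of ARepA. The executable names (python, scons, R, java, ant, svn) are assumptions about how these tools are usually installed and may differ on your system (for example, python3 instead of python). The check does not verify version numbers, R packages, or Sleipnir.

```python
import shutil

# Assumed executable names for the prerequisites listed above;
# adjust them if your system installs the tools under different names.
REQUIRED_TOOLS = ["python", "scons", "R", "java", "ant", "svn"]


def find_tools(names):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names):
    """Return the names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


for tool, location in find_tools(REQUIRED_TOOLS).items():
    print("{:8s} {}".format(tool, location or "NOT FOUND"))

not_found = missing_tools(REQUIRED_TOOLS)
if not_found:
    print("Missing from PATH:", ", ".join(not_found))
```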
Contact and Support
All questions should be directed to the arepa-users Google Group.
- Added version logging for all command line executables and scripts.
- Implemented Python 3 compatibility (in addition to Python 2.7), modulo the next release of SCons (see details below).
- Calculate and store MD5 checksums for all data files (in the associated metadata) to enable data validation between runs and between environments.
- Call arrayQualityMetrics for all expression datasets to provide an automatically generated report on data quality measures.
- April 11th, 2013 – ARepA version 0.9.7 released
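The MD5-based validation mentioned in the change list above can be illustrated with Python's hashlib. This is a minimal sketch of the general technique, not ARepA's implementation: the metadata key "md5", the helper names, and the example file are all hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, metadata, key="md5"):
    """Compare a file's current digest with the one stored in its metadata."""
    return metadata.get(key) == md5_of_file(path)


# Hypothetical data file and metadata record, created only for this demo.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "example_data.txt")
with open(data_path, "wb") as handle:
    handle.write(b"geneA\t1.5\ngeneB\t0.3\n")

metadata = {"md5": md5_of_file(data_path)}    # record checksum at build time
valid_before = is_unchanged(data_path, metadata)

with open(data_path, "ab") as handle:          # simulate a later modification
    handle.write(b"geneC\t9.9\n")
valid_after = is_unchanged(data_path, metadata)

print("valid before change:", valid_before)
print("valid after change:", valid_after)
```

MD5 is adequate for detecting accidental changes or corrupted transfers, but it should not be relied on to detect deliberate tampering.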
This software is licensed under the MIT license.
Copyright (c) 2013 Yo Sup Moon, Daniela Boernigen, Levi Waldron, Eric Franzosa, Xochitl Morgan, and Curtis Huttenhower
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.