The last decade has seen an explosion in high-throughput technologies, making it relatively straightforward to generate biomedical datasets of billions of data points in a short time. Such datasets often take longer to analyze than to generate. Since large-scale datasets will only become larger and more common in the future, it is essential that biomedical researchers today possess the skills to manipulate and analyze them, and that new computational and statistical methodology be developed to perform these analyses.
This tutorial will survey statistical methods and software to analyze diverse experimental data from a network point of view. We will focus on gene expression data and single gene perturbation screens, describing approaches to make the step from the parts list to the wiring diagram by using phenotypes for network inference and integrating them with complementary data sources. We will discuss scaling these techniques to very large data compendia (e.g. thousands of expression arrays) and statistical methods for mining the resulting inferred biological networks.
Upon completion of this course, participants should be able to:
- Perform detailed analyses of single microarray experiments using R/Bioconductor.
- Construct nested effects and other statistical models from single gene perturbation screens.
- Integrate large biological data collections, including hundreds or thousands of gene expression conditions or other experimental results.
- Perform basic network mining and functional mapping to interpret the resulting biological networks.
Content and instructional methods
Participants will become familiar with the statistical challenges that arise in expression data analysis, gene perturbation screens, large-scale data integration, and biological network mining. Algorithms will be introduced and discussed briefly, and we will give pointers to available implementations and software. The course will aim for breadth rather than depth, providing links to additional information where possible, and will focus on concepts that can be applied in the participants' own biomedical or statistical research.
The course covers the topics listed below through a mix of theory, examples, data analysis, and brief computing demonstrations.
- Introduction to gene expression microarrays;
- Introduction to expression data analysis in R/Bioconductor;
- Gene perturbation screens;
- Nested effects models;
- Introduction to experimental and functional data types/sources;
- Biological network models;
- Biological network and data integration;
- Functional mapping and large scale network mining.
Curtis Huttenhower is an Assistant Professor in the Biostatistics Department at the Harvard School of Public Health in Boston, MA. Dr. Huttenhower's methodological research focuses on large-scale data mining and network models, with biological applications in microbial population and metagenome characterization. Dr. Huttenhower's collaborations include experimentalists from the Human Microbiome Project, Nurses' Health Study, and Health Professionals Follow-up Study, providing a diversity of practical large-scale genomic data.
Florian Markowetz is a Group Leader at the Cancer Research UK Cambridge Research Institute in Cambridge, UK. Dr. Markowetz has extensive experience in analyzing gene perturbation screens and developing novel methodology to address the unique features of gene perturbation data. He works in several close collaborations to analyze gene perturbation screens, which have given him a broad overview of the challenges that arise in real-world studies and how to address them computationally.