My research focuses on computational biology at the intersection of microbial community function and human health. The human body carries some four pounds of microbes, primarily in the gut, and understanding their biomolecular functions, their impact on human hosts, and the metabolic and functional roles of microbial communities generally is one of the key areas of study enabled by high-throughput sequencing. First, computational methods are needed to advance functional metagenomics. How can we understand what a microbial community is doing, what small molecule metabolites or signaling mechanisms it's employing, and how its function relates to its organismal composition? Second, our understanding of the human microbiome and its relationship with public health remains limited. Pathogens have been examined by centuries of microbiology and epidemiology, but we know relatively little about the transmission or heritability of the normal commensal microbiota, its carriage of pathogenic functionality, or its interaction with host immunity, environment, and genetics. Finally, more broadly, novel machine learning methodology is needed to leverage structured biological knowledge in high-dimensional genomic data analysis. In other words, we need to turn all of these new data into biology! My group works on a variety of computational methods for data mining in microbial communities, model organisms, pathogens, and the human genome.

You can explore the computational tools we have created through our Galaxy webserver and read more about them in our tutorials.

Areas of interest

My group includes a range of expertise on genomic data, biological network analysis, and bioinformatic methodology. We're interested both in developing new computational methods, particularly for microbial community analysis, and in applying these to the study of metagenomics, individual microbes and pathogens, and the human genome. Whenever possible, all of our studies are accompanied by tools to make the resulting methodology available to the community. Many of these goals are shared by other members of the Biological Sciences in Public Health program, the Broad Institute, and the Microbiome Research in the Boston Area (MiRiBA) consortium.

The human microbiome

The average human body contains some ten trillion cells, yet carries over ten times this many microbes in the gut, mouth, and on the surface of the skin - making all of us, remarkably, composite organisms that are only 10% "human". As a member of the Human Microbiome Project, I'm particularly excited about characterizing function and metabolism in these host-associated communities, a problem we've begun to solve with the HUMAnN method for metagenomic functional reconstruction. This has allowed us to look at niche specialization among microbes resident at different body sites (there's a lot), functional variation among different human hosts (not nearly as much), and the influences of host diet and environment over time (ongoing). These are all questions that arise even in the healthy human microbiome in the absence of overt disease, the answers to which will help us to develop microbial predictors of disease progression and to understand how to target the microbiome therapeutically using probiotics or pharmaceuticals.

Basic biology of model microbes and pathogens

S. cerevisiae is one of the best-studied, simplest model organisms, yet ~15% of its proteome remains uncharacterized. Needless to say, our understanding of biomolecular roles and protein function in experimentally intractable pathogens and in uncultured environmental microbial isolates lags considerably. Up to 75% of the human gut metagenome is uncharacterized, as is ~2/3 of the malaria parasite P. falciparum's genome. We are actively developing methodology for high-throughput microbial characterization, and applying it to pathogens including P. falciparum in collaboration with Matt Marti and to M. tuberculosis with Sarah Fortune. The eventual goal of this research is to computationally share information among functional experimental results from a wide variety of model microbes, using them to predict function in individual pathogens and in entire communities.

Gastrointestinal disease and cancer

The oral and gut microbiota are among the most populous and diverse niches of the human body, and both are implicated in conditions including periodontitis, ulcerative colitis, Crohn's disease, and colorectal cancer. We have ongoing projects to characterize the basic microbiology of the GI microbiota with collaborators Jacques Izard and Katherine Lemon, to link microbial function with host genetics in inflammatory bowel disease with Ramnik Xavier, to assess dysbioses in colorectal cancer with Shuji Ogino and Matthew Meyerson, and to investigate the detailed molecular mechanisms of host/microbial immune interaction with Wendy Garrett. My long-term hope is to see these studies link each of these diseases with specific dysbioses that can be used as early detection biomarkers or as targets for probiotics or pharmaceuticals.

Human functional genomics

My group has developed and implemented a probabilistic system integrating ~30,000 publicly available experimental results (>100GB of data) to predict protein function, functional relationships, cross-talk among pathways and processes, and disease involvement in human beings. This involved new methods for the exploration and statistical analysis of large, dense, weighted graphs, in addition to solving the machine learning and software engineering challenges of efficiently processing this amount of data. The resulting process of functional mapping can be applied in a variety of biological settings to direct experimenters to under-annotated functional areas or to discover functional similarities among genomic datasets. A system incorporating these algorithms, called HEFalMp (Human Experimental/Functional Mapper), provides a web site through which biologists can query genes, processes, and diseases of specific interest. In collaboration with Hilary Coller, we have confirmed the resulting predicted involvements of several proteins in the process of autophagy in human fibroblasts.

Software for functional genomics and metagenomics

We strongly believe that bioinformatic research is most useful when it results in a tool that can be freely applied to new data and experiments. To that end, almost all of our publications are accompanied by online and/or open source software as linked above, many relying on the Galaxy web application framework or the SCons tool to enable reproducible research. In collaboration with Olga Troyanskaya, my lab has also developed and documented the Sleipnir library for functional genomics, currently the only public software for efficient manipulation and machine learning from very large collections of genomic data. This C++ library includes both basic utilities (parallelization, database management, generative and discriminative machine learning, etc.) and biological modeling (representations of biological networks, genes, gene sets, interactions, functional catalogs, expression data, etc.) with a focus on integrating and learning from large, diverse biological datasets.

Completed projects

Linear models of gene expression

I developed a statistical linear model that describes the S. cerevisiae transcriptional response to changes in cellular growth rate, in collaboration with Edo Airoldi (Harvard Statistics) and David Botstein (Princeton Mol. Bio.). In addition to describing which portions of the genome are regulated with respect to growth rate, this model can be applied to new microarray data to predict the growth rate of the originating culture. This allows the inference of growth rates at instantaneous time scales not measurable by standard experimental techniques. The model is also robust to changes in growth conditions, microarray platform, and organism: I have successfully applied it to S. bayanus and Schz. pombe as well. I am currently working with Maitreya Dunham (U. Washington Genome Sci.) to create a more sophisticated computational model to capture the changes in gene regulation induced by aneuploidy.
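
As a rough illustration of the general idea only, and not the published model, the sketch below fits an independent straight line per gene (expression as a function of growth rate) on cultures with known growth rates, then inverts those fits by least squares to estimate the growth rate of a new sample. The function names, the toy data, and the choice of ordinary least squares are all assumptions made for this example.

```python
# Hypothetical sketch: per-gene linear fits, then least-squares inversion.
# None of these names or numbers come from the actual growth-rate model.

def fit_gene_models(growth_rates, expression):
    """Fit expression[g][i] ~ intercept_g + slope_g * growth_rates[i] for each gene g.

    growth_rates: known growth rates, one per training culture.
    expression:   one list per gene, holding that gene's value in each culture.
    Returns a list of (intercept, slope) pairs, one per gene.
    """
    n = len(growth_rates)
    mean_r = sum(growth_rates) / n
    var_r = sum((r - mean_r) ** 2 for r in growth_rates)
    models = []
    for gene_values in expression:
        mean_e = sum(gene_values) / n
        cov = sum((r - mean_r) * (e - mean_e)
                  for r, e in zip(growth_rates, gene_values))
        slope = cov / var_r
        models.append((mean_e - slope * mean_r, slope))
    return models


def predict_growth_rate(models, sample):
    """Estimate the growth rate that best explains one new expression profile.

    Minimizes sum_g (sample[g] - intercept_g - slope_g * r)^2 over r, so genes
    with larger slopes (more growth-responsive) carry more weight.
    """
    numerator = sum(slope * (x - intercept)
                    for (intercept, slope), x in zip(models, sample))
    denominator = sum(slope ** 2 for _, slope in models)
    return numerator / denominator


# Toy, noise-free training data: three genes measured in four cultures.
rates = [0.05, 0.10, 0.20, 0.30]
training = [
    [1 + 10 * r for r in rates],  # expression rises with growth rate
    [4 - 5 * r for r in rates],   # expression falls with growth rate
    [2.0 for _ in rates],         # growth-independent gene
]
models = fit_gene_models(rates, training)

# A new culture whose growth rate (0.15 here) was not measured directly.
new_sample = [1 + 10 * 0.15, 4 - 5 * 0.15, 2.0]
estimate = predict_growth_rate(models, new_sample)
print(f"estimated growth rate: {estimate:.3f}")
```

In this toy setting the growth-independent gene has a slope of zero and so contributes nothing to the estimate, which mirrors the point that only growth-regulated portions of the genome are informative for the prediction.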

Integrating computation and experimentation

One of the major opportunities in bioinformatics is the closer integration of functional predictions with rigorous experimental follow-up; many computational predictions are made, but only a small fraction of them are definitively confirmed in the laboratory. In collaboration with Amy Caudy (Princeton Lewis-Sigler), Chad Myers (U. Minnesota Comp. Sci.), Matthew Hibbs (Jackson Labs), David Hess (Princeton Lewis-Sigler), and others, we have integrated a collection of computational function prediction methods and experimentally verified the involvement of nearly 100 new proteins in the process of mitochondrial inheritance in yeast. My group specifically performed a study of the implications these results have for the field of computational protein function prediction. Based on the success of this study and the reliability with which we applied computational results to laboratory investigations, I am eager to see this work applied in other organisms and biological areas through targeted experimental collaborations.