MBG

AL Williams
Assoc Prof
BioMedInfo
MBB
CSS


 
           
Harvard
AB 1989
Cambridge
PhD 1993
Stanford
1993-1996
Yale Faculty
1997



Research Summary: Protein Bioinformatics
    As the 21st century unfolds, the biological sciences are being transformed by the advent of large-scale data. The sequencing of the human genome is a dramatic example. Alongside this increase in biological data, computers and computation have transformed the way information is handled, stored, and mined. These computational advances apply, of course, to many facets of life. The goal of my lab is to connect these two developments: harnessing computational advances for the analysis of large-scale biological data, principally by performing integrative surveys and systematic data mining.

    More specifically, we are focused on protein bioinformatics: understanding the structure, function, and evolution of proteins by analyzing populations of them in databases and in whole-genome experiments. Overall, we have four research foci, which follow a progression from surveying the overall genomic landscape, to analyzing individual proteins and their interactions in more detail, to zooming in on the chemical structure of specific molecules.


    1 Genomics: Mining and Annotating Intergenic Regions, especially in relation to Pseudogenes

    We are involved in a number of large-scale collaborations (e.g. ENCODE) to probe the activity of intergenic regions with tiling array technology. We have developed tools to design, score and interpret these arrays and to highlight particular array artifacts. The overall conclusion from this work is that much of the intergenic portion of the human genome appears to be active, both transcriptionally and in terms of protein binding. In connection with tiling array experiments, we have done extensive intergenic annotation, with a particular focus on mining intergenic regions for pseudogenes (protein fossils). We were, in fact, one of the first groups to perform comprehensive genome-wide surveys of pseudogenes in terms of protein families, which we did for human, worm, yeast and a number of other organisms. Collectively, our studies enable us to determine the common "pseudofolds" and "pseudofamilies" in various genomes and to address important evolutionary questions about the types of proteins that were present in an organism's past.


    2 Proteomics: Using Networks to Mine Functional Genomic Data and Understand Protein Function

    After the main elements of the human genome are identified, we need to characterize their function. We are trying to characterize gene function through molecular networks. We work on systematically integrating many weak functional genomic features with data mining techniques to predict protein networks (comprising protein interactions and other functional linkages). Some of the integrated features are obviously related to protein interactions (e.g. expression correlations), but many others, such as gene essentiality, are much less so. In addition, we have studied the structure of protein networks, both on a large scale in terms of global statistics (e.g. the diameter) and on a small scale in terms of local network motifs (e.g. hubs). In particular, we have correlated network hubs with gene essentiality. Most importantly, we extensively study the dynamics of networks, which has allowed us to show how a network changes dramatically under different conditions.
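    As a rough illustration of the two kinds of network statistics mentioned above, the following minimal sketch computes a global statistic (the diameter) and a local one (hubs) for a small undirected network. The toy network, the function names, and the degree threshold used to call a node a hub are all assumptions made for this example; they are not taken from the lab's tools or data, and the diameter routine assumes the network is connected.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def diameter(graph):
    """Longest shortest path between any two nodes (assumes a connected graph)."""
    return max(max(bfs_distances(graph, node).values()) for node in graph)


def hubs(graph, min_degree):
    """Nodes with at least min_degree neighbors (threshold is an assumption)."""
    return sorted(node for node, nbrs in graph.items() if len(nbrs) >= min_degree)


# Hypothetical toy interaction network, stored as an adjacency mapping.
toy_network = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "E"},
    "E": {"D"},
}

print(diameter(toy_network))   # global statistic
print(hubs(toy_network, 3))    # local statistic
```

    In a real analysis, the hub list would then be compared against an independent annotation such as gene essentiality, which is outside the scope of this sketch.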


    3 Structural Genomics: Analysis of Folds, Families and Functions on a Large Scale

    Another area of research in our lab is structural genomics. Here, we conceptualize proteins not purely as character sequences or abstract network nodes, but more in terms of their molecular structure. We have examined the large-scale relationships between sequence, structure and function in order to understand the extent to which structural and functional annotation can reliably be transferred between similar sequences, particularly when similarity is expressed in modern probabilistic language. We have related the occurrence of protein folds and families to phylogeny and deep evolutionary history. Our studies enabled us to recognize that particular folds are more common in certain organisms than in others. Finally, as part of our work on structural genomics, we relate the properties of proteins to their eventual success at being purified and structurally characterized. This work has been carried out within a database and decision-tree mining framework that we have built for the NESG structural genomics consortium.


    4 Computational Biophysics: Relating Macromolecular Motions and Packing

    The final area of focus in the lab is analyzing small populations of structures in terms of their detailed 3D geometry and physical properties. Here, we try to interpret macromolecular motions in terms of packing. We have set up a database of macromolecular motions and coupled it with simulation tools to interpolate between structural conformations; the database also has tools to predict likely motions based on simple models, such as normal modes and localized hinges connecting rigid domains. Part of this project involves devising a system for characterizing motions in a highly standardized fashion. Our motion classification scheme is motivated by the fact that protein interiors are packed exceedingly tightly, and the tight packing can greatly constrain a protein's mobility. We have developed tools for measuring and comparing the packing efficiency at different interfaces (e.g. inter-domain, protein surface, helix-helix, protein vs. RNA) using specialized geometric constructions (e.g. Voronoi polyhedra).
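    To make the idea of interpolating between structural conformations concrete, here is a minimal sketch that linearly interpolates atomic coordinates between a start and an end conformation. Straight-line Cartesian interpolation is used here only as the simplest possible stand-in; the summary does not specify how the lab's simulation tools work, and linear intermediates can be physically unrealistic (for example, distorted bond lengths). The function name and the toy coordinates are assumptions for this example.

```python
def interpolate_conformations(start, end, n_frames):
    """Return n_frames coordinate sets going linearly from start to end.

    start, end: lists of (x, y, z) tuples with atoms in the same order.
    Linear interpolation is an illustrative assumption, not the lab's method.
    """
    if len(start) != len(end):
        raise ValueError("conformations must contain the same number of atoms")
    if n_frames < 2:
        raise ValueError("need at least two frames (start and end)")

    frames = []
    for step in range(n_frames):
        t = step / (n_frames - 1)  # fraction of the way from start to end
        frame = [
            tuple(a + t * (b - a) for a, b in zip(p, q))
            for p, q in zip(start, end)
        ]
        frames.append(frame)
    return frames


# Hypothetical two-atom example: one atom moves, the other stays fixed.
closed_form = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
open_form = [(2.0, 4.0, 6.0), (1.0, 1.0, 1.0)]

for frame in interpolate_conformations(closed_form, open_form, 3):
    print(frame)
```

    The first and last frames reproduce the input conformations, and the intermediate frames lie on the straight line between them.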


    Summary & Broader Societal Issues

    In summary, my lab acts as a connector, bringing quantitative approaches from disciplines such as CS and applied math to bear on real questions and data in molecular biology. In particular, we have extensively applied classical computational approaches involving simulation, machine learning, and database design to biological problems. This often happens in the framework of practical, experimental collaborations, where we function as part of multi-disciplinary teams. Team participation is a key feature of the lab. Finally, as part of our mission to connect biology with computation, we have also extensively analyzed how a number of larger issues relating to computation in society affect biological research. In particular, we have examined how general aspects of e-publishing and digital libraries relate to biomedical databases and how various legal and security concerns significantly affect the interoperation of genomics databases.


