As we move into a new century, the human genome and the genomes of a number of other organisms, comprising billions of basepairs, have been completely or almost completely sequenced. The number of known structures of protein domains, which provide the primary way to interpret gene sequences in physico-chemical terms, is more than 25,000. Moreover, for each gene in the genome, a variety of functional genomics experiments, such as DNA microarrays, two-hybrid and gene-disruption experiments, provide large amounts of standardized information on expression, interaction and function. Remarkably, with the simultaneous advance of computer technology, all of this information can fit easily onto the hard drive of a laptop computer. However, interpreting it will require new approaches.

Broadly, the goal of our laboratory is to take advantage of this data deluge, by performing integrative surveys and systematic data mining on the expanding amount of biological information. It is hoped that these analyses will allow us to address a number of overall statistical questions about the properties and structures of macromolecules, their functions in the cell, and their differential distribution in different organisms. More specifically, since the lab was set up, we have made several contributions related to comparative genomics (analyzing the occurrence of protein folds and families in recently sequenced genomes), macromolecular motions (developing a database framework for classifying motions in terms of packing), and expression analysis (relating gene expression to protein subcellular localization). The ongoing research program in the lab extends and expands previous work as described below.

Our work is fundamentally data-driven and very different in conception from previous computational work related to macromolecules, which often concentrated on describing the physical process of protein folding or predicting the 3D structure of a protein given its amino-acid sequence. Furthermore, our work, which loosely falls into the emerging discipline of bioinformatics, is interdisciplinary in character, combining questions drawn from biology and chemistry with quantitative approaches from computer science and statistics.

A. Analysis of Structures

1. The Finite Parts List of Protein Folds

The starting point for our analysis is the concept of the finite list of molecular biological parts. Protein folds illustrate this concept most directly: it is believed that there is a large but limited number of protein folds (estimated to be about 1000), and a library of these folds represents a sort of master parts list for biology, listing all the important molecular building blocks in an organized fashion. To build such a library, one needs some statistical or heuristic definition of what a fold is, a way of clustering together all the structures with a given fold, and intelligent techniques for matching up sequences of unknown structure with those of known structure. We are working on a number of these topics -- on the one hand, building up our own classifications of structures and, on the other hand, integrating structural classifications developed by others (especially SCOP) into our own analyses. In particular, we have developed a way to automatically create multiple alignments of protein structures and fuse them into consensus “fold templates”. We have also worked extensively on developing scoring schemes for assessing the statistical significance of a given structure-to-structure match.

2. Classification of the Flexibility of a Fold in a Web Database

One important aspect of the fold library is its use in comprehensively surveying protein flexibility and conformational variability -- learning how much each part in the master parts list can vary in shape. (Variability occurs when two structures in the library share the same fold but still have substantial conformational differences, such as the disposition of an active site loop.)

We are arranging all instances of conformational variability into a web-accessible database. Part of this project involves devising a system for characterizing a protein motion in a highly standardized fashion, and we have developed a web server that, given two coordinate sets, automatically does this (producing “morph movies” as a by-product). Our classification of motions is based on packing. Motions are identified as shear or hinge, based on whether or not a well-packed interface is maintained between the mobile elements throughout the motion.
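The shear/hinge distinction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used by our server: it treats "a well-packed interface is maintained" as "most atom-atom contacts between the two mobile elements survive", and it compares only the start and end conformations rather than the whole motion. The function names, the 4.5 angstrom contact cutoff and the 0.7 preservation threshold are all hypothetical choices made for the example.

```python
def interface_contacts(coords_a, coords_b, cutoff=4.5):
    """Return the set of (i, j) index pairs, one atom from each mobile
    element, whose centers lie within `cutoff` angstroms of each other.
    The cutoff is an assumed value chosen for illustration."""
    contacts = set()
    for i, (xa, ya, za) in enumerate(coords_a):
        for j, (xb, yb, zb) in enumerate(coords_b):
            d2 = (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
            if d2 <= cutoff ** 2:
                contacts.add((i, j))
    return contacts


def classify_motion(start, end, cutoff=4.5, min_preserved=0.7):
    """Label a motion 'shear' if the interface between the two mobile
    elements is largely maintained, otherwise 'hinge'.

    start, end: (coords_a, coords_b) tuples for the two conformations,
    with atoms listed in the same order in both.
    min_preserved: assumed fraction of starting contacts that must
    survive for the interface to count as maintained.
    """
    before = interface_contacts(start[0], start[1], cutoff)
    after = interface_contacts(end[0], end[1], cutoff)
    if not before:
        return "hinge"  # no interface to maintain in the first place
    preserved = len(before & after) / len(before)
    return "shear" if preserved >= min_preserved else "hinge"
```

A real classification would need a proper packing measure at the interface and intermediate conformations; the contact count here is only a stand-in for that idea.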

3. Measurement of Packing with Volume and Surface Calculations

This scheme is motivated by the fact that protein interiors are packed exceedingly tightly, and this tight packing at most internal interfaces greatly constrains the way proteins can move. Our past research has involved measuring the packing efficiency at a few different interfaces (e.g. interdomain, protein surface) using specialized geometric constructions, known as Voronoi polyhedra and Delaunay triangulations, in conjunction with limited amounts of molecular simulation. We have recently developed a new parameter set for these calculations, which we hope may be of general use. This parameter set includes a self-consistent set of van der Waals (VDW) radii and standard volumes for each type of protein atom. Future plans include developing software tools to automatically and rapidly scan the packing at a great variety of interfaces (e.g. helix-helix, interdomain, sheet-helix) and using these to enlarge the motions database and associated classification. We also hope to integrate membrane proteins and nucleic acids more fully into our packing and motions calculations.

B. Analysis of Sequences, Focusing on Comparative Genomics

1. Global Surveys of the Occurrence of Folds and Other “Protein Parts” in Genomes

As whole genomes are sequenced and more structures are determined, it will become possible to characterize a substantial fraction of the folds used in a given organism -- statistically, in the sense of a population census. This will allow us to see whether certain folds are more common in certain organisms than in others. We have carried out a number of initial surveys and found that some folds, such as TIM-barrels, recur often in every (analyzed) genome, while other folds are missing from certain genomes. Our analyses have also found many global, statistical differences between protein folds from different phylogenetic groups -- e.g. all-beta proteins are longer and more numerous in eukaryotes than in prokaryotes.
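The census idea can be sketched as simple counting. This is a minimal illustration, assuming that each gene matched to a known structure has already been given a single fold label; the function names and the input layout (a dictionary from genome name to a list of fold labels) are hypothetical.

```python
from collections import Counter


def fold_census(assignments):
    """Count how often each fold occurs in each genome.

    assignments: dict mapping a genome name to a list of fold labels,
    one label per gene that could be matched to a known structure.
    """
    return {genome: Counter(folds) for genome, folds in assignments.items()}


def universal_folds(census):
    """Folds that occur at least once in every surveyed genome."""
    fold_sets = [set(counts) for counts in census.values()]
    return set.intersection(*fold_sets) if fold_sets else set()


def missing_folds(census):
    """For each genome, the folds seen elsewhere in the survey but not in it."""
    all_folds = set().union(*(set(counts) for counts in census.values()))
    return {genome: all_folds - set(counts) for genome, counts in census.items()}
```

Counts from genomes of different sizes are not directly comparable, so a real survey would normalize them (for example by the number of assigned genes) before comparing organisms.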

More broadly, the idea of surveying “parts in genomes” can be readily extended to other elements besides folds, such as orthologs, sequence families, and motifs, and even to protein features, such as amino acid composition. We are doing large-scale surveys involving these elements. In particular, we are working on constructing whole-genome phylogenies, with the distances between organisms defined in terms of the presence or absence of specific "parts" in the whole genome. This is in contrast to traditional phylogenies, which cluster organisms solely on the sequence similarity of individual genes.
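A distance between organisms defined by the presence or absence of parts can be sketched as follows. This is only one possible choice: the example uses the Jaccard distance on sets of parts as an assumed stand-in, and the function names and input layout are hypothetical rather than a description of our actual method.

```python
from itertools import combinations


def jaccard_distance(parts_a, parts_b):
    """1 minus (shared parts / all parts seen in either genome)."""
    union = parts_a | parts_b
    if not union:
        return 0.0  # two empty part lists are treated as identical
    return 1.0 - len(parts_a & parts_b) / len(union)


def distance_matrix(genome_parts):
    """Pairwise distances between genomes.

    genome_parts: dict mapping a genome name to the set of parts
    (folds, families, motifs, ...) found anywhere in that genome.
    Returns a dict keyed by (name_a, name_b) in both orders.
    """
    names = sorted(genome_parts)
    dist = {(name, name): 0.0 for name in names}
    for a, b in combinations(names, 2):
        d = jaccard_distance(genome_parts[a], genome_parts[b])
        dist[(a, b)] = dist[(b, a)] = d
    return dist
```

The resulting matrix could be passed to any standard distance-based tree-building method; that step is left out here.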

2. Global Surveys of Pseudogenes

In addition to analyzing the occurrence of parts and features within the living genes of an organism, we can, for large eukaryotic genomes, also survey them in the pseudogenes and other "dead" regions. This allows us to determine the common "pseudofolds" in a genome and to address important evolutionary questions about the types of proteins that were present in the history of an organism.

3. Relating Folds and other Parts to Functions

We have also related folds to functions in our genome surveys, looking at whether the distribution of functions or the combined distribution of folds and functions differs between genomes. Preliminary results, for instance, show that the association of enzymatic functions with alpha/beta folds appears to be fairly universal, but the worm has proportionately more non-enzymatic "small folds" than do many microorganisms. To carry out these surveys, we have had both to classify protein functions ourselves and to merge many of the existing functional classification schemes (e.g. those for E. coli and yeast). In the future, we are keen to integrate the analysis of biochemical pathways, looking at the presence or absence of whole functional systems in particular genomes.

4. Assessing Functional and Structural Annotation Transfer

Because of the size and complexity of the human genome, manual annotation of every gene is not possible, and reliable systems have to be devised for automatic annotation. One of the most useful (and abused) techniques in genome annotation is "annotation transfer", carrying over information related to a variety of properties (e.g. structure and function) from a known sequence to an unknown one that is similar to it. We are using manually built classifications of protein folds and functions to provide benchmarks to measure to what degree structural and functional annotation can be reliably transferred between similar sequences, particularly when similarity is expressed in modern probabilistic language. Insights from these experiments can be extrapolated to provide confidence levels for various types of genome annotation and to give us a sense of how long it will take to structurally and functionally characterize a whole proteome.
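The benchmarking idea can be sketched as a binning exercise: for pairs of sequences whose fold or function is known from a manual classification, ask how often the two members of a pair share the same annotation at each level of sequence similarity. This is a minimal illustration with hypothetical names; the similarity score is left generic (it could be a percent identity or a transformed probabilistic score), and the bin edges are supplied by the caller.

```python
def transfer_reliability(pairs, bin_edges):
    """Fraction of sequence pairs sharing an annotation, per similarity bin.

    pairs: iterable of (similarity_score, same_annotation) tuples, where
        same_annotation is True if the manually built classification
        puts both sequences in the same fold or function class.
    bin_edges: ascending list of edges; bin k covers
        bin_edges[k] <= score < bin_edges[k + 1].
    Returns a list of (low_edge, high_edge, fraction) tuples, with
    fraction set to None for bins that contain no pairs.
    """
    n_bins = len(bin_edges) - 1
    counts = [[0, 0] for _ in range(n_bins)]  # [total, agreeing] per bin
    for score, same in pairs:
        for k in range(n_bins):
            if bin_edges[k] <= score < bin_edges[k + 1]:
                counts[k][0] += 1
                counts[k][1] += int(same)
                break  # scores outside all bins are silently ignored
    return [
        (bin_edges[k], bin_edges[k + 1], agree / total if total else None)
        for k, (total, agree) in enumerate(counts)
    ]
```

The per-bin fractions give a rough confidence level for transferring an annotation at a given similarity, which is the quantity the benchmarks above are meant to estimate.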

5. Identification of Unique Pathogen Features, Particularly Membrane Proteins

Eventually, we hope to extend the work above to be able to rapidly identify the most atypical and unique features of pathogens, such as M. tuberculosis and T. pallidum, in relation to higher eukaryotes. Proteins with unique folds and functions (i.e. in pathogens and not in humans) could provide promising targets for antibiotics. Of particular interest are membrane proteins. These make up an appreciable fraction of each genome (estimated at around one quarter) and provide the most direct interface between an organism and its surroundings. Although there is only a relatively small amount of crystallographic information on these proteins, computational methods have been more successful at identifying transmembrane folds than those of soluble proteins. We are working on developing methods for the prediction and characterization of membrane proteins.

C. Integrative Data Mining on Functional Genomics Information

1. Clustering Expression Data and Relating It to Protein Features

The data from functional genomics experiments provide dynamic information that goes considerably beyond that implicit in the genome sequence. One of the most common sources of functional genomics information is that related to expression (from such techniques as SAGE and cDNA microarrays). We are working on clustering the expression patterns in these experiments and cross-referencing the results against various classifications of protein features, particularly functions. Analyzing expression information in terms of broad proteomic categories simplifies it and averages away some of the noise in the data. In particular, we have developed a way of assessing the degree to which a given clustering of genes based on expression data globally correlates with a particular functional grouping of genes. We have also worked on characterizing the most highly expressed "parts" in a genome, where, as usual, a part can be defined in multiple ways in terms of folds or families. For expression data sets that have a timecourse aspect, we can also look at the parts that change most over the data series. Finally, we have also analyzed the relationship between expression information and protein-protein interactions.
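One simple way to score how well an expression-based clustering agrees with a functional grouping is to compare them pair by pair. The sketch below uses the Rand index for this purpose; it is an assumed stand-in chosen for illustration, not the measure referred to above, and the function name and input layout are hypothetical.

```python
from itertools import combinations


def rand_index(cluster_of, function_of):
    """Pairwise agreement between an expression clustering and a
    functional grouping.

    cluster_of: dict mapping gene name -> expression cluster label.
    function_of: dict mapping gene name -> functional category label.
    Only genes present in both dicts are compared. Returns the fraction
    of gene pairs on which the two groupings agree (both put the pair
    together, or both keep it apart), or None with fewer than two genes.
    """
    genes = sorted(set(cluster_of) & set(function_of))
    agree = total = 0
    for a, b in combinations(genes, 2):
        same_cluster = cluster_of[a] == cluster_of[b]
        same_function = function_of[a] == function_of[b]
        agree += int(same_cluster == same_function)
        total += 1
    return agree / total if total else None
```

A value near 1 means the two groupings largely coincide. Judging significance would require comparing the score with what random groupings of the same sizes produce, which is not shown here.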

2. Using Expression Information to Help Predict Subcellular Localization

One feature that we found to be highly related to expression was the subcellular localization of a given gene product (e.g. in the cytosol, nucleus, or mitochondria). That is, lowly expressed proteins were much more likely to be destined for the nucleus or mitochondria than the cytoplasm. We then used this observation to help predict the subcellular localization of the many uncharacterized proteins. We developed a Bayesian system that seamlessly integrated the expression information with traditional sequence-motif clues in a probabilistic framework.
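The idea of combining expression level with sequence-motif clues in a probabilistic framework can be sketched as a Bayesian update over compartments. This is a minimal illustration that assumes the features are conditionally independent given the compartment (a naive Bayes simplification); the function name, the feature names and the input layout are hypothetical, and the sketch is not a description of the actual system.

```python
def localization_posterior(prior, likelihoods, observed):
    """Posterior probability of each compartment given observed features.

    prior: dict mapping compartment -> P(compartment).
    likelihoods: dict mapping feature name -> dict mapping
        compartment -> P(feature is observed | compartment).
    observed: list of feature names seen for this protein, e.g. a
        low-expression flag or the presence of a sequence motif.
    Features are multiplied in as if independent given the compartment.
    """
    unnormalized = {}
    for compartment, p in prior.items():
        for feature in observed:
            p *= likelihoods[feature][compartment]
        unnormalized[compartment] = p
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observed features are impossible under every compartment")
    return {c: p / total for c, p in unnormalized.items()}
```

With no observed features the function simply returns the normalized prior, so each additional clue shifts the prediction away from the background distribution of compartments.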

3. Integrated Large-Genome Databases and Data Mining Tools

Integrating the vast and heterogeneous information from functional genomics experiments into organized databases and analysis systems presents both valuable opportunities and serious technical challenges. To this end, we are working on various approaches. For example, we are developing ways of merging together disparate functional genomics data (e.g. transposon insertions and expression information) to help decide whether a particular marginal candidate for a gene is, in fact, real.

We are also working to develop ways of graphically representing on the web the results of surveying all the parts in many genomes from a variety of different perspectives. We expect that such “global views” will prove useful in selecting and ordering targets for genome-wide experiments and subsequently interpreting the results. We are actively trying to put these ideas into practice through our participation in one of the NIH-funded structural genomics centers and through collaborations with a number of experimental functional genomicists. We are, furthermore, working on developing interactive databases to help track the detailed laboratory results of large-scale experimentation, along with data mining approaches that use the information in these databases to help optimize the conditions for high-throughput proteomics.

Conclusion:
Global Views of Genomes in terms of the Finite Parts List of Folds

Overall, the unifying principle for our research is the idea of there being a large but finite list of biological parts. We believe this list is most simply understood in terms of protein folds (keeping in mind that there are related perspectives that focus on the limited numbers of protein functions and pathways). Consequently, we are helping to organize the many structures into fold families and trying to probe the variability of each of these families through motion classification. On the genomics side, the parts list idea is the centerpiece of our strategy of comparing genomes in terms of a small number of building blocks seen from many different angles. Furthermore, we hope to use categories based on protein parts and features as a way of integrating the results from diverse functional genomics experiments into a common framework.

Our lab was one of the first to work on comparing genomes in terms of protein folds, and we believe that our combination of genome analysis with more traditional biophysical and structural calculations gives us a broad and also unique perspective on questions in computational biology.

Research Progress & Plans for Gerstein Group (January 2001)

Bioinformatics: Integrative Analyses of Genome Sequences, Macromolecular Structures, and Expression Datasets

A. Analysis of Structures

1. The Finite Parts List of Protein Folds

2. Classification of the Flexibility of a Fold in a Web Database

3. Measurement of Packing with Volume and Surface Calculations

B. Analysis of Sequences, Focusing on Comparative Genomics

1. Global Surveys of the Occurrence of Folds and Other “Protein Parts” in Genomes

2. Global Surveys of Pseudogenes

3. Relating Folds and other Parts to Functions

4. Assessing Functional and Structural Annotation Transfer

5. Identification of Unique Pathogen Features, Particularly Membrane Proteins

C. Integrative Data Mining on Functional Genomics Information

1. Clustering Expression Data and Relating It to Protein Features

2. Using Expression Information to Help Predict Subcellular Localization

3. Integrated Large-Genome Databases and Data Mining Tools

Conclusion:
Global Views of Genomes in terms of the Finite Parts List of Folds
