Research Progress & Plans for Gerstein Group (January 2001)
Bioinformatics: Integrative Analyses of Genome Sequences, Macromolecular
Structures, and Expression Datasets
As we move into a new century, the human genome
and the genomes of a number of other organisms, comprising billions of
basepairs, have been completely or almost completely sequenced. The number
of known structures of protein domains, which provide the primary way to
interpret gene sequences in physico-chemical terms, is more than 25,000.
Moreover, for each gene in the genome, a variety of functional genomics
experiments, such as DNA microarrays, two-hybrid and gene-disruption experiments,
provide large amounts of standardized information on expression, interaction
and function. Remarkably, with the simultaneous advance of computer technology,
all of this information can fit easily onto the hard drive of a laptop
computer. However, interpreting it will require new approaches.
Broadly, the goal of our laboratory is to take advantage
of this data deluge, by performing integrative surveys and systematic data
mining on the expanding amount of biological information. It is hoped that
these analyses will allow us to address a number of overall statistical
questions about the properties and structures of macromolecules, their
functions in the cell, and their differential distribution in different
organisms. More specifically, since the lab was set up, we have made several
contributions related to comparative genomics (analyzing the occurrence
of protein folds and families in recently sequenced genomes), macromolecular
motions (developing a database framework for classifying motions in terms
of packing), and expression analysis (relating gene expression to protein
subcellular localization). The ongoing research program in the lab extends
and expands previous work as described below.
Our work is fundamentally data-driven and very different
in conception from previous computational work related to macromolecules,
which often concentrated on describing the physical process of protein
folding or predicting the 3D structure of a protein given its amino-acid
sequence. Furthermore, our work, which loosely falls into the emerging
discipline of bioinformatics, is interdisciplinary in character, combining
questions drawn from biology and chemistry with quantitative approaches
from computer science and statistics.
A. Analysis of Structures
1. The Finite Parts List of Protein Folds
The starting point for our analysis is the concept
of the finite list of molecular biological parts. One of the most direct
ways of appreciating this concept can be illustrated using protein folds:
It is believed that there is a large but limited number of protein folds
(estimated to be about 1000), and a library of these folds represents a
sort of master parts list for biology, listing all the important molecular
building blocks in an organized fashion. To build such a library, one needs
some statistical or heuristic definition of what a fold is, a way of clustering
together all the structures with a given fold, and intelligent techniques
for matching up sequences with unknown structure to those with known structure.
We are working on a number of these topics -- on the one hand, trying to
build up our own classifications of structures and, on the other hand,
simultaneously trying to integrate structural classifications developed
by others (especially scop) in our own analyses. In particular, we have
developed a way to automatically create multiple alignments of protein
structures and fuse them into consensus “fold templates”. We have also
worked extensively on developing scoring schemes for assessing the statistical
significance of a given structure-to-structure match.
2. Classification of the Flexibility of a Fold in a Web Database
One important aspect of the fold library is its
use in comprehensively surveying protein flexibility and conformational
variability -- learning how much each part in the master parts list can
vary in shape. (Variability occurs when two structures in the library share
the same fold but still have substantial conformational differences, such
as the disposition of an active site loop.)
We are arranging all instances of conformational
variability into a web-accessible database. Part of this project involves
devising a system for characterizing a protein motion in a highly standardized
fashion, and we have developed a web server that, given two coordinate
sets, automatically does this (producing “morph movies” as a by-product).
Our classification of motions is based on packing. Motions are identified
as shear or hinge, based on whether or not a well-packed interface is maintained
between the mobile elements throughout the motion.
3. Measurement of Packing with Volume and Surface Calculations
Our packing-based classification of motions is motivated
by the fact that protein interiors are packed exceedingly tightly, and
this tight packing at most internal interfaces greatly constrains the way
proteins can move. Our past research has involved measuring the packing
efficiency at a few different interfaces (e.g. interdomain, protein surface)
using specialized geometric constructions, known as Voronoi polyhedra and
Delaunay triangulations, in conjunction with limited amounts of molecular
simulation. We have recently developed a new parameter set for these calculations
which we hope may be of general use. This parameter set includes a self-consistent
set of VDW radii and standard volumes for each type of protein atom. Future
plans include developing software tools to automatically and rapidly scan
the packing at a great variety of interfaces (e.g. helix-helix, interdomain,
sheet-helix) and using these to enlarge the motions database and associated
classification. We also hope to integrate membrane proteins and nucleic
acids more fully into our packing and motions calculations.
B. Analysis of Sequences, Focusing on Comparative Genomics
1. Global Surveys of the Occurrence of Folds and Other “Protein Parts”
in Genomes
As whole genomes are sequenced and more structures
are determined, it will become possible to characterize a substantial fraction
of the folds used in a given organism -- statistically, in the sense of
a population census. This will allow us to see whether certain folds are
more common in certain organisms than in others. We have carried out a
number of initial surveys and found that a number of folds, such as TIM-barrels,
recur often in every (analyzed) genome, while other folds are missing from
certain genomes. Our analyses have also found many global, statistical
differences between protein folds from different phylogenetic groups --
e.g. longer and more numerous all-beta proteins in eukaryotes than in
prokaryotes.
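The census idea can be illustrated with a minimal Python sketch that counts fold occurrences per genome and reports the folds found in every genome. The genome names and fold assignments below are hypothetical toy data, and the function names are invented for illustration; this is not the survey pipeline itself.

```python
from collections import Counter

def fold_census(assignments):
    """Count how many proteins in each genome are assigned to each fold."""
    census = {}
    for genome, fold in assignments:
        census.setdefault(genome, Counter())[fold] += 1
    return census

def folds_in_every_genome(census):
    """Return the folds that occur at least once in every genome."""
    fold_sets = [set(counts) for counts in census.values()]
    return set.intersection(*fold_sets) if fold_sets else set()

# Hypothetical (genome, fold) assignments, one entry per protein.
toy_assignments = [
    ("genome_A", "TIM-barrel"),
    ("genome_A", "TIM-barrel"),
    ("genome_A", "Rossmann"),
    ("genome_B", "TIM-barrel"),
    ("genome_B", "Ig-like"),
]

census = fold_census(toy_assignments)
shared = folds_in_every_genome(census)
```

In this toy input only the TIM-barrel appears in both genomes, which mirrors the kind of "recurs in every genome" observation described above.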
More broadly, the idea of surveying “parts in genomes”
can be readily extended to other elements besides folds, such as orthologs,
sequence families, and motifs, and even to protein features, such as amino
acid composition. We are doing large-scale surveys involving these elements.
In particular, we are working on constructing whole-genome phylogenies,
with the distances between organisms defined in terms of the presence or
absence of specific "parts" in the whole genome. This is in contrast to
traditional phylogenies, which cluster organisms solely on the sequence
similarity of individual genes.
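As a rough illustration of a presence/absence distance, the sketch below computes a Jaccard distance between genomes from the sets of parts each one contains. The choice of Jaccard distance, the genome names, and the fold labels are all assumptions made for illustration; the text above does not specify which distance measure is used.

```python
from itertools import combinations

def jaccard_distance(parts_a, parts_b):
    """Return 1 - |A & B| / |A | B| for two sets of part identifiers."""
    union = parts_a | parts_b
    if not union:
        return 0.0
    return 1.0 - len(parts_a & parts_b) / len(union)

def distance_matrix(genomes):
    """Map each unordered pair of genome names to its Jaccard distance."""
    return {
        (a, b): jaccard_distance(genomes[a], genomes[b])
        for a, b in combinations(sorted(genomes), 2)
    }

# Hypothetical toy data: the set of folds present in each genome.
toy_genomes = {
    "genome_A": {"TIM-barrel", "Rossmann", "ferredoxin"},
    "genome_B": {"TIM-barrel", "Rossmann", "Ig-like"},
    "genome_C": {"TIM-barrel", "Ig-like", "SH3"},
}

distances = distance_matrix(toy_genomes)
```

A pairwise distance table of this kind could then be handed to any standard distance-based tree-building method.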
2. Global Surveys of Pseudogenes
In addition to analyzing the occurrence of parts
and features within the living genes of an organism, for large eukaryotic
genomes, we can also survey them in the pseudogenes and other "dead" regions.
This allows us to determine the common "pseudofolds" in a genome and to
address important evolutionary questions about the type of proteins that
were present in the past history of an organism.
3. Relating Folds and other Parts to Functions
We have also related folds to functions in our
genome surveys, looking at whether the distribution of functions or the
combined distribution of folds and functions differs between genomes. Preliminary
results, for instance, show that the association of enzymatic functions
with alpha/beta folds appears to be fairly universal but the worm has proportionately
more non-enzymatic "small folds" than do many microorganisms. To do this,
we have had to both classify protein functions ourselves and also to merge
many of the existing functional classification schemes together (e.g.
those for E. coli and yeast). In the future, we are keen to integrate
the analysis of biochemical pathways, looking at the presence or absence
of whole functional systems in particular genomes.
4. Assessing Functional and Structural Annotation Transfer
Because of the size and complexity of the human genome, manual annotation
of every gene is not possible, and reliable systems
have to be devised for automatic annotation. One of the most useful (and
abused) techniques in genome annotation is "annotation transfer", carrying
over information related to a variety of properties (e.g. structure and
function) from a known sequence to an unknown one that is similar to it.
We are using manually built classifications of protein folds and functions
to provide benchmarks to measure to what degree structural and functional
annotation can be reliably transferred between similar sequences, particularly
when similarity is expressed in modern probabilistic language. Insights
from these experiments can be extrapolated to provide confidence levels
for various types of genome annotation and to give us a sense of how long
it will take to structurally and functionally characterize a whole proteome.
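One simple way to picture such a benchmark is to bin pairs of proteins by their similarity and ask what fraction in each bin share the same manually assigned annotation. The sketch below assumes percent sequence identity as the similarity measure purely for simplicity (a probabilistic score could be binned in the same way); the pairs and bin edges are hypothetical.

```python
def transfer_accuracy_by_bin(pairs, bin_edges):
    """For each similarity bin [lo, hi), return the fraction of pairs
    whose annotations match, or None if the bin is empty."""
    results = {}
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = [same for similarity, same in pairs if lo <= similarity < hi]
        results[(lo, hi)] = sum(in_bin) / len(in_bin) if in_bin else None
    return results

# Hypothetical benchmark pairs:
# (percent sequence identity, whether the two proteins share a fold).
toy_pairs = [
    (15, False), (18, True),
    (25, False), (28, True), (35, True),
    (45, True), (60, True),
]

accuracy = transfer_accuracy_by_bin(toy_pairs, [0, 20, 40, 101])
```

The per-bin fractions play the role of empirical confidence levels: the lower the fraction in a bin, the less safely an annotation can be carried over at that level of similarity.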
5. Identification of Unique Pathogen Features, Particularly Membrane Proteins
Eventually, we hope to extend the work above to
be able to rapidly identify the most atypical and unique features of pathogens,
such as M. tuberculosis and T. pallidum, in relation to higher
eukaryotes. Proteins with unique folds and functions (i.e. in pathogens
and not in humans) could provide promising targets for antibiotics. Of
particular interest are membrane proteins. These make up an appreciable
fraction of each genome (estimated at around one quarter) and provide the
most direct interface between an organism and its surroundings. Although
there is only a relatively small amount of crystallographic information
on these proteins, computational methods have been more successful at identifying
transmembrane folds than those of soluble proteins. We are working on developing
methods for the prediction and characterization of membrane proteins.
C. Integrative Data Mining on Functional Genomics Information
1. Clustering Expression Data and relating it to Protein Features
The data from functional genomics experiments provides
dynamic information that goes considerably beyond that implicit in the
genome sequence. One of the most common sources of functional genomics
information is that related to expression (from such techniques
as SAGE and cDNA microarrays). We are working on clustering the expression
patterns in these experiments and cross-referencing the results against
various classifications of protein features, particularly functions. Analyzing
expression information with broad proteomic categories simplifies it and
averages away some of the noise in the data. In particular, we have developed
a way of assessing the degree to which a given clustering of genes based
on expression data globally correlates with a particular functional grouping
of genes. We have also worked on characterizing the most highly expressed
"parts" in a genome, where, as usual, a part can be defined in multiple
ways in terms of folds or families. For expression data sets that have
a timecourse aspect we can also look at the parts that change most over
the data series. Finally, we have also analyzed the relationship between
expression information and protein-protein interactions.
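To make the idea of comparing an expression clustering with a functional grouping concrete, the sketch below uses a simple pair-counting measure: among gene pairs that fall in the same expression cluster, what fraction also share a functional category, compared with the same fraction over all gene pairs. This particular measure, the gene names, and the labels are assumptions chosen for illustration, not the assessment method referred to above.

```python
from itertools import combinations

def same_function_fractions(cluster_of, function_of):
    """Return (observed, baseline): the fraction of co-clustered gene pairs
    that share a function, and the same fraction over all gene pairs."""
    genes = sorted(set(cluster_of) & set(function_of))
    co_clustered = co_clustered_same = all_pairs = all_same = 0
    for g1, g2 in combinations(genes, 2):
        same_fn = function_of[g1] == function_of[g2]
        all_pairs += 1
        all_same += same_fn
        if cluster_of[g1] == cluster_of[g2]:
            co_clustered += 1
            co_clustered_same += same_fn
    observed = co_clustered_same / co_clustered if co_clustered else 0.0
    baseline = all_same / all_pairs if all_pairs else 0.0
    return observed, baseline

# Hypothetical expression clusters and functional categories.
toy_clusters = {"gene1": 0, "gene2": 0, "gene3": 1, "gene4": 1}
toy_functions = {
    "gene1": "ribosomal", "gene2": "ribosomal",
    "gene3": "glycolysis", "gene4": "glycolysis",
}

observed, baseline = same_function_fractions(toy_clusters, toy_functions)
```

An observed fraction well above the baseline suggests that the clustering tracks the functional grouping; in practice one would also want a significance estimate, for example from shuffling the functional labels.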
2. Using Expression Information to Help Predict Subcellular Localization
One feature that we found to be highly related
to expression was subcellular localization of a given gene product (e.g.
in the cytosol, nucleus, mitochondria, etc.). That is, lowly expressed
proteins were much more likely to be destined for the nucleus or mitochondria
than the cytoplasm. We then used this observation to help predict subcellular
localization of the many uncharacterized proteins. We developed a Bayesian
system that seamlessly integrated the expression information with traditional
sequence-motif clues in a probabilistic framework.
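A minimal sketch of how such evidence might be combined, assuming a naive-Bayes style update in which each feature is treated as conditionally independent given the compartment. The prior, the likelihood values, and the feature descriptions are invented for illustration and are not the parameters of the system described above.

```python
def bayes_update(prior, likelihoods):
    """Multiply a prior over compartments by P(feature | compartment)
    and renormalize so the result sums to 1."""
    unnormalized = {c: prior[c] * likelihoods[c] for c in prior}
    total = sum(unnormalized.values())
    return {c: p / total for c, p in unnormalized.items()}

# Hypothetical prior probability of each compartment.
prior = {"cytoplasm": 0.5, "nucleus": 0.3, "mitochondria": 0.2}

# Hypothetical likelihoods of each observed feature given the compartment.
observed_features = [
    # Feature 1: the gene has a low expression level.
    {"cytoplasm": 0.2, "nucleus": 0.5, "mitochondria": 0.5},
    # Feature 2: the sequence contains a nuclear-localization-like motif.
    {"cytoplasm": 0.1, "nucleus": 0.8, "mitochondria": 0.1},
]

posterior = prior
for likelihoods in observed_features:
    posterior = bayes_update(posterior, likelihoods)

best_compartment = max(posterior, key=posterior.get)
```

Because each update only multiplies and renormalizes, expression-derived evidence and motif-derived evidence enter on the same footing, and further features can be added without changing the framework.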
3. Integrated Large-Genome Databases and Data Mining Tools
Integrating the vast and heterogeneous information
from functional genomics experiments into organized databases and analysis
systems presents both some valuable opportunities and serious technical
challenges. To this end, we are working on various approaches. For example,
we are developing ways of merging together disparate functional genomics
data (e.g. transposon insertions and expression information) to help decide
whether a particular marginal candidate for a gene is, in fact, real.
We are also working to develop ways of graphically
representing on the web results from surveying all the parts in many genomes
from a variety of different perspectives. We expect that such "global views"
will prove useful in selecting and ordering targets for genome-wide experiments
and subsequently interpreting the results. We are actively trying to put
these ideas into practice through our participation in one of the NIH-funded
structural genomics centers and through collaborations with a number of
experimental functional genomicists. We are, furthermore, working on developing
interactive databases to help track the detailed laboratory results of
large-scale experimentation, and on data mining approaches that use the information
in these databases to help optimize the conditions for high-throughput
proteomics.
Conclusion:
Global Views of Genomes in terms of the Finite Parts List of Folds
Overall, the unifying principle for our research
is the idea of there being a large but finite list of biological parts.
We believe this list is most simply understood in terms of protein folds
(keeping in mind that there are related perspectives that focus on the
limited numbers of protein functions and pathways). Consequently, we are
helping to organize the many structures into fold families and trying to
probe the variability of each of these families through motion classification.
On the genomics side, the parts list idea is the centerpiece of our strategy
of comparing genomes in terms of a small number of building blocks seen
from many different angles. Furthermore, we hope to use categories based
on protein parts and features as a way of integrating the results from
diverse functional genomics experiments into a common framework.
Our lab was one of the first to work on comparing
genomes in terms of protein folds, and we believe that our combination
of genome analysis with more traditional biophysical and structural calculations
gives us a broad and also unique perspective on questions in computational
biology.
-- Mark.Gerstein@yale.edu (8 January 2001)