Chemical & Engineering News: SCIENCE/TECHNOLOGY - For Genomics' Protein Avalanche

The transcribing of the human genome's text may not be quite as complete as was trumpeted in the press in June, but the end is clearly close enough to make scientists aware that they've got another herculean task ahead of them. The glut of genomic information in the form of 3 billion base pairs, assembled into tens of thousands of genes, will need to be translated into proteins, and the structure and function of those proteins will need to be determined. Then, finally, all that information can be used to help cure the world population's ills.

It can take years to crystallize even one protein--with the goal of determining its structure by X-ray diffraction. Figuring out its function can be even more complicated. Clearly, theory will be of seminal importance to researchers, and those tasks would be impossible without the modern computer.

Last month, a symposium on biophysical theory at the American Chemical Society fall meeting in Washington, D.C., organized by chemistry professors Ronald M. Levy at Rutgers, the State University of New Jersey, New Brunswick, and Richard A. Friesner at Columbia University, and sponsored by the Division of Physical Chemistry, devoted time to that issue. How will computers help sort and categorize the anticipated deluge of proteins? How can structure and function be gleaned from sequence?

"It's a good time to be in this field--just because there's a lot to do," said Stephen H. Bryant, biophysicist at the National Institutes of Health's National Center for Biotechnology Information (NCBI) in Bethesda, Md.

Computers still have a ways to go in obtaining three-dimensional protein structures purely from a sequence of base pairs. IBM's gigantic Blue Gene supercomputer, which is under construction, is expected to take a year to fold one protein from scratch.

"At this point, what people do in protein structure prediction is still highly speculative," said Mark Gerstein , assistant professor in molecular biophysics and biochemistry and computer science at Yale University. "It does not carry the same weight as doing an experiment."

But a computer is ideally suited for sorting through reams of sequences to start making some immediate sense of a genome. Computers can cluster proteins into sequence and structural categories, make comparisons, and discover trends that allow scientists to group proteins that are likely to look and behave like known entities. Computers can also enable scientists to pick out unknown sequences that would be good candidates for experiment.

The program CN3D graphically displays protein structures aligned by NCBI's VAST algorithm. The regions where the proteins superimpose are in red.

Protein-fold data in particular are a valuable, simplified way to get at genomic information, Gerstein said. Folds are units of protein structure, such as the well-known α/β-barrel and ferredoxin folds. Smaller proteins, like myoglobin, are composed of a single fold, and a single fold can have multiple functions. But although a genome may contain tens of thousands of genes, these will ultimately translate into only 1,000 to 10,000 different folds. Gerstein and his colleagues take a statistical route: probing fold databases with what he calls a "finite parts list approach."

This is analogous to analyzing a set of bicycle parts, Gerstein said. The questions that can be asked about such parts--Where are they located? What parts are shared? Which are common? Which are unique?--are the same questions that can be asked about protein folds. If certain sequences are known to produce certain folds, a computer can cull them from genome databases. And if proteins contain similar folds, which can be discovered from fold libraries, it's likely they serve similar purposes. For example, folds with certain structures are frequently associated with enzymes, and other folds are associated with proteins that are not enzymes.
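The parts-list questions (what is shared, what is common, what is unique) map onto simple set and counting operations. The following is a minimal sketch in Python. The genome names and fold assignments are made up for illustration and are not data from Gerstein's work.

```python
from collections import Counter

# Hypothetical fold assignments per genome (illustrative only).
genome_folds = {
    "genome_A": ["TIM-barrel", "ferredoxin", "globin", "TIM-barrel"],
    "genome_B": ["TIM-barrel", "ferredoxin", "Rossmann"],
    "genome_C": ["TIM-barrel", "Rossmann", "Rossmann"],
}

def shared_folds(genomes):
    """Folds present in every genome."""
    sets = [set(folds) for folds in genomes.values()]
    return set.intersection(*sets)

def unique_folds(genomes):
    """Folds found in exactly one genome, keyed by that genome."""
    presence = Counter(f for folds in genomes.values() for f in set(folds))
    return {name: {f for f in set(folds) if presence[f] == 1}
            for name, folds in genomes.items()}

def most_common_folds(genomes, n=2):
    """The n folds with the most occurrences across all genomes."""
    counts = Counter(f for folds in genomes.values() for f in folds)
    return counts.most_common(n)
```

In this toy data set only one fold is shared by all three genomes, and only one fold is unique to a single genome.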

Other interesting trends can be teased out of these data sets. The genes of an organism express different proteins at different levels at different times, depending on what's going on--growing, reproducing, aging, and so on. Some protein-fold expression levels are highly variable--they change a lot over time--and some levels remain constant.

When Gerstein interrelated gene expression patterns in yeast with the fold database, he found that the folds expressed most commonly don't have much variability in their expression levels, whereas the expression levels of the less common folds tend to fluctuate more wildly. Gerstein explained that the folds that are more common do "housekeeping" things that require their constant presence. The folds that aren't as common usually have switchlike functions--they turn on and off.

Managing and developing genomic and protein databases as well as the ability to search them is a challenge in itself, Bryant said. "People like me and others are working on the basic machinery to compare molecules, but we're all moving toward ways to start classifying proteins into families," Bryant said. "On the one hand, it's a very obvious thing to do, and on the other hand, it's extremely difficult to do."

To that end, Bryant and his colleagues develop and maintain a host of interlinked programs and databases designed to compare, cross-reference, and align proteins and sequence data. Entrez, a Web-based search and retrieval system, integrates NCBI's numerous genomic and protein databases. NCBI's molecular modeling database complements Entrez with protein structures. Its Web-based Blast software, which searches for similarity between sequences, gets 70,000 hits a day, Bryant said.

Bryant's group has also participated in nationwide tests of protein prediction techniques, producing, for example, a structure of methylglyoxal synthase, which compared well with the structure determined by experiment.

Meanwhile, scientists are still working furiously on the problem of protein structure prediction. There are a number of ways to approach the problem. One can start from absolute scratch using only an amino acid sequence and the laws of physics or statistical rules with no reference to a known protein. But this ab initio strategy, as has been noted, is extremely time-consuming and expensive.

A simpler approach is to look at proteins with similar structure that belong to certain families, such as hemoglobins. With an amino acid sequence in hand, scientists can use databases of known sequences and their structures to search for matches. If the sequence of the unknown structure is similar enough to one that's known, scientists can assign a function to the unknown based on the known sequence. However, as David T. Jones, bioinformatics professor at Brunel University, London, noted, only perhaps 30% of proteins in a genome can be solved in this comparative manner, because there must already be data on hand about at least one family member.

Another method, known as threading, is similar, except that it tries to match a sequence with known protein folds, rather than another sequence. "The idea came from the observation that proteins with quite different sequences often seem to have the same 3-D structure," Jones said. The method is straightforward: Pick a known fold in a database, calculate the energy the mystery sequence would have if it assumed that configuration, and repeat until the best fit is found.
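The threading loop Jones describes (pick a fold, score the sequence on it, keep the best) can be sketched in a few lines of Python. Everything below is an illustrative assumption: the fold library, the buried/exposed encoding, and the scoring rule are toy stand-ins, and real threading programs also align the sequence to the fold with gaps and use far more detailed energy functions.

```python
# Residues treated as hydrophobic in this toy model (an assumption).
HYDROPHOBIC = set("AVLIMFWC")

# Hypothetical fold library: each fold is a string of per-position
# environments, "b" for buried and "e" for exposed.
FOLD_LIBRARY = {
    "fold_1": "bbeebb",
    "fold_2": "ebebeb",
    "fold_3": "eeeeee",
}

def toy_energy(sequence, fold):
    """Score -1 when a residue suits its environment, +1 when it does not."""
    if len(sequence) != len(fold):
        return float("inf")  # no gapped alignment in this sketch
    energy = 0
    for residue, env in zip(sequence, fold):
        suits = (residue in HYDROPHOBIC) == (env == "b")
        energy += -1 if suits else 1
    return energy

def thread(sequence, fold_library, energy_fn):
    """Try the sequence on every fold and return the lowest-energy match."""
    best_name, best_energy = None, float("inf")
    for name, fold in fold_library.items():
        energy = energy_fn(sequence, fold)
        if energy < best_energy:
            best_name, best_energy = name, energy
    return best_name, best_energy

best_fold, best_score = thread("LVKEAL", FOLD_LIBRARY, toy_energy)
```

The search itself is a plain loop over the library; the hard part in practice is the energy function passed in as `energy_fn`.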

However, this method still covers only about 20 to 30% of a genome's proteins.

So what to do with the remaining 50% or so of proteins for which sequence information alone is insufficient for learning their function? The ab initio methods must come into play.

Even within the ab initio realm, however, a number of methodological approaches exist. Jones's group has developed what he calls a "knowledge-based" method that "relies mostly on a set of statistical rules based on known structures to 'guess' at what the correct structure might be," he said.

The five most common protein folds in three representative genomes from the eukaryotic, eubacterial, and archaeal classes.

Levy's group at Rutgers has performed all-atom simulations of protein folding, and one large computational bottleneck stands out: the solvent. It's vital to include the solvent surrounding the protein in order to get it to fold correctly, but explicitly treating all the solvent molecules drastically ups the cost and computer time. Levy's group is developing ways to treat the protein solvent implicitly--yet still realistically--as a bulk material, rather than considering each molecule individually.

As the field develops, these various methods will hybridize until they form a continuum of techniques, Levy said.

A novel protein structure prediction method comes from Chi Ho Mak, chemistry professor at the University of Southern California. A number of so-called folding engines exist, but they suffer from various limitations. For example, they usually work only on proteins that are composed largely of helices. Helices, characterized by the formation of many local hydrogen bonds, are easily simulated. However, β-sheets form with difficulty, because the hydrogen bonds must form between residues far away from each other along the primary sequence.

These ab initio folding methods are also good only for small proteins, those with up to 50 residues. As the number of residues increases, so does the number of conformational possibilities that the program must choose from, and the time and cost of the calculation increases exponentially.
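A back-of-the-envelope calculation shows why chain length matters so much. The sketch below assumes, purely for illustration, that each residue can adopt three backbone states independently; the number three is an assumption, not a figure from the symposium.

```python
def conformation_count(n_residues, states_per_residue=3):
    """Number of chain conformations if every residue picks a state independently."""
    return states_per_residue ** n_residues

# Print the count for a few chain lengths in scientific notation.
for n in (36, 50, 60):
    print(n, f"{conformation_count(n):.2e}")
```

Under that assumption, going from 50 to 60 residues multiplies the number of conformations by 3^10, or 59,049.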

Folding engines based on the popular computational methods known as Monte Carlo start with an unfolded protein, begin a process of twisting the strand at one spot, and then determine the energy of the new conformation. If that energy is lower than the previous one, the program accepts the new conformation; otherwise, the program twists the protein again, calculates a new energy, and so on, until the lowest energy conformation is reached.
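The accept-if-lower loop described above can be written as a short Python sketch. The torsion-state model, the target pattern, and the energy function are hypothetical toys chosen so the example runs on its own. Standard Metropolis Monte Carlo also accepts some uphill moves with a temperature-dependent probability; the sketch follows the simplified rule as stated in the text.

```python
import random

def greedy_monte_carlo(start, energy_fn, propose_fn, n_steps, rng):
    """Accept a trial conformation only if it lowers the energy."""
    current = list(start)
    current_energy = energy_fn(current)
    for _ in range(n_steps):
        trial = propose_fn(current, rng)   # twist the strand at one spot
        trial_energy = energy_fn(trial)    # the expensive step in practice
        if trial_energy < current_energy:
            current, current_energy = trial, trial_energy
    return current, current_energy

# Toy model (hypothetical): each residue has a torsion state in {-1, 0, 1},
# and the energy counts residues that differ from an arbitrary target pattern.
TARGET = [1, 0, -1, 1, 0, -1, 1, 0]

def toy_energy(conformation):
    return sum(1 for a, b in zip(conformation, TARGET) if a != b)

def twist_one_residue(conformation, rng):
    """Return a copy with one randomly chosen residue set to a random state."""
    trial = list(conformation)
    i = rng.randrange(len(trial))
    trial[i] = rng.choice([-1, 0, 1])
    return trial

rng = random.Random(0)
unfolded = [0] * len(TARGET)
folded, final_energy = greedy_monte_carlo(
    unfolded, toy_energy, twist_one_residue, 500, rng)
```

Every pass through the loop calls `energy_fn` once, which is why the cost of the energy calculation dominates the run time.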

The energy calculation is where most of the time, and thus the money, goes. So Mak decided to try "reversing" the process. He would set up the program to generate new conformations that are more likely to be accepted before the energy is calculated. Mak explained that the hydrogen-bond energies of the current conformation have already been calculated. Therefore, if you select the weakest of those bonds and break them preferentially to create the new conformation, then the new conformation will have a better chance of having a lower energy than the one selected randomly.

Mak's smart Monte Carlo engine simulation of the folding of the villin protein headpiece subdomain HP-36, in 16 steps starting from top left and reading across
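The idea of preferentially breaking the weakest hydrogen bonds can be illustrated with a weighted random choice. This is only a sketch of the general principle, not Mak's algorithm: the exponential weighting, the `kT` parameter, and the bond energies are assumptions made for the example. A complete method would also have to account for the biased proposals in its acceptance rule, a detail the talk summary does not cover.

```python
import math
import random

def bond_break_weights(bond_energies, kT=1.0):
    """Weights that favor weak bonds (energies closest to zero)."""
    # Bond energies are negative; a less negative energy gives a larger weight.
    return [math.exp(e / kT) for e in bond_energies]

def pick_bond_to_break(bond_energies, rng, kT=1.0):
    """Choose the index of a bond to break, biased toward weak bonds."""
    weights = bond_break_weights(bond_energies, kT)
    return rng.choices(range(len(bond_energies)), weights=weights, k=1)[0]

# Hypothetical hydrogen-bond energies, already known from the current conformation.
energies = [-5.0, -0.5, -3.0]
rng = random.Random(0)
picks = [pick_bond_to_break(energies, rng) for _ in range(1000)]
```

Because the bond energies are already on hand, choosing the move this way adds almost no cost compared with the energy evaluation that follows.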

Mak and his colleagues tested their algorithm on about 15 small proteins, ranging from 36 to 60 residues. For many of those proteins, the algorithm worked well in comparison to conventional algorithms, he said, gaining sometimes more than a 10-fold increase in calculation speed.

But it doesn't work as well on other proteins, Mak said. For example, they still encounter the same problems with β-sheet formation as the traditional folding engines. "We don't know why that is," Mak said.

Ultimately, it's important to recognize that, at least right now, computers don't have all the answers, Levy said. His own lab has plans to align with experimentalists who study residual dipolar coupling in nuclear magnetic resonance spectroscopy to enhance their ability to predict protein structure. Experimental and predictive techniques can combine for a result that's more powerful than either one would have been separately, he said. "These partnerships--working together--that's really the frontier."

Theorists pore over biomembranes

Biological membranes--lipid bilayers studded with proteins and other biomolecules--are complex enough. But a detailed prediction of how their various components interact, move, or transport ions is almost too much for even a computer.

To model such large and intricate systems rigorously and exactly, each molecule and its interactions with every other molecule in the system must be included in the problem. As the molecules are piled on, this strategy rapidly balloons into a project taking much more time and money than most scientists have, even with today's powerful computers. A computer simulation of only a microsecond of a protein in a membrane can take weeks or months.

Fortunately, theorists are experts at finding ways to simplify the job for a computer while sacrificing as little accuracy as possible. One common strategy, for example, is to treat bulk surrounding material, such as a solvent, as a featureless force field, while keeping the details in a smaller focal object, such as a pore. A number of evolving methods surfaced at the biophysical theory symposium at the American Chemical Society meeting in Washington, D.C., last month, as did new insight into the behavior of ions as they flow through a channel.

Benoît Roux, physiology and biology professor at Weill Medical College of Cornell University, and his graduate student Wonpil Im have developed a novel statistical algorithm based on Monte Carlo methods and Brownian dynamics. It simulates the random jiggling of ions under the influence of membrane potential as they flow through membrane channels.

A snapshot of a simulation of K⁺ and Cl⁻ ion flow through the porin protein OmpF of E. coli, using Roux's method. [© Biophysical Society, Biophys. J., 79, 788 (2000)]

They treat the ions inside a small imaginary circle surrounding the pore individually and calculate their trajectories using Brownian dynamics. Roux and Im also treat the ions in a larger "buffer region" surrounding the inner region explicitly, keeping them in equilibrium by means of the Grand Canonical Monte Carlo method. Finally, the outside regions are treated implicitly, as an electrostatic potential. As a demonstration of the method, Roux and Im simulated the flow of K⁺ and Cl⁻ ions through the porin protein OmpF of Escherichia coli.
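For readers unfamiliar with Brownian dynamics, the basic update moves a particle by a deterministic drift from the force acting on it plus a random kick. The sketch below shows that standard one-dimensional step in Python. It is not Roux and Im's combined algorithm: there is no buffer region, no Grand Canonical Monte Carlo, and the harmonic force and parameter values are assumptions for illustration.

```python
import math
import random

def brownian_step(x, force, D, kT, dt, rng):
    """Advance one coordinate by one overdamped Brownian dynamics step."""
    drift = (D / kT) * force * dt                           # systematic motion
    noise = math.sqrt(2.0 * D * dt) * rng.gauss(0.0, 1.0)   # random jiggling
    return x + drift + noise

def simulate(x0, force_fn, D, kT, dt, n_steps, rng):
    """Return the trajectory of a single ion along one axis."""
    x = x0
    trajectory = [x]
    for _ in range(n_steps):
        x = brownian_step(x, force_fn(x), D, kT, dt, rng)
        trajectory.append(x)
    return trajectory

def harmonic_force(x, k=1.0):
    """Hypothetical restoring force pulling the ion toward x = 0."""
    return -k * x

trajectory = simulate(x0=1.0, force_fn=harmonic_force, D=0.5, kT=1.0,
                      dt=0.01, n_steps=1000, rng=random.Random(0))
```

Here `D` is the diffusion coefficient, `kT` the thermal energy, and `dt` the time step, all in arbitrary units. In an ion-channel simulation the force would come from the membrane potential and the other charges rather than from a spring.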

Max L. Berkowitz, chemistry professor at the University of North Carolina, Chapel Hill, is inspecting another important component of membranes, cholesterol. The usually maligned molecule is actually vital to membrane function. With its polar head, rigid ring structure, and hydrocarbon tail, cholesterol regulates membrane fluidity.

Berkowitz's simulations, which agree well with experiments, are helping scientists understand how sterols affect properties of membranes, including permeability and ion transport. Berkowitz and his postdoctoral researcher Alexander M. Smondyrev have performed a number of simulations of phospholipid membranes and sterol molecules. Using molecular dynamics, they simulated 2 to 4 nanoseconds in the life of dipalmitoylphosphatidylcholine bilayers, varying the concentrations of cholesterol molecules. Berkowitz and Smondyrev compared these with simulations of membranes containing cholesterol sulfate. The sulfate molecule, with its larger, charged headgroup, reduces the tendency for sterols to shrink membrane area and also changes the electrostatic properties of membranes.

They also modeled dimyristoylphosphatidylcholine bilayer systems with other sterols, such as ergosterol and lanosterol. A simulation of a lanosterol-containing membrane revealed an unusual feature: a lanosterol molecule actually lying horizontally along the plane of the membrane, rather than vertically. "This has never been seen before," Berkowitz said. It indicates that membranes with lanosterol have different dynamical and structural properties compared with membranes with cholesterol, he added. "Maybe this is why nature chose cholesterol over lanosterol in the evolution process."

Membrane pores are extremely selective: Some ions, such as potassium or rubidium, will slip through, whereas sodium finds itself barred at the gate. An ion's hydration may have something to do with that selectivity, said Susan B. Rempe, a chemist at Los Alamos National Laboratory. Ions proceed into a channel, which acts like a filter--it's wide at the mouth, narrowing down midway. Consequently, Rempe said, the ion has to give up all its surrounding water molecules, except perhaps its inner hydration shell. The ion may then be reacting with side carbonyl groups in the pore wall, but scientists don't yet know all the details.

Simulation of lanosterol (green) in a bilipid membrane. The simulation shows a previously unseen conformation of a lanosterol molecule lying along the plane of the membrane.

Though hydration numbers for ions are widely considered to be set in stone, Rempe's group found that molecular dynamics simulations of hydrated ions produced hydration numbers that differed from common wisdom. For example, the hydration number of K⁺ has been thought to be 8, but Rempe's simulation indicates that it's less than that. Li⁺ appears to have a hydration value of around 4, not 6 as previously thought. And Na⁺'s average is 4.6, not 6.
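A fractional value such as 4.6 arises because a simulation reports an average over many snapshots. One common way to get it is to count the water oxygens within a cutoff distance of the ion in each frame and average the counts. The Python sketch below assumes that approach; the coordinates and the 3.0 Å cutoff are invented for illustration and are not taken from Rempe's work.

```python
import math

def hydration_number(ion_xyz, water_oxygen_xyz, cutoff):
    """Count water oxygens within the cutoff distance of the ion."""
    return sum(1 for oxygen in water_oxygen_xyz
               if math.dist(ion_xyz, oxygen) <= cutoff)

def mean_hydration_number(frames, cutoff):
    """Average the per-frame count over a list of (ion, oxygens) snapshots."""
    counts = [hydration_number(ion, oxygens, cutoff) for ion, oxygens in frames]
    return sum(counts) / len(counts)

# Two hypothetical snapshots with the ion at the origin (distances in angstroms).
frames = [
    ((0.0, 0.0, 0.0),
     [(2, 0, 0), (0, 2, 0), (0, 0, 2), (0, 0, -2), (5, 0, 0)]),
    ((0.0, 0.0, 0.0),
     [(2, 0, 0), (0, 2, 0), (0, 0, 2), (0, -2, 0), (-2, 0, 0)]),
]
average = mean_hydration_number(frames, cutoff=3.0)
```

In this toy case one frame has four waters inside the cutoff and the other has five, so the average comes out to 4.5.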

Compared with more traditional gas-phase electrostatic calculations, which Rempe's group also performed, their molecular dynamics simulations had the advantage of treating many more water molecules explicitly. The gas-phase calculation alone isn't enough to predict hydration, Rempe said, as the presence of additional molecules changes the ion-water interactions in the inner hydration shell.

Some preliminary gas-phase simulations of sodium in an ion channel, Rempe said, had unexpected results. They surrounded Na⁺ with five water molecules and selected formamide to represent a carbonyl group in the channel wall. When they calculated the free energy of the reaction, they expected that the ion would dehydrate. Instead, Na⁺ keeps its shell of water and the formamide adds directly to it. That was a surprise, Rempe said. In contrast, the K⁺ ion sheds a water molecule when it picks up a formamide molecule. The group now plans to simulate the same systems of water, formamide, and an ion with molecular dynamics.

Chemical & Engineering News Copyright © 2000 American Chemical Society - All Rights Reserved 1155 16th Street NW • Washington DC 20036 • (202) 872-4600 • (800) 227-5558