MB&B 452a

Spring 2005

BIOINFORMATICS
SECTION

http://www.gersteinlab.org/courses/452/05-spr/bioinfo.html

Fall 2000, Fall 2001, Fall 2002, Fall 2003


CLASS OVERHEADS
Intro Sequences Structures Expression Databases Surveys Datamining Simulation (2001) Summary (2001)

NEW 2005 OVERHEADS
Databases
Datamining
Advanced Datamining
Genome Data and Tool Interoperation

Bioinformatics describes the computational analysis of gene sequences, protein structures, and expression datasets on a large scale. Specific topics include sequence alignment, biological database design, geometric analysis of protein structure, and macromolecular simulation.

Timing and Location

Meets from 1:00-2:15 PM on Mondays and Wednesdays, in Bass 305.

Instructor

Mark Gerstein
Bass 432A, Phone 203 432-6105, e-mail Mark.Gerstein@yale.edu (Office hours right after class)

Teaching Fellows

General Information

The bioinformatics module will follow a very similar progression to the course offered last fall.

Also, see other related on-line lectures.

Prerequisites

Research Jobs in Bioinformatics

If you're really motivated, take a look at http://bioinfo.mbb.yale.edu/jobs.

Use of Overheads and Other Course Materials

If you want to use the overheads in your own course, feel free, as long as you give proper attribution.
(A number of the overheads were derived from related courses at Stanford and Yale and are so acknowledged.)
Most of the reading material is copyrighted and can NOT be freely distributed. It should not be accessible outside of Yale.

Also, see general Permissions statement.

Related Courses


Class Requirements

Attendance, class participation

Reading

Papers will be assigned throughout the course. These papers will be discussed in weekly sections led by the TAs. These papers can be found here.

Quizzes

Two short in-class quizzes comprising SIMPLE questions that you should be able to answer from the lectures plus the main readings.
April 4: The first quiz will cover the first part of the bioinformatics lectures.
April 18: The second quiz will cover the rest of the material in the bioinformatics section, up to and including April 13.

Final Projects


Final Project Assignment
Spring 2005 Project Description

Due
At End of Reading Period (April 29). Turn in a full printout to Joann DelVecchio in BASS 432 and an electronic version to your TA.

Handed in
Projects done this year (2005) and previously: 2005, 2003, 2002, 2001, 2000, 1999, 1998

Introduction

Overheads
[html] [pdf] [ppt.gz]


Introduction Required Reading
What is Bioinformatics?

Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules on a large scale.
Bioinformatics is MIS (management information systems) for Molecular Biology Information. It is a practical discipline with many applications.


Sequence

Overheads
[html] [pdf] [ppt.gz]

Sequence Alignment Reading

  • Basic Alignment via Dynamic Programming
  • Suboptimal Alignment
  • Gap Penalties
  • Similarity (PAM) Matrices
  • Multiple Alignment
  • Profiles, Motifs, HMMs
  • Local Alignment
  • Probabilistic Scoring Schemes
  • Features of Genomic DNA Sequence
  • Rapid Similarity Search: Fasta
  • Rapid Similarity Search: Blast
  • Practical Suggestions on Sequence Searching
  • Transmembrane helix predictions
  • Secondary Structure Prediction: Basic GOR
  • Secondary Structure Prediction: Other Methods
  • Assessing Secondary Structure Prediction

Structures

Structures Overheads
[html] [pdf] [ppt]

Structures Reading

  • What Structures Look Like?
  • RMS Superposition
  • Structural Alignment by Iterated Dynamic Programming
  • Scoring Structural Similarity
  • Protein Geometry
  • Calculation of Surface Area
  • Calculation of Volume
  • Standard Volumes and Radii

Expression

Expression Overheads
[html] [pdf] [ppt]

  • Background Correction
  • Cy5/Cy3 Normalisation
  • Merging replicated experiments
  • Scoring differential hybridisation

Databases

Database Overheads
[html] [pdf] [ppt]


Databases Reading

  • Structuring Information in Tables
  • Keys and Joins
  • Normalization
  • Complex RDB encoding
  • Indexes and Optimization
  • Forms and Reports

Surveys

Surveys Overheads
[html] [pdf] [ppt]

Surveys Reading

  • Parts Lists: homologs, motifs, orthologs, folds
  • Fold Library
  • Overall Sequence-structure Relationships, Annotation Transfer
  • Parts in Genomes, shared & common folds
  • Genome Trees
  • Extent of Fold Assignment: the Bias Problem
  • Bulk Structure Prediction
  • The Genomic vs. Single-molecule Perspective
  • Understanding Biases in Sampling
  • Relationship to experiment: LIMS, target selection
  • Function Classification
  • Cross-tabulation, folds and functions
  • Clustering & Trees
  • Analysis of Expression Data
  • Clustering
  • Bayesian Analysis
  • Analysis of Other Whole Genome Datasets

Datamining

Datamining Overheads
[html] [pdf] [ppt]

Data Mining Reading

  • Relating Gene Expression to Protein Features and Parts
  • Supervised Learning: Discriminants
  • Simple Bayesian Approach for Localization Prediction
  • Unsupervised Learning: k-means
  • Correlation of Expression Data with Function
  • Overview of Issues in Datamining
  • Overview of Methods of Supervised Learning
  • Focus on Decision Trees
  • Overview of Methods of Unsupervised Learning
  • Cluster Trees, Evolutionary Trees

Simulations

Simulation Overheads
[html] [pdf]

Simulations Reading

  • Packing
  • Basic Forces: Electrostatics
  • VDW Forces
  • Bonds as Springs
  • Energy Minimization
  • Monte Carlo
  • Molecular Dynamics
  • Energy and Entropy
  • Parameter Sets
  • Number Density
  • Poisson-Boltzmann Equation
  • Lattice Models and Simplification


Summary Class Overheads [html] [pdf]

READINGS

Introduction Readings

Nicholas M Luscombe, Dov Greenbaum & Mark Gerstein (2001).
What is bioinformatics? A proposed definition and overview of the field
Methods Inf Med. 2001;40(4):346-58. REQUIRED FOR SECTION 6


D Greenbaum, N M Luscombe, R Jansen, J Qian, M Gerstein. (2001)
Interrelating Different Types of Whole-genome Data, from Proteome to Secretome: 'Oming in on Function
Genome Res. 2001 Sep;11(9):1463-8  REQUIRED FOR SECTION 6


Sequence Reading

Sequence Alignment Reading

Needleman, S. B. and Wunsch, C. D. (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins." J. Mol. Biol. 48: 443-453. REQUIRED FOR SECTION 7
(The original paper. Still pretty easy to read. Will be used in class.)

D J States & M S Boguski, "Similarity and Homology," Chapter 3 from Gribskov, M. and Devereux, J. (1992). Sequence Analysis Primer. New York, Oxford University Press. REQUIRED FOR SECTION 7
(Focus on dynamic programming section of this chapter.)

Scoring Reading

Altschul, S. F., Boguski, M. S., Gish, W. and Wootton, J. C. (1994). Issues in searching molecular sequence databases. Nature Genetics. 6(2): 119-29. REQUIRED FOR SECTION 8
(Most important. A short overall review.)

Altschul et al. (1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs," Nucleic Acids Res 1997 Sep 1;25(17):3389-402 REQUIRED FOR SECTION 9

M Levitt & M Gerstein (1998). A Unified Statistical Framework for Sequence Comparison and Structure Comparison. Proceedings of the National Academy of Sciences USA 95: 5913-5920
(Understand the concept of P-value and the framework for deriving scoring statistics.)

Pearson, W. R. (1996). Effective Protein Sequence Comparison. Meth. Enz. 266: 227-259.
(Understand how the FASTA e-value is derived.)

Multiple Alignment Reading

Eddy, S. R. (1996). "Hidden Markov models," Curr. Opin. Struc. Biol. 6, 361-365.

Eddy, S. R. (1998) "Profile hidden Markov models," Bioinformatics 14(9):755-63. REQUIRED FOR SECTION 8

Higgins, D. G., Thompson, J. D. & Gibson, T. J. (1996). "Using CLUSTAL for multiple sequence alignments," Methods Enzymol 266, 383-402.

Secondary Structure Prediction Reading

Garnier, J., Gibrat, J. F. & Robson, B. (1996). "GOR method for predicting protein secondary structure from amino acid sequence," Methods Enzymol 266, 540-53. REQUIRED FOR SECTION 9

King, R. D. & Sternberg, M. J. E. (1996). "Identification and application of the concepts important for accurate and reliable protein secondary structure prediction," Prot. Sci. 5, 2298-2310.

Extra Sequences Reading

Smith, T. F. and Waterman, M. S. (1981). "Identification of common molecular subsequences." J. Mol. Biol. 147: 195-197
(The original paper on local alignment. Not quite as easy to read, but introduces this important concept.)

Frishman D, and Argos P. (1997) "The Future of Protein Secondary Structure Prediction Accuracy," Folding & Design 2:159-62.
(Controversial idea: secondary structure prediction to 80%?)

M Gerstein (1998). "Measurement of the Effectiveness of Transitive Sequence Comparison, through a Third ‘Intermediate’ Sequence," Bioinformatics 14: 707-14.


Databases, Datamining and Surveys

Databases and Surveys Main Reading

M Gerstein (2000). "Integrative database analysis in structural genomics," Nature Structural Biology 7: 960-963.

Korth & Silberschatz, Database System Concepts [amazon]
(CS book on databases. Read pages 1 to 65 [sections 1.0 to mid-3.2] and pages 97 to 108 [part of section 4.1].)

M Gerstein & R Jansen (2000). "The current excitement in bioinformatics, analysis of whole-genome expression data: How does it relate to protein structure and function?" Current Opinion in Structural Biology 2000, 10:574–584.

Eisen MB, Spellman PT, Brown PO, & Botstein D (1998). "Cluster analysis and display of genome-wide expression patterns," Proc Natl Acad Sci U S A 1998 95: 14863-8

Extra Databases and Surveys Readings

J Lin & M Gerstein (2000). "Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels," Genome Res. 10: 808-1

S Teichmann, C Chothia & M Gerstein (1999). "Advances in Structural Genomics," Curr. Opin. Struc. Biol. 9: 390-399.

M Gerstein & W Krebs (1998). "A Database of Macromolecular Movements," Nuc. Acid. Res. 26:4280-4290.

Fred Tekaia, Antonio Lazcano & Bernard Dujon (1999). "The Genomic Tree as Revealed from Whole Proteome Comparisons," Genome Res. 9:550-557

H Hegyi & M Gerstein (1999). "The Relationship between Protein Structure and Function: a Comprehensive Survey with Application to the Yeast Genome," J Mol. Biol. 288: 147-164.

M Gerstein & H Hegyi (1998). "Comparing Microbial Genomes in terms of Protein Structure: Surveys of a Finite Parts List," FEMS Microbiology Reviews 22: 277-304.

M Gerstein (1998). "Patterns of Protein-Fold Usage in Eight Microbial Genomes: A Comprehensive Structural Census," Proteins 33: 518-534.
(This is an example of the application of large-scale, database-style calculations.)

Tomb, J.-F., White, O., Kerlavage, A. R., Clayton, R. A., Sutton, G. G., Fleischmann, R. D., Ketchum, K. A., Klenk, H. P., Gill, S., Dougherty, B. A., Nelson, K., Quackenbush, J., Zhou, L., Kirkness, E. F., Peterson, S., Loftus, B., Richardson, D., Dodson, R., Khalak, H. G., Glodek, A., McKenney, K., Fitzegerald, L. M., Lee, N., Adams, M. D., Hickey, E. K., Berg, D. E., Gocayne, J. D., Utterback, T. R., Peterson, J. D., Kelley, J. M., Cotton, M. D., Weidman, J. M., Fujii, C., Bowman, C., Watthey, L., Wallin, E., Hayes, W. S., Borodovsky, M., Karp, P. D., Smith, H. O., Fraser, C. M. & Venter, J. C. (1997). "The complete genome sequence of the gastric pathogen Helicobacter pylori," Nature 388, 539-547.
(This research article describes one of the recent genome sequences.)

Cavalli-Sforza, L. & Edwards, S. (1967). "Phylogenetic analysis: models and estimation procedures," Evolution 21, 550-570.

M Gerstein (1998). "How Representative are the Known Structures of the Proteins in a Complete Genome? A Comprehensive Structural Census," Folding & Design 3: 497-512.

Fitch, W. M. (1971). "Toward defining the course of evolution: minimum change for a specific topology," Syst. Zool. 20, 406-416.

Swofford et al. (1994). "Phylogeny reconstruction," In Molecular Systematics (2nd ed.), Sinauer Press.
(This book chapter is a good reference, though not necessary reading.)


Structures and Simulations

Structures Required Reading

Holm, L. and Sander, C. (1993). Protein Structure Comparison by Alignment of Distance Matrices. J. Mol. Biol. 233: 123-138.
(A different method of structural alignment, which differs more from sequence alignment.)

M Gerstein & M Levitt (1998). "Simulating Water and the Molecules of Life," Scientific American 279: 100-105.

Extra Structures Reading

M Gerstein & M Levitt (1998). "Comprehensive Assessment of Automatic Structural Alignment against a Manual Standard, the Scop Classification of Proteins," Protein Science 7: 445-456.
(Understand the method, not results, in this paper OR in Gerstein & Levitt (1996))

M Gerstein & F M Richards, "Protein Geometry: Volumes, Areas, and Distances," (2000) chapter 22 of volume F of the International Tables for Crystallography ("Molecular Geometry and Features" in "Macromolecular Crystallography")

J Tsai, R Taylor, C Chothia & M Gerstein (1999). "The Packing Density in Proteins: Standard Radii and Volumes," J. Mol. Biol. 290: 253-266.

Taylor, W. R. & Orengo, C. A. (1989). Protein Structure Alignment. J. Mol. Biol. 208, 1-22.

Kuntz, I. D. (1992). Structure-Based Strategies for Drug Design and Discovery. Science 257, 1078-1082.
(Docking. See link below for more information.)
http://www.cmpharm.ucsf.edu/kuntz

Richards, F. M. (1977). Areas, Volumes, Packing, and Protein Structure. Ann. Rev. Biophys. Bioeng. 6, 151-76.

Richards, F. M. (1974). The Interpretation of Protein Structures: Total Volume, Group Volume Distributions and Packing Density. J. Mol. Biol. 82, 1-14.
(Original Application of Voronoi Method to Proteins. See Int. Tabl. document above for more details on method.)

Pattabiraman, N., Ward, K.B. and Fleming, P.J. (1995) Occluded Molecular Surface: Analysis of Protein Packing, Journal of Molecular Recognition, 8:334-344
http://csbmet.csb.yale.edu/userguides/datamanip/os/os_descrip.html  -- OS

Joan Pontius, Jean Richelle, Shoshana J. Wodak (1996). Deviations from Standard Atomic Volumes as a Quality Measure for Protein Crystal Structures. Journal of Molecular Biology 264: 121-136.

Barry Cipra (1998). Packing Challenge Mastered At Last, Science 281: 1267

Simon Singh (1998). Mathematics Proves What the Grocer Always Knew, New York Times (August 25).

McCammon, J. A. & Harvey, S. C. (1987). Dynamics of Proteins and Nucleic Acids. Cambridge UP.

Honig, B. & Nicholls, A. (1995). Classical electrostatics in biology and chemistry. Science 268, 1144-9.

Information on Liquid Simulation Methods (excerpted from a thesis, 1992)

Levitt, M. (1983). Protein folding by restrained energy minimization and molecular dynamics. J Mol Biol 170, 723-64.

Allen, M. P. & Tildesley, D. J. (1987). Computer Simulation of Liquids. Clarendon Press, Oxford. (A good reference.)

Karplus, M. & McCammon, J. A. (1986). The dynamics of proteins. Sci. Am. 254, 42-51. (A good reference.)

Duan, Y. & Kollman, P. A. (1998). Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution. Science 282, 740-4.

Sharp, K. (1999). Electrostatic Interactions in Proteins. In International Tables for Crystallography, International Union of Crystallography, Chester, UK.

Dill, K. A., Bromberg, S., Yue, K., Fiebig, K. M., Yee, D. P., Thomas, P. D. & Chan, H. S. (1995). Principles of protein folding--a perspective from simple exact models. Protein Sci 4, 561-602.

Franks, F. (1983). Water. The Royal Society of Chemistry, London. Pages 35-56.


"Fun" Pop Reading (Extra)

"Fathering life and other feats," Economist, 2 February 1999
(About synthesized M. genitalium)

"The Gutenberg Internet," Wall Street Journal, June 11, 1999

"The hot new job in agriculture is bioinformatics," Work Week, Wall Street Journal, August 17, 1999, A1

Antonio Regalado (1999), "Mining the Genome," MIT TechReview, Sept/Oct. issue.

Charles C. Mann, "Biotech Goes Wild," TechReview, July/August

DAVID STIPP, "GENE CHIP BREAKTHROUGH," FORTUNE, 03/31/1997

Economist, 6/28/99, "Science & Technology: Drowning in data"

GEORGE JOHNSON, "Searching for the Essence of the World Wide Web," April 11, 1999

HENRY FOUNTAIN, "Hiding Secret Messages Within Human Code," New York Times, June 22, 1999, F5

J L Weldon. "A Career in Data Modeling," Byte, June 1997
(Practical hands-on discussion of data modeling in commercial context, many of the same issues apply in bioinformatics.)

J L Weldon. "Data Warehouse Building Blocks," Byte, January 1997

J L Weldon. "Warehouse Cornerstones," Byte, January 1997
(Other, less relevant articles on some of the practical hardware issues in database design.)

J L Weldon. "RDBMSes Get a Make-Over," Byte, April 1997
(Practical discussion of what an object database is.)

Johnson, G. (1997). "Proteins Outthink Computers in Giving Shape to Life," New York Times. March 25, 1997, C1.

L Hunter (ed), AI and Molecular Biology, AAAI Press (A new intro. text)

L. Fisher (1999). "Surfing the Human Genome; Data Bases of Genetic Code Are Moving to the Web," New York Times. 09/20/99, C1

Langreth, R. (1997). "Scientists Unlock Sequence Of Ulcer Bacterium's Genes," Wall Street Journal. 7 August.

Lisa Belkin, "Splice Einstein and Sammy Glick. Add a Little Magellan," New York Times Magazine, 08/23/98, Page 26 (Article on J C Venter)

M Gerstein (1999). "E-publishing on the Web: Promises, Pitfalls, and Payoffs for Bioinformatics," Bioinformatics 15: 429-431.

MARLISE SIMONS, "Team of Scientists to Prepare a Rolodex of Life on Earth," New York Times, July 27, 1999, F2

N Wade, "Who'll Sequence Human Genome First? It's Up to Phred," New York Times, March 23, 1999, F2

NICHOLAS WADE, "Cambridge Lab Keeps Britain Ahead in Genome Stakes," New York Times, October 6, 1998

NICHOLAS WADE, "Gains Are Reported in Decoding Genome," New York Times, May 22, 1999, A4

PAMELA LICALZI O'CONNELL "Beyond Geography: Mapping Unknown of Cyberspace," New York Times, September 30, 1999

Pollack, A. (1998). Drug Testers Turn to 'Virtual Patients' as Guinea Pigs. New York Times, Nov. 10

Primer on Molecular Genetics from the DOE

ROBERT LANGRETH, "CuraGen's Finds 55,000 Variations Of Genes, Auguring Tailored Drugs," Wall Street Journal, August 16, 1999

Steven Vogel, "Academically Correct Biological Science", American Scientist, November-December 1998

Tanouye, E. & Langreth, R. (1998). "SmithKline-Glaxo Deal Driven By the Hunt for Human Genes," Wall Street Journal. February 2.

Wade, N. (1997). "Now Playing at a Nearby Lab : 'Revenge of the Fly People,'" New York Times. 05/20/97, C1.

Wade, N. (1997). "Scientists Map Ulcer Bacterium's Genetic Code," New York Times. August 7.

Wade, N. (1997). "Thinking Small Paying Off Big In Gene Quest," New York Times. 02/03/97, A1.

WILLIAM K. STEVENS, "Rearranging the Branches on a New Tree of Life," August 31, 1999, F1


Image

The DNA-mouse image is adapted from the GCB-98 homepage. What's wrong with the adaptation?


MB&B Department, Bass building, Yale University, New Haven, CT 06520

[home]  Lab Home