Yale Genomics & Bioinformatics

CBB752/CPSC452/CPSC752/MBB452/MBB752/MCDB452/MCDB752

http://www.gersteinlab.org/courses/452/

Lecture Summaries

Genomics Section
Sept 06 DS & MG	Genomics: Overview. The Diversity of Life. Overview of the syllabus and topics covered in the course. Genomes vary greatly in GC content, size, and gene density. Size ranges from Nanoarchaeum equitans, with fewer than half a million bases and 552 genes, to Zea mays (corn), with about 5 billion bases and 60,000 genes. It is thought that approximately 250 to 350 genes are essential for life, but the exact number is not known. Life can be divided into bacteria, archaea, and eucarya.
Sept 11 DS	Genomics: Overview. The Diversity of Life. Life exists at extremely high temperatures (thermophiles and hyperthermophiles), low temperatures (psychrophiles), high salt (halophiles), extreme pH (acidophiles and alkaliphiles), high pressures (piezophiles or barophiles), and other extreme conditions. Prokaryotes are overwhelmingly abundant and extremely important to all life. Genomics: Overview of Proteins and Nucleic Acids. Nucleic acids commonly contain 4 bases: A, G, C, and T in DNA; A, G, C, and U in RNA. Modified nucleotides are common, particularly in tRNA. Proteins are composed of amino acids. There are 20 common amino acids. Two additional amino acids may be co-translationally inserted (selenocysteine and pyrrolysine). A number of additional amino acids may be created by post-translational modifications.
Sept 13 DS	Genomics: Overview of Proteins and Nucleic Acids. There is no single "strategy" used to make thermostable enzymes; different enzymes have different features that make them thermostable. Genome Sequencing, Sequence Annotation. A long enough sequence of DNA is likely to be unique in the genome and can be detected with a probe of complementary sequence. Short stretches of DNA can be sequenced using the chain terminator method. Genomes can be sequenced using either hierarchical or shotgun sequencing. Genomes are sequenced with multiple coverage (e.g., sequencing the same region of the genome ten times) to ensure accuracy. Lateral gene transfer allows for the transfer of genes between species and means that the tree of life is not a strict hierarchy.
Sept 18 DS	Genome Sequencing, Sequence Annotation. Synteny maps show regions of chromosomes that are similar between separate species. There are a number of bioinformatics tools available through NCBI, EBI, GenomeNet, and the Joint Genome Institute. BLAST finds similarity between protein and nucleic acid sequences. This can be used to search for orthologs of known genes and proteins. The results of BLAST can be viewed as a phylogeny. The Protein Data Bank (PDB) contains solved structures of proteins and nucleic acids. SCOP is used to classify proteins based on their structure. Structural Genomics. Structural genomics seeks to solve representative structures for all available proteins. It is akin to the Human Genome Project in that it seeks to generate and organize data rather than test specific hypotheses.
Sept 20 DS	More Structural Genomics. Structural genomics attempts to determine the structure of proteins. It is possible to predict the function of a protein by comparing its structure to other proteins of known function. These predictions can then be confirmed with biochemical analysis. Knowing protein structure and function allows for a better understanding of important pathways and mechanisms (e.g., the glycolytic pathway). However, even though proteins may have similar structure, it is possible that they will bind different substrates and thus have different functions. Despite the large number of protein sequences, there is a more manageable number of protein structures, and therefore a structural hierarchy is a useful classification of proteins. Chemical Genomics. Chemical genomics is the study of small molecules applied to biology; specifically, it seeks to determine which molecules interact with which protein classes and how they affect the proteins' functions. There are a number of methods for measuring the affinity of small molecules for proteins. High throughput screens are used to test many different chemicals at the same time. Fluorescence can act as a label of, for example, cells that are able to survive despite induction of a toxic gene. Microarrays are used to test for drug sensitivity and resistance.
Sept 25 DS	Genomic analysis of a cell/organism can be broken down into analysis of 1) the genes themselves, 2) mRNA expression, and 3) protein interactions. Transcriptomics analyzes gene expression at the mRNA level, using DNA arrays and GeneChips as the latest technology. Abundant mRNA does not necessarily imply abundant protein, so analysis at all three levels is important. A new level of analysis, known as metabolomics, has developed as well for the purpose of examining functional protein interactions in a cell. Metabolites can be used as biomarkers without knowing their biochemical details; the presence or absence of metabolites alone can be used to indicate a particular biological state (e.g., whether a drug was taken or not). Metabolite measurements are fairly cheap and noninvasive: an 800-900 MHz NMR can easily separate 25,000 of them.
Sept 27 DS	Pattern matching against a "compendium of profiles". Test an organism under a number of different states and you can determine a set of genes that is characteristic for a specific state. A set of genes doesn't necessarily focus on what each gene specifically does (or the pathway the genes are in), but looks at the set as a whole and can use this to determine markers for a particular state of an organism. Examples used in class were: Hughes et al. (2000) Cell 102, 109-126 - determined a profile for S. cerevisiae with a mutation in ERG28; Golub, T.R. et al. (1999) Science 286, 531-537 - determined highly and lowly expressed genes in patients with acute lymphoblastic leukemia vs. acute myeloid leukemia; Joyce et al. (2000) Nature Genet. 3, 462-473 - clustered similar genes from different organisms. Proteomics. Many things can be done with proteomics; look at the chart from Graves & Haystead (2002) Microbiol. Mol. Biol. Rev. 66, 39-63. Not all proteins can be isolated, due to difficulties in dealing with their chemical properties, isoelectric points, and size. Can identify proteins by 1D or 2D polyacrylamide gels. Can identify protein sequence from mass spec scans. Can use the Isotope Coded Affinity Tag (ICAT) method for measuring differential protein expression; in this method, proteins from two different sets (or cells) are analyzed at once. Metabolomics. This is not a new area, but new techniques have brought metabolomics to a different, higher level. Metabolomics seeks to find the sum of metabolites found in an organism/cell. Metabolomics vs. Metabonomics. Ways to analyze metabolites are by NMR, LC-MS data, and a few others mentioned in Weckwerth W., Morgenthal, K. (2005) Drug Discovery Today 10, 1551-1558. From the data, one can identify sets of metabolites that can be used to identify certain states of an organism (like the sets of genes in #1 above). Ex. Kochhar et al. (2006) Anal Biochem. 352, 274-281. Fluxome. Def: how fluxes in various organisms are changing. Ex. Fischer E, Sauer U. (2005) Nat Genet. 37, 636-640 - measured fluxes in the glucose pathway by using the C13 isotope and measuring the ratio between C12 and C13 in glucose. Main Conclusions From Lecture. Biomarkers can show what is needed for a state and can be tested with the addition of a drug or food. A certain drug may not produce the same effect in different individuals, but metabolomics can separate individuals into responding and non-responding groups.
Oct 2 MS	Knowing the DNA sequence of an organism is simply the first step. It is necessary to figure out what each gene in the organism does, and the best way to learn gene function is arguably through inactivation of genes. The three general methods are insertional mutagenesis, systematic knockouts, and RNAi. Insertional mutagenesis via a multipurpose transposon is not gene inactivation per se (since it may not generate a null allele), but it does tag genes with an easily identifiable epitope. This allows analysis of protein localization in yeast. A more advanced model in mice has also allowed similar high throughput research into the essentiality of mouse genes. The hallmarks of the systematic knockout approach are (a) bar coding, which allows easy identification of which gene is being knocked out, and (b) a complete null phenotype. However, systematic knockouts are done within the framework of currently annotated sequences, so new genes cannot be analyzed by this method. Analysis of knockouts by DNA microarrays allows for the compilation of characteristic expression profiles for various states of the organism, including for programmed knockouts. RNAi was not covered until the following lecture, but is very important because this year’s Nobel Prize went to the scientists who discovered RNAi.
Oct 4 MS	Gene Inactivation. Gene inactivation is one of the best ways to learn about an unknown protein's function: phenotypic change -> gene function; cluster genes by resulting phenotype; drug target discovery. Three basic strategies: transposons, knockouts, RNAi. Transposons: Discover non-annotated ORFs by randomly inserting a transposon and selecting for in-frame insertions. Can also analyze phenotypes or protein localization. Advantages: simple, cheap, can be used for gene discovery. Disadvantages: biased (thus incomplete coverage), may not necessarily generate a null allele. Tagged Knockouts: Create a vector which, when transformed into diploid yeast, knocks out a specific gene. Sporulate and analyze phenotypes. Can use PCR on barcoded yeast populations to quantify fitness levels. Advantages: always gives a null phenotype, comprehensive, barcodes. Disadvantages: expensive, relies on annotated sequence. RNAi: ~22-bp dsRNA leads to rapid degradation of complementary mRNA. Occurs in many organisms: C. elegans, Drosophila, some mammalian cells. Kolfschoten et al. (2005) Cell 121, 849-858 - identification of tumor suppressors. Can also be used to knock out entire gene families. Advantages: inexpensive, systematic, lots of control. Disadvantages: limited alleles, off-target effects, some genes not affected. Protein-Protein Interactions. Obvious interest in discovering interactions: interaction maps, network analysis. Three main methods: two-hybrid, affinity tagging, protein chips. Yeast Two-hybrid: Split a TF's DNA-binding domain from its activation domain and attach each to one of two proteins. Expression of HIS3 will only occur upon putative interaction. Large scale via matrix or bait/prey pool strategies. Two-hybrid studies (Uetz et al. and Ito et al.) showed little overlap. Rual et al. (2005) Nature Vol 437. Advantages: simple, in vivo assay. Disadvantages: scaling difficulties, restricted to yeast nucleus, ~50% false positives. Affinity tags (TAP-Tagging): Integrate Cam-binding domain and IgG binding domain with YFP. Two purification steps, isolate all "stuck" proteins, identify by mass spec. Advantages: identify entire complexes, in vivo. Disadvantages: not necessarily direct interactions, contaminants. Protein microarrays were covered in the Oct 9 lecture.
Oct 9 MS	Protein microarrays: glass slide chips and nano-well arrays; high-density arrays hold hundreds to many thousands of proteins. They are an important tool for proteomics research. Applications: functional analysis of proteins; determining how proteins are modified and regulated; identifying protein pathways; protein profiling; target identification of biologically active small molecules, i.e., drug discovery and development. Types of arrays: (1) Antibody arrays: antibodies are spotted onto the protein chip and used as capture molecules to detect proteins; there are no antibodies for most proteins, and there are problems with specificity. (2) Functional protein arrays: proteins are spotted onto chips; this requires a high-quality expression library, methods for preparing large numbers of proteins (gene cloning, protein expression), methods for arraying (printing: random attachment on an aldehyde surface; ligand attachment), and assays for screening (protein-protein, protein-lipid, and protein-DNA interactions; small molecule (drug) screens). Phosphorylation (kinase assays): phosphorylation is catalyzed by various specific kinases (used to transmit signals and control complex processes in cells). Substrate (e.g., protein) arrays are used to determine the specificity of kinases: the kinase is purified, the chip is incubated with kinase and ATP, and the phosphorylation events are analyzed to find the number of proteins phosphorylated by a specific kinase. This maps out the phosphorylome, the sets of interactions between kinases and proteins, which gives clues about cellular processes and pathways. Use of protein chips has many advantages: the ability to screen many proteins simultaneously using only small amounts, high throughput, and diverse applications; the results, however, have to be validated in vivo.
Oct 16 Patrick O'Donoghue	Phylogenomics. Phylogenetic trees are useful tools. Even Darwin saw the potential for universal classification. In Darwin’s view, classification would be universal and result from the vertical transmission of traits from parent to offspring. It took until Woese & Fox (1977) to obtain this universal classification tree, through the use of the universally present ribosomal RNA (rRNA). This yielded the “Canonical” or “Universal Phylogenetic Tree (UPT),” the standard tree we think of with the three separate branches of Bacteria, Archaea, and Eucarya, and the Last Universal Common Ancestor at their last shared branch. But as we have seen and discussed, phylogeny is more complicated, and with whole genome sequencing we see the extent of the complications due to horizontal gene transfer (HGT). We can also gain information from ancient and recent homologues (both orthologues and ancient paralogues), but the trees yielded by different proteins and protein families can differ. We can also compare proteins at the level of function (i.e., structure) rather than direct sequence alone, which better represents homology because function is the level at which selective pressure for conservation actually acts. Another issue in determining phylogeny from genes is that different genes have different evolutionary rates. Two methods for addressing this problem: parsimony (look for the tree that requires the fewest changes) and maximum likelihood (look for the tree under which the observed sequences are most probable, given a model of evolution). There are a number of algorithms and programs for determining phylogenetic trees. A newer method of creating trees (now possible thanks to the availability of entire sequenced genomes) bases them on entire genomes, or on larger clusters or groups of genes. There are advantages and disadvantages to this approach. It can yield a muddier view of the phylogenetic tree. Yet it can help differentiate between very close organisms, and it also gives a more explicit detailing of the actual gene histories, in all their convoluted glory.
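The parsimony criterion from the Oct 16 lecture (prefer the tree that requires the fewest changes) can be illustrated with a small sketch. The code below uses Fitch's counting procedure, which the summary does not name; the tree encoding (nested 2-tuples of leaf names), the function names, and the toy alignment are all assumptions made for illustration.

```python
def site_changes(tree, states):
    """Fitch pass for one alignment column.

    tree   : a leaf name (str) or a 2-tuple of subtrees (hypothetical encoding)
    states : dict mapping leaf name -> character at this column
    Returns (possible ancestral states, minimum number of changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = site_changes(left, states)
    right_set, right_cost = site_changes(right, states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # The children disagree: one change is required on this branch pair.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over every column of the alignment."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += site_changes(tree, column)[1]
    return total


# Toy data (made up for illustration only).
alignment = {"human": "ACGT", "chimp": "ACGT", "mouse": "ACTT", "rat": "AGTT"}
tree_a = (("human", "chimp"), ("mouse", "rat"))
tree_b = (("human", "mouse"), ("chimp", "rat"))
print(parsimony_score(tree_a, alignment), parsimony_score(tree_b, alignment))
```

Under the parsimony criterion the tree with the lower score is preferred; a full search would compare many candidate topologies rather than just two.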
Bioinformatics Section
Oct 25 MG	What is Bioinformatics? Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical chemistry) and then applying "informatics" techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large scale. Molecular Biology Information: DNA Sequence; Protein Sequence; Macromolecular Structure; Protein Structure Details; Whole Genomes; Other Types of Data (Integrative Data, Gene Expression, Phenotype Experiments, Protein Interactions, etc.). "Informatics" Techniques in Bioinformatics: Databases; Text String Comparison; Finding Patterns; Geometry; Physical Simulation. Topics in Bioinformatics: Genome Sequence; Protein Sequence; Sequence / Structure; DBs/Surveys; Data Mining; (Functional) Genomics; Simulation.
Oct 30 MG	Bioinformatics Boundaries. For the purposes of this class, the field of bioinformatics is bounded by the analysis of investigations of macromolecules. Bioinformatics is: mining digital libraries, gene identification, homology modeling, metabolic pathway simulation. Bioinformatics is not involved with: structure determination methods, the DNA computer, populations of organisms, or artificial life simulations. Core Bioinformatics Areas: computing with sequences and structures; protein structure prediction; creating, using, and mining biological databases; metabolic networks; expression analysis. Sequence Alignment and Dynamic Programming. Needleman-Wunsch Global Alignment Algorithm. 1st step: create a similarity matrix (place 1’s where there are identical matches). 2nd step: compute the sum matrix starting in the lower right corner; add to the value of the current cell the maximum of the row and column that begin at the cell diagonally down from the current cell (so row +1 and column +1). 3rd step: find the highest score in the first row or column and trace the score to the lower right. Suboptimal Alignments: for any given pair of sequences there may be alignments which reach the maximum score but introduce more gaps than the optimal alignment (for the simple algorithm). More complexities to the Needleman-Wunsch Algorithm. Gap Penalties: gap penalty = a + bN, where a = penalty for opening a gap, b = penalty for extending a gap (affine gap penalties), and N = length of the gap. Substitution Matrices: certain substitutions are more favorable than others throughout evolution, so this can also be taken into account during the alignment procedure. Dynamic Programming. Basic idea: break a big problem down into smaller subproblems and use the optimal solutions to the subproblems to solve the bigger problem. Applied to sequence alignment: the best alignment up to a certain point is the best alignment up to the previous points plus the best alignment of the current points. Once you have aligned the N-terminal part of the sequence, it is fixed and not changed by the aligning of the C-terminal residues.
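The dynamic programming idea above can be sketched in a few lines. This is a minimal illustration, not the exact procedure from the lecture: it fills the matrix from the upper left (the more common modern formulation) rather than from the lower right, and it assumes a simple linear gap penalty instead of the affine form a + bN. The function name and the default scores are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Each cell reuses the optimal scores of the three smaller subproblems.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,            # gap in b
                              score[i][j - 1] + gap)            # gap in a

    # Trace back from the last cell to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

Swapping the match/mismatch scores for a lookup in a substitution matrix, or replacing the linear gap term with an affine one, changes only the scoring expressions, not the overall structure.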
Nov 1 MG	Similarity (Substitution) Matrices reflect mathematically that not all amino acid substitutions are equally likely. They can be constructed for nucleotides, too. One method of computing them uses log-odds: S = log2(observed frequency / expected frequency). Two common matrices are PAM and BLOSUM, both of which are based on empirical data from aligning related proteins. Different matrices are used for proteins of different evolutionary distance from each other. The number after BLOSUM represents percent sequence identity, and the number after PAM represents evolutionary distance in accepted point mutations per 100 residues. Matrix values change with evolutionary time and, in general, become lower as distance increases. BLOSUM 62 is the default BLAST matrix. Local alignment finds sub-regions of highest alignment score, even if it is at the expense of the overall alignment. It is especially appropriate for proteins that do not have similarity across their whole length or for aligning shared domains in otherwise divergent proteins. The modifications to global alignment include: using negative scores for mismatches, having a minimum score of 0 for a matrix element, and looking for the best score no matter its matrix location. Multiple Sequence Alignments can find information that pairwise alignments cannot by, for instance, looking across entire protein families. Special algorithms are needed because extending exact dynamic programming alignment to more than a handful of sequences is limited by the computing requirements. These algorithms are heuristic. Progressive multiple alignment first guesses at the evolutionary distance between all of the sequences by performing pairwise alignments for each possible pair to build a phylogenetic tree. It then aligns the most closely related sequences and creates a consensus sequence. This consensus sequence is aligned to the next most closely related sequence, whose data in turn is incorporated into the consensus sequence used with the next sequence to be aligned, and so on. Clustal is a well-known progressive alignment program. Two problems with progressive alignments are the local minimum problem (e.g., an error incorporated into the consensus sequence early on throws off the later results) and the parameter choice problem (one set of parameters may not be optimal for the alignment of every sequence in the set).
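The three modifications that turn global alignment into local alignment (negative mismatch scores, a floor of 0 on every matrix element, and taking the best score wherever it occurs) can be sketched as follows. This is an illustrative Smith-Waterman-style sketch; the function name, the linear gap penalty, and the default scores are assumptions rather than anything specified in the lecture.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Best-scoring local alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch  # mismatches score negative

    for i in range(1, rows):
        for j in range(1, cols):
            h[i][j] = max(0,                           # never drop below zero
                          h[i - 1][j - 1] + sub(i, j),
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:                         # best score anywhere in the matrix
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        if h[i][j] == h[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Two otherwise unrelated toy sequences that share the segment ACGT.
print(local_align("TTTACGTAAA", "GGGACGTCCC"))
```

Because the score is floored at zero, the dissimilar flanking regions do not drag down the shared segment, which is why this variant suits proteins that share a domain but are otherwise divergent.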
Nov 6 MG	HMM - Hidden Markov Model. Composed of match states, each corresponding to a column in a multiple alignment. Each state emits symbols according to its symbol-emission probabilities. Starting from state 0, a sequence of symbols is generated by moving from state to state until an end state is reached. Gaps and insertions can be easily modeled with an HMM. Algorithms: Forward: sums the probabilities of all paths that emit a given sequence. Viterbi: finds the most probable path through the model for a given sequence. Building the Model. Expectation Maximization: an algorithm that alternates between performing an expectation (E) step and a maximization (M) step. EM is greedy. The Score. S = ΣS(i,j) - nG, where S = total score, S(i,j) = similarity matrix score for aligning i and j, n = number of gaps, and G = gap penalty. Measures of how S ranks relative to all other possible scores: P value; percentile test score rank; all-vs-all comparison. P(s>S), the complement of the CDF, can be obtained by integrating the function p(S). p(S) is the score distribution function that will attempt to fit a TN distribution. The Extreme Value Distribution (EVD) fits the observed distributions best. The EVD describes the maximum of a set of independent random variables, whereas the Gaussian describes their sum. ROC Graph. The coverage vs. error rate graph. Score threshold: each point on the graph indicates the maximum coverage at the corresponding error rate. Can be used to compare the effectiveness of two “methods” in finding distant homologues. The significance of similarity scores decreases with database growth.
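The Forward and Viterbi algorithms mentioned above can be sketched on a generic HMM. The example below is a simplification: it uses a small two-state model rather than a profile HMM with match, insert, and delete states, and the state names and all probabilities are invented for illustration.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for obs; returns (path, log probability)."""
    v = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for s in states:
            # Best predecessor state for s at position t.
            prev = max(states, key=lambda p: v[t - 1][p] + math.log(trans_p[p][s]))
            v[t][s] = v[t - 1][prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, v[-1][last]


def forward(obs, states, start_p, trans_p, emit_p):
    """Total probability of obs, summed over all state paths."""
    f = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        f = {s: emit_p[s][symbol] * sum(f[p] * trans_p[p][s] for p in states)
             for s in states}
    return sum(f.values())


# Hypothetical two-state model: AT-rich vs GC-rich regions of DNA.
STATES = ["AT_rich", "GC_rich"]
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {"AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
         "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9}}
EMIT = {"AT_rich": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "GC_rich": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}

path, log_p = viterbi("AATTGGCC", STATES, START, TRANS, EMIT)
print(path, log_p, forward("AATTGGCC", STATES, START, TRANS, EMIT))
```

The Viterbi version works in log space to avoid underflow on long sequences, which assumes no probability in the model is exactly zero; the Forward probability is always at least as large as the probability of the single best path.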
Nov 8 MG	Computational Complexity. The NW algorithm requires filling in MN squares, and each square requires a maximum of M+N operations to find the max. Therefore the complexity is O(n^3). However, by storing the max value at each square, the time complexity can be reduced to O(n^2). Memory is also O(n^2). FASTA. FASTA is an alignment tool which first takes all words of length k in the query sequence and lists them with their query locations in a hash table. Then FASTA traverses the database sequences, takes each word of length k in the database, and finds its corresponding locations in the query from the hash table. These exact matches are then extended along diagonals to an ungapped alignment and finally between diagonals to a full gapped alignment. BLAST. BLAST is similar to FASTA in that it also performs sequence alignments and starts by making a hash table index from the query sequence, using all words of length w that reach a certain threshold score when aligned to words of length w in the query. The database is then searched for these length w words, and extensions along the diagonal are created such that the local alignment stays above a threshold score. These local alignments are called HSPs (high-scoring segment pairs), and the highest scoring HSP is the MSP (maximal segment pair). Revisions have been made to the original BLAST algorithm. We can allow gaps by looking for two HSPs within a certain distance on the same diagonal. We can take BLAST hits from a query sequence and arrange them into a profile which shows the frequencies of residues at each position. This profile can be used to search the database, and then a new profile can be created. This process continues, leading either to convergence, in which eventually no new sequences are added, or to explosion, in which all sequences are added to the profile. We also discussed issues involved in DNA and protein search principles. Sequence to Structure. Structure prediction has not been tremendously successful, but it employs many different methods. Transmembrane prediction is a fairly simple method for predicting which segments of proteins are found in membranes. This is accomplished by using a hydrophobicity scale for each amino acid and finding the average hydrophobicity score for a given window. If this score is above a certain threshold, then the segment is predicted to be in a membrane. Statistical methods have been developed to determine the likelihood of a certain residue being in a given secondary structure, for example in a transmembrane helix. One method involves first calculating the frequency of an amino acid in the database, fDB, then calculating the frequency of the amino acid in known helix regions, fHLX, and finally taking the logarithm of their ratio, ln(fHLX/fDB).
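The sliding-window hydrophobicity calculation and the ln(fHLX/fDB) propensity described above can be sketched as follows. The lecture summary does not say which hydrophobicity scale, window length, or threshold was used; the Kyte-Doolittle scale, a window of 19 residues, and a threshold of 1.6 are assumed here as conventional choices, and the function names are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values (assumed scale; not named in the lecture).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}


def window_hydropathy(seq, window=19):
    """Average hydropathy of every window of the given length along seq."""
    return [sum(KD[aa] for aa in seq[start:start + window]) / window
            for start in range(len(seq) - window + 1)]


def predicted_tm_windows(seq, window=19, threshold=1.6):
    """Start positions of windows whose average exceeds the threshold."""
    return [start for start, avg in enumerate(window_hydropathy(seq, window))
            if avg > threshold]


def helix_propensity(f_hlx, f_db):
    """Log-odds of finding a residue in helices relative to the whole database."""
    return math.log(f_hlx / f_db)


# Toy sequence: a hydrophobic stretch flanked by charged residues.
toy = "D" * 10 + "L" * 19 + "K" * 10
print(predicted_tm_windows(toy))
```

A positive propensity means the residue is over-represented in helices relative to the database as a whole, and a negative value means it is under-represented.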
Nov 13 MG	Basic GOR. GOR is a program that attempts to assign secondary structure to protein sequences. The basic idea is to look at the statistics of known structures to determine the probability that a particular residue is in a particular conformation. The conformation with the highest probability is then chosen. This is accomplished by including additional information from a sliding window of 17 neighboring residues to see how likely a secondary conformation is given these neighboring residues. GOR IV. Armed with more structure data, the current version of GOR extends this basic idea by also comparing all pairs of residues within this window instead of just the frequencies of single residues. It also includes semi-parametric improvements, or hacks, which filter out things like signal sequences and account for the fact that there cannot be a lone residue in a helix or strand conformation. One can envision extending GOR even further by incorporating larger windows or looking at triplets or even more combinations of residue patterns. However, even the current version of GOR does not perform perfectly, because it never takes into account long range structural effects. There is also the danger of over-training, in which the method memorizes one set of data but performs poorly on new data. Other Secondary Structure Prediction Methods. DSC is another method which is similar to GOR, but improves on it by taking other physical factors of the residues into account and applying additional filters to get a better prediction than one based on sequence alone. Neural networks have also been applied; they can try to model and learn the underlying patterns of the structure, although it is much less intuitive to understand where exactly a prediction came from. Finally, many other methods have been attempted, including a Jury method which tries to combine multiple methods to give a more conclusive prediction. Nonetheless, all of these methods have their drawbacks and do not work perfectly. Fold Recognition. Another way to predict the structure of a given protein sequence is to try to fit the sequence to a structure or structural fold that is already known. Thus, fold recognition is the problem of aligning a sequence to a library of known structures in order to predict the structure, or fold, of that given sequence. One way to align a sequence to a structure is by threading, which attempts to fit a sequence to a known structure. To evaluate how well a sequence aligns to a given structure, we use an energy function which gives a score to each alignment based on the environment and molecular interactions of the new structure, while taking into account gaps and mismatches. Threading is useful because we can again use dynamic programming to quickly get the best alignment to a structure, which is much faster than simulation-based methods of structure prediction.
Nov 15 MG	Structures. Protein geometry: coordinates (X, Y, Z's); derivative concepts; relation to function, energies, dynamics. Structure alignment: RMS superposition, the optimal movement of one structure to minimize the RMS; moving molecules rigidly; generalized similarity matrix; threading. Scoring structural similarity: some similarities are readily apparent, others are more subtle. Other aspects of structure, besides just comparing atom positions. Voronoi Volumes: Packing Efficiency = V(VDW) / V(Voronoi); small packing changes are significant; close-packing is the default. The Protein Surface: Delaunay triangulation, the natural way to define packing neighbors; convex hull; Richards' molecular and accessible surfaces; packing defines the "Correct Definition" of the protein surface. Voronoi polyhedra are the natural way to study packing.
Nov 27 MG	Structure Simulation. This lecture went through the basic ideas of describing the electric interactions between particles. The main concepts were the potential functions. The first are the electric potential equations for dipoles and multipoles. The next are the van der Waals (VDW) forces and induced dipoles. Then, the classic spring model of particles was introduced. Potential energy can be calculated considering both VDW and electrostatics. The structure can then be minimized to lower this potential energy.
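The potential energy terms named above can be sketched as simple functions. The specific functional forms are assumptions, since the summary does not give the equations: a point-charge Coulomb term, a 12-6 Lennard-Jones term for VDW interactions, and a harmonic spring for bonds. The constant 332.06 assumes charges in units of e, distances in angstroms, and energies in kcal/mol; all function and parameter names are hypothetical.

```python
import math

COULOMB_K = 332.06  # kcal*angstrom/(mol*e^2); assumed unit system


def coulomb(q1, q2, r):
    """Electrostatic energy of two point charges a distance r apart."""
    return COULOMB_K * q1 * q2 / r


def lennard_jones(r, epsilon, sigma):
    """12-6 VDW energy: repulsive at short range, weakly attractive beyond sigma."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def harmonic_bond(r, r0, k_bond):
    """Spring model of a bond with rest length r0 and force constant k_bond."""
    return 0.5 * k_bond * (r - r0) ** 2


def nonbonded_energy(coords, charges, epsilon, sigma):
    """Sum of VDW and electrostatic energy over all atom pairs.

    Uses one epsilon and sigma for every pair, a simplification for the sketch.
    """
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            r = math.dist(coords[i], coords[j])
            total += lennard_jones(r, epsilon, sigma) + coulomb(charges[i], charges[j], r)
    return total


# Two oppositely charged toy atoms 3 angstroms apart.
print(nonbonded_energy([(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)], [1.0, -1.0], 0.1, 3.0))
```

The double loop over pairs is why non-bonded interactions dominate the cost of a simulation: the number of pairs grows with the square of the number of atoms.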
Nov 29 MG	Structure Simulation. Energy minimization can be carried out through a variety of techniques: steepest descent, conjugate gradient, Newton-Raphson. Need to avoid local minima. Different techniques use varying levels of derivatives. Adiabatic mapping finds low energy paths between two conformational states. Molecular dynamics (MD) gives each atom a velocity and updates position and velocity at set time intervals; it mimics real motion. Monte Carlo (MC) methods make random changes to atomic positions, then randomly accept or reject these changes according to Boltzmann weighting; they can be very efficient at sampling a variety of conformations. Practical aspects of simulations: effects of water must be taken into account; macromolecules can be simulated inside a box of water; use periodic boundary conditions to avoid edge effects; calculating non-bonded interactions takes the most time because of the long range. Typically study long-term averages of simulations: an individual snapshot can be deceptive (because anything can happen); can calculate number density and radial distribution functions (RDFs). Simulations can be simplified: proteins are commonly simplified using lattice and off-lattice (discrete state) models.
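The Monte Carlo idea (propose a random change, then accept or reject it according to Boltzmann weighting) can be sketched with the Metropolis acceptance rule. The summary does not name a specific rule, so Metropolis is an assumption here, as are the one-dimensional coordinate, the harmonic toy energy, and all parameter names.

```python
import math
import random


def metropolis(energy, x0, steps=20000, step_size=1.0, kT=1.0, seed=0):
    """Sample a 1-D coordinate with Metropolis Monte Carlo; returns all samples."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    samples = []
    for _ in range(steps):
        # Propose a random displacement of the coordinate.
        x_new = x + rng.uniform(-step_size, step_size)
        e_new = energy(x_new)
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-dE / kT) (Boltzmann weighting).
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / kT):
            x, e = x_new, e_new
        samples.append(x)
    return samples


def harmonic(x):
    """Toy energy function: a spring centered at zero."""
    return 0.5 * x * x


samples = metropolis(harmonic, x0=0.0)
mean = sum(samples) / len(samples)
mean_sq = sum(s * s for s in samples) / len(samples)
print(mean, mean_sq)
```

As the lecture notes, individual snapshots say little; it is the long-run averages (here the mean and mean square of the coordinate) that are meaningful. For this toy energy with kT = 1, the mean should be near 0 and the mean square near 1.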
Dec 4 Kei Cheung	The lecture started off with a brief overview of databases: what databases are, some examples of databases, their different components, data models, query languages, etc. The importance of normalization, query optimization, and maintenance of the DB was also discussed. The major topic was the motivation for and importance of DB integration. One of the main issues of DB integration is the heterogeneity of the data in the various DBs. One potential solution to help integration of DBs is the Semantic Web. The last part of the lecture introduced the concept of the Semantic Web and concluded with an example of an implementation of a Semantic Web.
