Yale Genomics & Bioinformatics

CBB752/CPSC452/CPSC752/MBB452/MBB752/MCDB452/MCDB752

http://www.gersteinlab.org/courses/452/

Lecture Summaries

Genomics Section
Jan 14 DS	Overview of the syllabus and topics covered in the course. Genomics is the study of all genes in a genome as a whole. Genomes vary greatly in GC content, size, and gene density: Nanoarchaeum equitans has fewer than half a million bases and 552 genes, while Zea mays (corn) has about 5 billion bases and 60,000 genes. Life can be divided into bacteria, archaea, and eucarya. Some eukaryotes lost mitochondria at some point during their evolution. Many extremophiles are archaea.
Jan 16 MG	What is Bioinformatics? Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules on a large scale. Molecular Biology as an Information Science: sequence data, structure data, integrated data. Large-scale Information: computer power, data volume, publication. Organizing. General Types of Bioinformatics: databases, text string comparison, finding patterns, geometry, physical simulation. Topics in Bioinformatics: genome sequence, protein sequence, sequence and structure, databases, data mining, functional genomics, simulation. Applications in Bioinformatics: drug design, finding homologs, genome characterization.
Jan 23 DS	Diversity of life. No one can say how many organisms there are. Life is varied and organisms are very self-sufficient; they live at different temperatures, pH levels, and pressures. Such organisms are hard to maintain, so genomics is used to study their DNA sequence. Goal: understand the properties of proteins and RNA, how they fold, and how stable they are. Hyperthermophiles: the start of life? Crystallography is frequently done with hyperthermophilic proteins to study structure. How do organisms differ biochemically? Membrane phospholipids are either ester-linked or ether-linked; other differences include ribosome size, cell walls, and the nuclear membrane. Proteins and nucleic acids: one should understand proteins and be able to pull out the information encoded in the DNA. Bases: adenine, guanine, cytosine, uracil, and thymine. Modified nucleosides are found in RNAs of known structure, and more are being discovered. DNA is stable and RNA is much less stable. Torsion angles can affect stability; for example, the sugar pucker of the furanose ring changes the distance between the phosphates. DNA strands are antiparallel, and the helix forms a major and a minor groove. Molecular biology tools: restriction enzymes protect the host from foreign DNA and come in two types, Type I (unspecific) and Type II (specific). Amino acids make up proteins: there are 20 common amino acids; two additional amino acids may be co-translationally inserted (selenocysteine and pyrrolysine); further amino acids may be created by post-translational modification, with estimates of around 300 to 400. Don’t restrict your thinking when working with the genome.
Jan 28 DS	Two main secondary structural units are found in proteins: the α-helix and the β-sheet. Halophilic enzymes are irreversibly inactivated by low salt concentrations. Features of thermostable enzymes: additional salt bridges, shorter loop regions, additional aromatic interactions, stabilizing substitutions in α-helices, reduced surface area. The longer a DNA sequence, the higher the chance that it is found only once in a genome. Electrophoresis is used to separate molecules of different sizes. Sanger sequencing: single strands are randomly terminated with dideoxynucleotides; reads are made either manually on a gel or automatically with 4 different fluorescent dyes. Pyrosequencing: light is given off each time a nucleotide is incorporated into the growing strand. Genome sequencing approaches: hierarchical; shotgun.
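The remark above about sequence length and uniqueness can be made concrete with a rough estimate. The sketch below assumes a random genome with equal base frequencies and counts one strand only; the function names are hypothetical, and real genomes (with repeats and skewed GC content) will need longer sequences than this suggests.

```python
def expected_occurrences(k: int, genome_size: int) -> float:
    """Expected count of one specific k-mer in a random genome.

    Assumes equal base frequencies (each base has probability 1/4)
    and counts start positions on a single strand.
    """
    positions = max(genome_size - k + 1, 0)
    return positions / 4 ** k


def min_unique_length(genome_size: int, threshold: float = 1.0) -> int:
    """Smallest k whose expected occurrence count drops below threshold."""
    k = 1
    while expected_occurrences(k, genome_size) >= threshold:
        k += 1
    return k


if __name__ == "__main__":
    # Genome sizes taken from the Jan 14 summary (rounded).
    for name, size in [("N. equitans", 500_000), ("Zea mays", 5_000_000_000)]:
        print(name, min_unique_length(size))
```

Under these assumptions a sequence of roughly 10 bases is expected less than once in a half-megabase genome, while a 5-billion-base genome needs about 17.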
Jan 30 DS	Two major types of sequencing: sequential and parallel. Sanger method (capillary electrophoresis), ABI 3730: sequence is resolved at the end of the reaction (not in real time); synthesis is terminated with fluorescently labelled dideoxynucleotides; requires cloning and library preparation from the template, which is a time-consuming step. Pyrosequencing, 454: 400,000 reads per run, 250 bp per read, 100 Mb per run, 7 hours; no cloning; massively parallel, since each fragment/bead/well is sequenced at the same time. Sequencing by synthesis, Solexa/Illumina: 1 Gb of data per run, 35 bp per read, 4 days per run; also massively parallel. Sequencing by ligation (ABI's SOLiD): 40 million reads per run, 1-2 Gb per run per slide (4 days), 35 bp per read; emulsion PCR and beads; two bases queried at a time; extremely accurate sequence; massively parallel; good for SNP detection and verification at this stage. General sequencing notes: the new methods are expected to play a role in the 1000 Genomes Project and in the field of personal genomics; genomic data raise ethical concerns. Issues in sequencing: completeness/coverage (how deep is the coverage of the region being sequenced?), accuracy (what is the error rate?), and validity of the assembly (regions can be correctly sequenced but placed in the incorrect order). Annotation challenges: determining orthologs/paralogs; biochemical/genetic characterization; sequencing errors can and do lead to incorrect annotations; computational task. cDNA: the DNA created by reverse-transcribing mRNA; essentially it is DNA without introns. ESTs (expressed sequence tags): partial DNA sequences from the 5' or 3' end of a transcript that represent different exons in different clones from the same gene.
Feb 4 DS	The ultimate goal of structural genomics is to solve the structure of all available proteins using high-throughput methods. Protein Structure Initiative (PSI): a federal, university, and industry effort that seeks to reduce the cost and time needed to determine the three-dimensional structure of proteins. X-ray crystallography: newer methods such as MAD (multiwavelength anomalous diffraction) do not require soaking crystals in a heavy-atom-containing solution; instead, the target protein is expressed in a methionine auxotroph cultured in selenomethionine-rich media, which supplies selenium atoms. Crystallization formulations for efficient screening of crystallization conditions have recently been developed. (+) High resolution. (+) Structure determination of large molecules is possible. (-) The protein must crystallize (a problem for membrane proteins and large protein assemblies). NMR: (+) can study the dynamics of molecules; (-) lower resolution; (-) restricted to small molecules; (-) requires large amounts of protein. From genomes to high-resolution protein structures: genomes, target selection, gene cloning and expression, protein production and purification, crystal production, structure determination and refinement, model validation and fold/function analysis, dissemination of data (PDB deposit). PSI assessment: coverage of fold space may be nearly complete; sequence space is still growing linearly; many solved structures have no associated biological function (many are structures of apoproteins with no bound substrate).
Feb 6 DS	Trends in biotech: genetic analysis (SNP, mutation, sequencing); expression analysis (mRNA, protein); interaction analysis (protein-protein, antigen-antibody, enzyme-substrate, protein-DNA, ligand-receptor). Transcription is the process that takes DNA to mRNA; it is influenced by promoter activity, repression/attenuation, induction, and DNA methylation/chromatin remodeling. Translation converts mRNA to protein; it is influenced by mRNA lifetime, codon usage and tRNA levels, ribosome binding, alternative splicing, and RNAi. Metabolomics studies how proteins affect metabolites and functional interactions. Transcriptomics: analysis of when, where, and how much each gene is expressed. Genomic analysis of a cell/organism involves stimulating the cell, changing the cell state, and identifying and examining the crucial/causal events (nodes in the tree of causation). For Affymetrix DNA microarrays: the sequence is needed; cDNA has to be made; mRNA can be isolated from any organism. Annealing is concentration-dependent; abundant genes anneal to each other. Affymetrix GeneChip technology: RNA -> RT -> cDNA -> labelled transcripts -> labelled fragments.
Feb 11 DS	Proteomics can be broken down into the sub-fields of post-translational modifications, protein-protein interactions, structural proteomics, functional proteomics, proteome mining, and protein expression profiling. Proteins can be identified using electrophoretic separation and protein fingerprinting with mass spectrometry. Differential expression can be detected using ICAT labeling or ¹⁵N labeling. The fluxome measures the movement of small molecules through cellular pathways. Chemical genomics: the systematic use of small molecules to explore biology; it aims to understand the full range of protein classes and what types of small molecules interact with each, for example which proteins and pathways a given drug affects. Metagenomics: genomic analysis of environmental sequences, for example analyzing all the DNA in a cup of soil from the Bass courtyard.
Feb 13 MS	Gene inactivation. Pro: one of the best ways to learn the function of a gene. Cons: often results in no phenotype, because of gene redundancy, a subtle effect, or inappropriate assay conditions; primary vs. secondary effects are difficult to distinguish. Three methods of gene inactivation: 1. Insertional mutagenesis. A transposon knockout is made by uncoupling transposase activity from natural transposons and inserting a gene marker. It can be used on a genome-wide scale to deduce the “minimal genome”. The primary assumption is that insertion of the transposon eliminates gene function completely; reaching full saturation is also problematic. Gene traps create gene fusion proteins using a marker that lacks an ATG start. 2. Systematic knockout (K.O.) using selectable markers. “Barcodes” (20-base sequences) can uniquely identify genes and be put onto a microarray to follow the phenotype of many strains at once. Relies on annotated sequence but gives clear null phenotypes; expensive but comprehensive. Both methods 1 and 2 have been used successfully to identify drug targets as well as gene function. 3. RNAi. dsRNAs can be processed to 21-nt ss siRNAs that target complementary mRNAs for degradation or prevent their translation. Methods of introduction: direct transfection of siRNA; insertion of shRNAs using retroviral constructs; in the case of C. elegans, dsRNAs can be synthesized by cloning genes into E. coli plasmids and feeding the bacteria to worms. Pros: applicable to many organisms since RNAi is well conserved; cheap and easy to use comprehensively; can knock down the expression of a gene family. Cons: only a knockdown; non-specific effects; some genes are not affected.
Feb 18 MS	Protein-protein interaction screens have many biological applications, including the construction of complete pairwise interaction maps and regulatory networks, the characterization of large complexes, the identification of novel functions for biological molecules, and the molecular characterization of disease states. Yeast two-hybrid. Pairwise screening: uses bipartite transcription factors. Protein A is fused to a DNA-binding domain and protein B to an activation domain; interaction between A and B initiates transcription of a selectable marker. Yeast are transformed by homologous recombination. Haploid a and α mating types carrying the constructs are mated to systematically test 6200 x 6200 interactions. Large-scale screening: 64 bait pools containing 96 constructs each are systematically combined with 64 prey pools of 96 constructs each; selective media are used to screen for matings that transcribe the marker gene. Advantages: in vivo assay; experiments are simple to design; no need to purify proteins, since everything is done with cloned constructs. Disadvantages: fusion proteins interacting in the nucleus is an artificial system; large number of false positives and not much overlap between the results of different screens; hard to test on a large scale, since a pooled screen does not test all interactions equally. Affinity tagging/mass spectrometry. TAP (tandem affinity purification) tagging: the gene of interest is fused to a calmodulin (CaM)-binding domain, a TEV protease cleavage site, and Protein A. Complexes containing the protein of interest are purified by binding of Protein A to IgG beads, eluted by TEV proteolysis, purified again by binding to CaM beads, and eluted with EGTA. Identification of proteins by mass spec: the complex is loaded on a denaturing SDS-PAGE gel; bands are excised and digested with trypsin, and the fragments are run on the mass spectrometer to identify the protein.
Advantages: assay is performed in vivo under native conditions; identifies the whole complex, not just pairwise interactions. Disadvantages: identifies indirect interactions within a complex; false positives and negatives from missing rare components or including contaminants. Protein chips. Antibody microarray: antibodies to thousands of proteins are spotted onto a chip and bind specific epitopes on the applied proteins with very high affinity. Disadvantage: antibodies are not available for all proteins. Functional protein microarray: can assay protein-protein interactions as well as small-molecule binding, enzyme-substrate interactions, and kinase activity. An inducible promoter in yeast drives expression of a large quantity of each protein, which is printed onto proteome chips. Proteins, lipids, nucleic acids, or small molecules are applied to the chip; the molecule of interest is then probed with a fluorescent marker to identify which protein spots it is bound to. Advantages: high-throughput parallel screening; uses small amounts of proteins and other molecules; many different biochemical applications. Disadvantages: in vitro assay that must be validated; requires a large, high-quality expression library.
Feb 20 Chris Noren NE Biolabs	Combinatorial biology: systematic exploration of variations of a heterobiopolymer. Example found naturally: immune system diversity, which results from combinatorial recombination of VDJ (variable, diversity, joining) segments; ~10^7 potential variable regions. Other examples from past studies. Protein engineering: combinatorial mutagenesis followed by genetic selection, e.g. converting a penicillin-binding protein (PBP) into a β-lactamase. In vitro evolution of combinatorial biopolymer libraries: systematic evolution of ligands by exponential enrichment (SELEX). Ligands are enriched over about 8-15 rounds, sometimes more. Useful for mapping sequence requirements for protein-DNA/RNA interactions, discovering novel ligands for small molecules, surfaces, or proteins, and observing the evolution of novel ribozymes. In vitro selection of a ribozyme with polynucleotide kinase activity: coupling catalytic activity to replication. Natural selection in a test tube: amplification becomes more rapid from round to round. Phage display technology. Diversity in older methods was limited to <10^6 because of spatial addressing. New method: filamentous phage display of combinatorial peptide libraries. Libraries of high complexity (>10^9 clones) can be prepared. Ligands are selected by iterating through sequential rounds of panning/amplification. Applications. Peptide libraries: vaccine development, enzyme inhibitors, and basically any study requiring large numbers of different peptides. Protein libraries: creating antibody libraries; screening a library of mutants of a single protein to select for altered specificity. Three peptide libraries: Ph.D.-7, -12, and -C7C. Much faster: epitope mapping of an anti-β-endorphin MAb yielded a result in 6 days, as opposed to weeks for "traditional" epitope mapping. Combinatorial chemistry: semisynthetic phage display libraries. Phage display can easily identify "hits" from complex libraries, but peptides have limited functional diversity (20 "canonical" amino acids). Combinatorial chemistry can incorporate extremely diverse functionalities, but hits are more difficult to identify. Combining the two gives chemically modified phage display: the functional diversity of the displayed molecules is expanded by performing reactions on the peptides prior to each round of panning. This depends on a unique reactive site for modification. Selenocysteine (Sec): Se is 99% deprotonated at neutral pH; incorporated into proteins in all three kingdoms of life, a sort of 21st amino acid. Prokaryotic Sec incorporation occurs by translational recoding (context-dependent opal suppression): UGA serves as both a stop codon and the Sec codon. Phage-displayed selenopeptide library. Application: selection of optimal chimeric ligands. Application: direct mechanical manipulation of M13 phage: hold a streptavidin-coated microsphere in an optical trap; use the wormlike chain model for polymer stretching, whose equation gives the contour length and persistence length; it was found that DNA is very flexible, while phage is much stiffer than dsDNA of comparable length. Application: in vivo selection system for Sec insertion requirements. Differential chemical reactivity of Cys and Sec in displayed peptides is assayed by biotinylation.
Feb 27 Patrick O’Donoghue	Phylogenomics infers phylogeny from data taken from whole genomes. It is important for understanding the tree of life, the evolution and properties of pathogens, human evolution, and more. Background: Charles Darwin proposed and sketched a theoretical "tree of life" in Origin of Species (1859): evolution proceeds through diversification, so all life can be traced back to a single source. Early ideas were not based on data: three kingdoms (1866), five kingdoms (1968), etc. Carl Woese et al., 1977: three domains (Eukaryota, Bacteria, Archaea), based on molecular data. Ribosomal RNA (rRNA) is universal among cellular organisms and highly conserved. Variations: unrooted tree (1993); "net of life" (2005). Molecular phylogenetics provides data from comparisons of gene and protein sequences to help deduce the evolutionary relationships between organisms. It supports the rRNA tree of life but complicates and enriches it. Paralogs show a gene history that extends back beyond the last universal common ancestor (LUCA). Non-rRNA-based phylogenies: every gene has its own history, and different genes evolve at different rates. Proteins can be compared by sequence (identity or similarity) and structure. Structural homologies can tell us more about the earliest reaches of the history of life. Example: aminoacyl-tRNA synthetases (aaRSs). Relationships between branches of the phylogenetic tree show common ancestries but also differences in the rate of evolution of different genes. Methods of building trees: clustering algorithms; optimality methods (parsimony, maximum likelihood, heuristic search). Protein domains and superfamilies contribute to phylogenetic understanding through presence/absence and evolutionary distance. Genome conservation weights the average sequence similarity by the number of homologs between genomes; it reinforces the canonical rRNA tree. Horizontal gene transfer (HGT) complicates the tree of life: the "net of life." Pathogen evolution: tracks changes in virulence and geographical spread.
Human evolution: phylogenomics may help us better understand how and why we differ from macaques and chimpanzees. Different genes have evolved at different rates; we are closely related, but some genes may have changed rapidly. Horizontal gene transfer: there is evidence that genes are transferred between bacteria and archaea. Something similar in eukaryotes? Endogenous retroviruses (ERVs) and transposons in humans. Phylogenomics provides a sensitive measure of phylogenetic relationships and of different rates of gene evolution, and explicitly represents both vertical and horizontal gene transfer.
Mar 5 MG	Discussion topic: what is bioinformatics, what do you need to know, and what are some peripheral topics? Concentrating on sequences and comparisons. Basic steps for alignments: make a dot plot; compute the matrix of sums; trace back through the sums to find the alignment (there can be alternate tracebacks, which give different alignments). Gaps: to account for possible gaps in an alignment, introduce a gap penalty, Gap = a + bN, where a = cost of opening a gap, b = cost of extending a gap, and N = length of the gap. Should be able to align 4-mers. Dynamic programming: the idea is to solve a subproblem once and keep building on the stored solutions, which is more efficient because the same problem does not have to be solved over and over again. Similarity (substitution) matrix: can be trickier; can be used to align things like protein structures and alignments. Two common matrices are PAM and BLOSUM; BLOSUM62 is the default matrix for BLAST. Two different types of alignment: Needleman-Wunsch (global alignment), the best alignment over the entirety of both sequences; Smith-Waterman (local alignment), the best alignment of segments without regard to the rest of the sequences. For Smith-Waterman: use negative scores for a mismatch; set the minimum score in the matrix to zero; the best score can be found anywhere in the matrix, not just in the last column or row.
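The alignment steps above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: it uses a simple match/mismatch score instead of PAM or BLOSUM, and a linear gap penalty (the a = 0 special case of Gap = a + bN). The function name and default scores are assumptions chosen for the example, and only the score is returned (no traceback).

```python
def align_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2, local=False):
    """Fill the dynamic-programming matrix and return the alignment score.

    local=False: Needleman-Wunsch style global score (bottom-right cell).
    local=True:  Smith-Waterman style local score (best cell anywhere,
                 with every cell floored at zero).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
            if local:
                cell = max(cell, 0)          # minimum score is zero
                best = max(best, cell)       # best score can be anywhere
            score[i][j] = cell
    return best if local else score[-1][-1]


if __name__ == "__main__":
    print(align_score("ACGT", "AGT"))                        # global
    print(align_score("TTTACGTTT", "GGACGGG", local=True))   # local
```

Each cell is computed once from its three already-filled neighbors, which is the "solve each subproblem once" idea. Supporting separate gap-opening and gap-extension costs would need extra matrices and is left out here.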
Mar 24 MG	Multiple sequence alignments: using dynamic programming directly is too computationally complex, so progressive multiple alignment is used instead: do a pairwise alignment of the most closely related sequences; replace this pair with a single sequence that represents their alignment; repeat until all sequences are aligned. This results in a tree of sequences; the result is not necessarily optimal, but should be close. Motifs are regular-expression representations of sequences and can represent more complex ideas than simple sequence alignments. Hidden Markov models are composed of match states, each corresponding to a column in a multiple alignment; each state has probabilities for emitting symbols (corresponding to amino acids or nucleotides) and probabilities for transitioning to other states (including itself); gaps and insertions can easily be modeled. Common algorithms: Forward sums the probabilities of all paths that emit a given sequence; Viterbi finds the most probable path through the model for a given sequence. Scores: the formula for the score is S = ΣS(i,j) - nG, the sum of the similarity-matrix scores S(i,j) over aligned pairs minus the gap penalty G for each of the n gaps; scores are most meaningful when put in relation to other scores, for example as P-values; scores typically follow extreme value distributions. The ROC (receiver operating characteristic) curve shows sensitivity vs. specificity (usually plotted as sensitivity against 1 - specificity).
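The Forward and Viterbi algorithms named above can be sketched on a toy model. The two-state HMM below (AT-rich vs. GC-rich regions) and all of its probabilities are made up for illustration; it is a generic HMM, not the profile HMM with match states described in the lecture, and it works with plain probabilities rather than logs, so it is only suitable for short sequences.

```python
STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
EMIT = {
    "AT_rich": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "GC_rich": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}


def viterbi(obs, states, start, trans, emit):
    """Return (most probable state path, probability of that path)."""
    column = {s: start[s] * emit[s][obs[0]] for s in states}
    backpointers = []
    for symbol in obs[1:]:
        new_column, pointers = {}, {}
        for s in states:
            # Best predecessor state for ending in s at this position.
            prev = max(states, key=lambda p: column[p] * trans[p][s])
            new_column[s] = column[prev] * trans[prev][s] * emit[s][symbol]
            pointers[s] = prev
        column = new_column
        backpointers.append(pointers)
    last = max(states, key=lambda s: column[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, column[last]


def forward(obs, states, start, trans, emit):
    """Return the total probability of obs, summed over all state paths."""
    column = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        column = {
            s: emit[s][symbol] * sum(column[p] * trans[p][s] for p in states)
            for s in states
        }
    return sum(column.values())


if __name__ == "__main__":
    path, p_best = viterbi("AATTGGCC", STATES, START, TRANS, EMIT)
    print(path, p_best, forward("AATTGGCC", STATES, START, TRANS, EMIT))
```

The two functions share the same recurrence; Viterbi takes the maximum over predecessor states where Forward takes the sum, so the Forward probability is never smaller than the Viterbi path probability.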
Mar 26 Kei Cheung	The Human Genome Project ushered in a new era in biology, shifting the focus from wet-bench biology to computational biology. Three different “versions” of the web. Web 1.0, the original WWW: pages designed for a single person/group; read-only; static; document-centric and HTML-based; human-readable. Web 2.0: more collaborative (social networking); added multimedia content (podcasts, videos, etc.); pages now interactive/dynamic; web services; ability to create “data mashups” from distinct sources (Yahoo Pipes is one tool for creating mashups; examples: cancer vs. water pollution map overlay, avian flu outbreak map using Google Earth); can mix different data types to form one cohesive analysis; community-based; XML. Web 3.0, the semantic web: common framework; RDF (URIs and graph structure; data represented by triples of subject, predicate, value); focus on machine-readable metadata; ontologies (representation of concepts and their relationships); topic maps (an ISO standard; a different triple of topics, associations, and occurrences; organizes information in a way that can be optimized for navigation; XTM and TMQL are the key languages). Benefits of Web 3.0: human- and machine-readable data; social and semantic network; syntactic and semantic mashups; platform for data/tool sharing and integration, scientific collaboration, and e-learning. More use cases are needed to convince scientists that it is worth the extra startup effort for these types of projects. Problems with Web 1.0 and 2.0: lack of annotation, lack of links, lack of link semantics, lack of data semantics, lack of standards; people don’t adhere to existing standards.
Mar 31 MG	Definitions of TN, TP, FN, FP: TP = predicted positive and actually positive; TN = predicted negative and actually negative; FP = predicted positive but actually negative; FN = predicted negative but actually positive. AUC = area under the ROC curve. Specificity = TN/(TN+FP). Sensitivity = TP/(TP+FN). A score becomes less significant as the database size increases. Simpler (low-complexity) sequence regions have lower significance. Computational complexity: NW alignment takes O(n³) time in its general form (O(n²) with the standard optimizations) and O(n²) space. FASTA makes a hash table of short words. BLAST extends hash word hits without any gaps while the extension is favorable. BLAST 2 joins words like FASTA. PSI-BLAST: do one BLAST search; get results; use them to build a profile; search with the profile as query; iterate. Sequence to structure: secondary structure prediction; predicting transmembrane helices.
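The definitions above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, labels are 1 (actually positive) or 0 (actually negative), and the AUC is computed by the pairwise-ranking shortcut (the fraction of positive/negative pairs in which the positive scores higher, ties counting one half), which equals the area under the ROC curve.

```python
def confusion_counts(labels, predicted):
    """Count TP, TN, FP, FN from true labels and binary predictions."""
    pairs = list(zip(labels, predicted))
    tp = sum(1 for y, p in pairs if y and p)
    tn = sum(1 for y, p in pairs if not y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that were predicted negative."""
    return tn / (tn + fp)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking of scores."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
    predicted = [1 if s >= 0.5 else 0 for s in scores]  # threshold is arbitrary
    tp, tn, fp, fn = confusion_counts(labels, predicted)
    print(sensitivity(tp, fn), specificity(tn, fp), roc_auc(labels, scores))
```

Moving the threshold trades sensitivity against specificity; the AUC summarizes that trade-off over all thresholds in one number.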
Apr 2 MG	TM prediction: a window average over the GES hydrophobicity scale is used; does a peak indicate a helix? Verify by transforming the graph. Probability that a residue has a given secondary structure: the scale is based on database frequency vs. expected frequency, with a log(odds) transformation; frequency of a particular residue / general frequency = desired frequency. Find a gold standard in the training set: A in DB / A in total. The cell cytoplasm is more negative than the outside of the cell. GOR (parametric statistical prediction): classic, early method of predicting secondary structure that adds up a set of log-odds values. It asks how much information a particular residue provides about being in a particular location. Events are treated as independent; the frequency of a residue affects the position of the helix. Pr(helix) is taken relative to not being a helix (odds ratio). Issue: lots of probability bins have to be constructed: 3 different states * 17 positions * 20 amino acids. The training set is what we observe. Examples: Asp beta-predictor; Pro one position down has a low tendency to be in a helix; N-terminus positive, C-terminus negative, so a higher tendency for positive lysine at the C-terminus. Mini GOR: 2-residue windows, 3 amino acids. Assessment: helices, strands, and coils are not equi-abundant; more helices in proteins than coil. Certain types of measurements can be differentially penalized. Q3 assessment reveals that simple GOR gets 65% but sometimes underpredicts. Overtraining: consider the number in the training set vs. the number in the test set; be careful to report only results on the test set; parameterizing choices on the test set is not good practice. Other methods of predicting secondary structure: nonparametric predictors; semi-parametric; single-sequence and multiple-sequence. GOR semi-parametric: filtering GOR.
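The window-average idea behind TM helix prediction can be sketched as follows. Note the substitution: the lecture refers to the GES scale, but this example uses the Kyte-Doolittle hydropathy values instead, and the window length of 19 and threshold of 1.6 are conventional choices for that scale, not values from the lecture. Function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (swapped in for the GES scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_averages(seq, window=19):
    """Mean hydropathy of every full window; index i = window start."""
    if len(seq) < window:
        return []
    values = [KD[residue] for residue in seq]  # KeyError on unknown residues
    total = sum(values[:window])
    averages = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]  # slide the window by one
        averages.append(total / window)
    return averages


def candidate_helix_starts(seq, window=19, threshold=1.6):
    """Window start positions whose average hydropathy reaches threshold."""
    return [
        i for i, avg in enumerate(window_averages(seq, window))
        if avg >= threshold
    ]


if __name__ == "__main__":
    # Artificial sequence: charged flanks around a 19-residue leucine stretch.
    seq = "D" * 10 + "L" * 19 + "K" * 10
    print(candidate_helix_starts(seq))
```

A run of consecutive window starts above the threshold corresponds to a peak in the hydropathy plot and is a candidate transmembrane helix; it still needs to be checked against other evidence.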
Apr 14 MG	Protein geometry: based on X, Y, Z coordinates. Derivative concepts: distance, surface area, volume, axes, angles. Relates to energy functions and dynamics. Comparing structures: given structures A and B, compare them by performing the following operations. (1) Given an alignment, optimally superimpose A onto B: this involves a rotation and a translation step (6 parameters in total) chosen to minimize the root mean square (RMS) distance between A and B. (2) Find an alignment between A and B based on their 3D coordinates, using a similarity matrix that depends on the 3D coordinates. Threading: an entry Sij in the similarity matrix depends on how well the amino acid at position i in protein 1 fits into the 3D structural environment at position j of protein 2. Structural alignment violates the central idea of dynamic programming because the alignment of residue i affects the previous optimal alignments. This issue can be surmounted with an iterative approach: 1. Compute the similarity matrix. 2. Align via dynamic programming. 3. RMS fit based on the alignment. 4. Move structure B. 5. Re-compute the similarity matrix. 6. If it changed from #1, go to #2.
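The superposition step (rotate and translate A onto B to minimize the RMS distance) can be illustrated in a simplified setting. The sketch below works in 2D, where the optimal rotation angle has a closed form and there are 3 parameters (one angle, two translations) instead of the 6 needed in 3D; real structure superposition in 3D is usually done with an SVD- or quaternion-based fit, which is not shown. The function name is hypothetical, and the points are assumed to be already paired by an alignment.

```python
import math


def superimpose_2d(a, b):
    """Optimally rotate and translate point set a onto b (2D only).

    a and b are equal-length lists of (x, y) pairs, already aligned
    so that a[i] corresponds to b[i]. Returns (angle, rmsd).
    """
    n = len(a)
    # Translation: move both centroids to the origin.
    cax, cay = sum(x for x, _ in a) / n, sum(y for _, y in a) / n
    cbx, cby = sum(x for x, _ in b) / n, sum(y for _, y in b) / n
    a0 = [(x - cax, y - cay) for x, y in a]
    b0 = [(x - cbx, y - cby) for x, y in b]
    # Rotation: the angle maximizing sum of b_i . (R a_i).
    dot = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a0, b0))
    cross = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(a0, b0))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    moved = [(c * x - s * y, s * x + c * y) for x, y in a0]
    squared = sum(
        (mx - bx) ** 2 + (my - by) ** 2
        for (mx, my), (bx, by) in zip(moved, b0)
    )
    return theta, math.sqrt(squared / n)


if __name__ == "__main__":
    a = [(0, 0), (1, 0), (1, 1), (0, 2)]
    # b is a rotated by 90 degrees and shifted by (3, -2).
    b = [(3, -2), (3, -1), (2, -1), (1, -2)]
    print(superimpose_2d(a, b))
```

In the iterative structural alignment outlined above, a fit of this kind would be the "RMS fit based on the alignment" step, repeated after each new dynamic-programming alignment.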
Apr 16 MG	Other aspects of structure, besides just comparing atom positions: atom positions (XYZ triplets); lines, axes, angles; surfaces, volumes. Voronoi volumes: each atom is surrounded by a single convex polyhedron and allocated the space within it. Packing efficiency = V(VDW) / V(Voronoi), so volumes are directly related to packing efficiency. Missing atoms give looser packing. Problem at the protein surface: sometimes there are not enough atoms to create a proper polyhedron. Problem: not all atoms are the same size; solution: type the atoms. Small packing changes are significant. Protein surface: Delaunay triangulation (rough, CS style) is a natural way to define packing neighbors; convex hull; Richards’ molecular and accessible surfaces (smooth, chem style). Packing defines the “correct definition” of the protein surface. Voronoi polyhedra are the “natural” way to study packing. Simulation: electrostatics + basic forces. Potential functions. Maxwell’s equations relate electric fields to magnetic fields. Packing in terms of VDW forces. Potential energy in bond-length springs and torsion angles. Springs and electrical forces contribute to the potential energy. Conventional macromolecular potential functions are simplified to make it easier to create models, e.g. bonds as springs.
Apr 21 MG	Basic protein model. Geometry: coordinates, volume, surface area. Energetics: charges for electrical forces, force constants for springs, potential function. Dynamics: mass and time; F = m dv/dt. Methods to move along the energy surface and find the best (minimal) structure. Steepest descent minimization: follows the gradient of the energy downhill; can get stuck in local minima. Other methods of minimization: conjugate gradient, Newton-Raphson. Molecular dynamics (MD): gives atoms velocities and updates their movement; simulates real protein motion. Ergodic assumption: eventually the trajectory visits every state in phase space. Boltzmann weighting: the model will spend more time in low-energy states than in higher-energy states. Markov chain Monte Carlo (MCMC): moves randomly through states, accepting the next move based on Boltzmann weighting. MD is used more for proteins and MCMC more for liquids. Simulated annealing: begin with large steps through space (high heat) and gradually decrease the size of the steps (cooling down). Simplifying simulations: use a neighbor list for atoms or set a maximum distance for neighbors; divide atoms into types. Procedure: initially associate each atom with a mass and point charge; give each atom an initial velocity; calculate the potential; update positions using MD; if needed, use mirror images for neighboring systems.
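Steepest descent and Boltzmann-weighted Monte Carlo moves can be compared on a toy energy surface. Everything below is an assumption made for illustration: the one-dimensional tilted double-well potential stands in for a real molecular potential function, temperature is in arbitrary units with the Boltzmann constant set to 1, and the function names, step sizes, and seed are hypothetical.

```python
import math
import random


def energy(x):
    """Toy potential: double well, tilted so the left well is lower."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def gradient(x):
    """Analytical derivative of energy(x)."""
    return 4.0 * x * (x * x - 1.0) + 0.3


def steepest_descent(x, step=0.01, iterations=5000):
    """Follow the gradient downhill; ends in the nearest local minimum."""
    for _ in range(iterations):
        x -= step * gradient(x)
    return x


def metropolis(x, temperature, steps, max_move=0.5, seed=0):
    """Random moves accepted with Boltzmann probability exp(-dE/T)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(steps):
        trial = x + rng.uniform(-max_move, max_move)
        delta = energy(trial) - energy(x)
        # Downhill moves are always accepted, uphill moves only sometimes.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            x = trial
        samples.append(x)
    return samples


if __name__ == "__main__":
    right = steepest_descent(1.5)    # trapped in the higher (local) minimum
    left = steepest_descent(-1.5)    # reaches the lower minimum
    print(right, energy(right), left, energy(left))
    cold = metropolis(-1.0, temperature=0.05, steps=20000)
    hot = metropolis(-1.0, temperature=2.0, steps=20000)
    print(sum(map(energy, cold)) / len(cold), sum(map(energy, hot)) / len(hot))
```

Starting from x = 1.5, steepest descent stops in the right-hand well even though the left-hand well is lower, which is the local-minimum problem noted above. The Metropolis sampler can cross the barrier when the temperature is high enough; lowering the temperature gradually over a run would turn it into a simple form of simulated annealing.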
