CBB752/CPSC452/CPSC752/MBB452/MBB752/MCDB452/MCDB752

http://www.gersteinlab.org/courses/452/

Lecture Summaries

Genomics Section

Jan 14

DS

  • Overview of syllabus and topics covered in the course
  • Genomics is the study of all genes in a genome as a whole
  • Genomes vary greatly in terms of GC content, size, and gene density.
    • Nanoarchaeum equitans has less than half a million bases and 552 genes
    • Zea mays (corn) has about 5 billion bases and 60,000 genes
  • Life can be divided into bacteria, archaea, and eucarya
    • Some eukaryotes lost mitochondria at some point during their evolution
    • Many extremophiles are archaea
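The size and gene-count figures above can be turned into a rough gene density. The sketch below is only an illustration: the 490,000 bp value is an assumption standing in for "less than half a million bases", and the function names are made up for this example.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def gene_density(n_genes, genome_size_bp):
    """Genes per kilobase of genome."""
    return n_genes / (genome_size_bp / 1000)


# Figures from the notes; 490,000 bp is an assumed stand-in for
# "less than half a million bases".
n_equitans = gene_density(552, 490_000)       # a bit more than 1 gene per kb
z_mays = gene_density(60_000, 5_000_000_000)  # about 0.012 genes per kb
```

On these numbers the archaeon packs roughly a hundred times more genes per kilobase than corn.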

Jan 16

MG

What is Bioinformatics?

  • Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale.

Molecular Biology as an Information Science

  • Sequence data
  • Structure data
  • Integrated data

Large-scale Information

  • Computer power
  • Data volume
  • Publication

Organizing

General Types of Bioinformatics

  • Databases
  • Text String Comparison
  • Finding Patterns
  • Geometry
  • Physical Simulation

Topics in Bioinformatics

  • Genome Sequence
  • Protein Sequence
  • Sequence and Structure
  • Databases
  • Data Mining
  • Functional Genomics
  • Simulation

Applications in Bioinformatics

  • Drug Design
  • Finding Homologs
  • Genome Characterization

Jan 23

DS

Diversity of life

  • No one can define how many organisms there are.
  • Life is varied and organisms are very self-sufficient
  • Live in different temperatures, pH levels, and different pressures
  • Many organisms are hard to maintain outside their natural environment, so genomics is used to study their DNA sequence directly
  • Understand the properties of proteins and RNA: how do they fold, and how stable are they?
  • Hyperthermophiles - the start of life?
  • Crystallography is frequently done with hyperthermophilic proteins to study structure

How do organisms biochemically differ?

  • Membrane phospholipids are either ester-linked or ether-linked
  • Ribosomal size, cell walls, nuclear membrane

Proteins and nucleic acids

  • Should understand proteins and be able to pull out information encoded from the DNA
  • Adenine, Guanine, Cytosine, Uracil, and Thymine
  • Modified nucleosides are found in RNA with known structure and more are being discovered
  • DNA is chemically stable, whereas RNA is labile
  • Torsion angles can affect stability; for example, the sugar pucker of the furanose ring affects the distance between the phosphates
  • DNA is anti-parallel and forms the major and minor groove in the helix

Molecular biology tools

  • Restriction enzymes protect the host from foreign DNA; two types:
    • Type I – unspecific
    • Type II – specific
  • amino acids make up proteins
  • 20 common amino acids
  • Two additional amino acids may be co-translationally inserted (selenocysteine and pyrrolysine)
  • Additional amino acids may be created by post-translational modifications, estimates are around 300 to 400
  • Don’t restrict thinking when working with the genome

Jan 28

DS

  • Two main secondary structural units found in proteins
    • α-helix
    • β-sheet
  • Halophilic enzymes are irreversibly inactivated by low salt concentrations
  • Features of thermostable enzymes
    • Additional salt-bridges
    • Shorter loop regions
    • Additional aromatic interactions
    • Stabilizing substitutions in α-helices
    • Reduced surface area
  • The longer a DNA sequence the higher the chance that it is found only once in a genome
  • Electrophoresis is used to separate molecules of different sizes
  • Sanger sequencing
    • Randomly terminating single strands with dideoxynucleotides
    • Manual read on gel
    • Automatic reading with 4 different fluorescent dyes
  • Pyrosequencing: light is given off each time a nucleotide is incorporated into the growing strand
  • Genome sequencing approaches
    • Hierarchical
    • Shotgun
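The point about sequence length and uniqueness can be made concrete with a back-of-the-envelope estimate. The sketch below assumes a random genome with equal base frequencies and counts one strand only; real genomes are not random, so treat the result as a lower bound. The function names are illustrative.

```python
def expected_occurrences(k, genome_size):
    """Expected count of one specific k-mer in a random genome.

    Assumes equal base frequencies and a single strand:
    (G - k + 1) positions, each matching with probability 1 / 4**k.
    """
    return (genome_size - k + 1) / 4 ** k


def min_unique_length(genome_size):
    """Smallest k with fewer than one expected chance occurrence."""
    k = 1
    while expected_occurrences(k, genome_size) >= 1:
        k += 1
    return k
```

Under these assumptions, a 3-billion-base genome needs k = 16 before a given sequence is expected fewer than once by chance, while a 490,000-base genome needs only k = 10.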

Jan 30

DS

  • Two major types of sequencing – Sequential and Parallel
    • Sanger Method (capillary electrophoresis) – ABI 3730
      • Sequence resolved at end of reaction (not in real-time)
      • terminate synthesis with fluorescently-labelled dideoxynucleotides
      • Requires cloning and library preparation from template, which is a time-consuming step.
    • Pyrosequencing – 454
      • 400,000 reads per run, 250 bp per read, 100 Mb/run, 7 hours
      • no cloning
      • Massively parallel. Each fragment/bead/well is being sequenced at the same time.
    • Sequencing by synthesis – Solexa/Illumina
      • 1 Gb of data per run, 35 bp/read, 4 days/run
      • Also massively parallel
    • Sequencing by Ligation (ABI's SOLiD)
      • 40 million reads/run, 1-2 Gb/run/slide (4 days), 35 bp/read
      • Emulsion PCR and beads
      • Two bases queried at a time
      • Extremely accurate sequence
      • Massively parallel
      • Good for SNP detection and verification at this stage.
  • General Sequencing Notes
    • New methods expected to play a role in the 1,000 genomes project and in the field of personal genomics.
    • Ethical concerns of genomic data
    • Issues in Sequencing
      • Completeness/Coverage - How deep is the coverage of the region being sequenced?
      • Accuracy - What is the error rate?
      • Validity of the assembly - Regions can be correctly sequenced but placed in the incorrect order
    • Annotation Challenges
      • Determining orthologs/paralogs
      • Biochemical/Genetic characterization
      • Sequencing errors can and do lead to incorrect annotations
      • Computational task
  • cDNA – The DNA created by reverse-transcribing mRNA. Essentially it is DNA without introns.
  • ESTs (expressed sequence tags) – partial DNA sequences read from the 5' or 3' end of a transcript; different clones from the same gene can represent different exons.
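The per-run figures and the coverage question above are linked by simple arithmetic: yield = reads × read length, and mean coverage = yield / genome size. The sketch below also includes the standard Poisson estimate of the fraction of bases left unsequenced, which assumes reads land uniformly at random. The 5 Mb genome in the usage note is a hypothetical example.

```python
import math


def run_yield_bp(reads, read_length):
    """Total bases produced by one run."""
    return reads * read_length


def mean_coverage(total_bases, genome_size):
    """Average number of reads covering each base."""
    return total_bases / genome_size


def fraction_uncovered(coverage):
    """Poisson estimate of the fraction of bases hit by zero reads.

    Assumes reads are placed uniformly at random: P(0 reads) = exp(-c).
    """
    return math.exp(-coverage)


# 454 figures from the notes: 400,000 reads of 250 bp each.
yield_454 = run_yield_bp(400_000, 250)  # 100,000,000 bp = 100 Mb
```

For a hypothetical 5 Mb bacterial genome, one such run would give 20-fold mean coverage.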

Feb 4

DS

  • The ultimate goal of structural genomics is to solve the structure of all available proteins using high-throughput methods.
  • Protein Structure Initiative: Federal, university, and industry effort that seeks to reduce the cost and the time it takes to determine the three-dimensional structure of proteins.
  • X-ray crystallography
    • Newer methods such as MAD (multiwavelength anomalous diffraction) do not require the soaking of crystals in a heavy-atom containing solution. Instead the targeted protein is expressed in a methionine auxotroph that is cultured in Seleno-methionine rich media, which contains Selenium atoms.
    • Recently, crystallization formulations for efficient screening of protein crystallization condition have been developed
    • + High resolution
    • + Structure determination of large molecules possible
    • - Protein must crystallize (a problem for membrane proteins and large protein assemblies)
  • NMR
    • + Study the dynamics of molecules
    • - Lower resolution
    • - Restricted to small molecules
    • - Requires large amounts of proteins
  • From genomes to high-resolution protein structures:
    • Genomes
    • Target selection
    • Gene cloning and expression
    • Protein production and purification
    • Crystal production
    • Structure determination and refinement
    • Model validation and fold/function analysis
    • Dissemination of data (PDB deposit)
  • PSI Assessment
    • Fold space may be near complete coverage
    • Sequence space is still growing linearly
    • Many solved structures have no associated biological function (structure of many apoproteins that have no bound substrate)

Feb 6

DS

  • Trends in Biotech:
    • Genetic analysis: SNP, Mutation, Sequencing
    • Expression Analysis: mRNA, Protein
    • Interaction Analysis: Protein-Protein, Antigen-Antibody, Enzyme-Substrate, Protein-DNA, Ligand-Receptor
  • Transcription is the process that copies DNA into mRNA; it is influenced by promoter activity, repression/attenuation, induction, and DNA methylation/chromatin remodeling
  • Translation mediates mRNA conversion to protein; it is influenced by mRNA lifetime, codon usage and tRNA levels, Ribosome binding, Alternative splicing, and RNAi
  • Metabolomics studies how proteins affect metabolites and functional interactions
  • Transcriptomics: analysis of when, where, and how much each gene is expressed
  • Genomic analysis of a cell/organism involves stimulating the cell, changing the cell state and critical determination and examination of crucial/causing events (nodes in the tree of causation)
  • For Affymetrix DNA microarrays: need sequence; have to make cDNA; isolate mRNA from any organism
  • Annealing is concentration-dependent; abundant genes anneal to each other
  • Affymetrix GeneChip Technology: RNA->RT->cDNA->Labelled Transcripts->Labelled Fragments

Feb 11

DS

Proteomics

  • Can be broken down into sub-fields of
    • Post-translational modifications
    • Protein-protein interactions
    • Structural proteomics
    • Functional proteomics
    • Proteome Mining
    • Protein Expression Profiling
  • Proteins can be identified using
    • Electrophoretic separation
    • Protein fingerprinting with mass spectrometry
  • Differential expression can be detected using
    • ICAT labeling
    • 15N labeling
  • Fluxome measures the movement of small molecules through cellular pathways

Chemical Genomics

  • The systematic use of small molecules to explore biology
  • Aims to understand the full range of protein classes and what types of small molecules interact with each
  • For example, what proteins and pathways does a given drug affect

Metagenomics

  • Genomic analysis of environmental sequences
  • For example, analyzing all the DNA in a cup of soil from the Bass courtyard

Feb 13

MS

  • Gene Inactivation
    • Pro: One of the best ways to learn function of gene
    • Con: Often can result in no phenotype, either because of gene redundancy, subtle effect, or inappropriate assay conditions. Difficult to ascertain primary vs. secondary effects.
  • 3 methods of gene inactivation:
    • 1. Insertional mutagenesis
      • Transposon knockout made by uncoupling transposase activity from natural transposons and inserting gene marker.
        • Can use on genome wide scale to deduce “minimal genome”
        • Primary assumption is that insertion of the transposon will eliminate full gene function; also problematic reaching full saturation
      • Gene traps: create gene fusion proteins using a marker lacking an ATG start
    • 2. Systematic K.O. using selectable markers
      • Can use “barcodes” (20 base sequences) to uniquely identify genes and put onto a microarray to follow the phenotype of many strains at once
      • Relies on annotated sequence but gives clear null phenotypes; expensive but comprehensive
    • Both methods 1 & 2 have been used successfully to ascertain drug targets as well as gene function
    • 3. RNAi
      • dsRNAs can be processed to 21nt ss siRNAs that target complementary mRNAs for degradation/prevent translation
      • Methods of introduction:
        • direct transfection of siRNA
        • insertion of shRNAs using retroviral constructs
        • in the case of C. elegans, dsRNAs can be synthesized by cloning genes into E. coli plasmids and fed to worms.
      • Pros: applicable to many organisms since RNAi is well conserved, cheap and easy to use comprehensively, can knockdown gene family expression
      • Cons: only knockdown, non-specific effects, some genes not affected

Feb 18

MS

Protein-protein interaction screens have many biological applications including the construction of complete pairwise interaction maps and regulatory networks, the characterization of large complexes, the identification of novel functions for biological molecules, and molecular characterization of disease states.


Yeast two-hybrid

  • Pairwise Screening
    • Bipartite transcription factors. Protein A fused to DNA binding domain, protein B fused to activation domain. Interaction between A and B initiates the transcription of selectable marker.
    • Transform yeast by homologous recombination
    • Haploid a and α mating types carrying constructs mated to systematically test 6200 x 6200 interactions.
  • Large scale screening
    • 64 bait pools containing 96 constructs each, systematically combined with 64 prey pools of 96 constructs each
    • Use selective media to screen for matings that transcribe the marker gene
  • Advantages
    • In vivo assay
    • Simple to design experiments
    • No need to purify proteins, everything done by cloned constructs
  • Disadvantages
    • Fusion proteins interacting in the nucleus is an artificial system
    • Large number of false positives and not much overlap between results of different screens
    • Hard to test on large scale. Pooled screen does not test all interactions equally

Affinity Tagging/Mass Spectrometry

  • TAP- Tandem Affinity Purification Tagging
    • Gene of interest is fused to CAM-binding domain, TEV and Protein A
    • Complexes containing protein of interest are purified by binding of Protein A to IgG beads.
    • Complexes eluted by TEV proteolysis
    • Complexes purified again by binding to CAM beads
    • Eluted with EGTA
  • Identification of Proteins by Mass Spec
    • Complex loaded on SDS-PAGE denaturing gel
    • Excise bands, trypsin digest and run fragment on Mass Spec to identify protein.
  • Advantages
    • Assay performed in vivo under native conditions
    • Identify whole complex not just pairwise interactions
  • Disadvantages
    • Identifies indirect interactions within a complex
    • False positives and negatives by missing rare components or inclusion of contaminants

Protein Chip

  • Antibody Microarray
    • Antibodies to thousands of proteins spotted onto a chip bind to specific epitopes on applied proteins with very high affinity
    • Disadvantage- antibodies for all proteins are not available
  • Functional Protein Microarray
    • Can assay for protein-protein interactions as well as small molecule and enzyme-substrate, and kinase assays
    • Use an inducible promoter in yeast to drive expression of a large quantity of each protein and print onto proteome chips
    • Proteins, lipids, nucleic acids, or small molecules applied to the chip
    • Molecule of interest is then probed with fluorescent marker to identify which protein spots it is bound to
  • Advantages
    • High-throughput parallel screening
    • Uses small amounts of proteins and other molecules
    • Many different biochemical applications
  • Disadvantages
    • In vitro assay, must be validated
    • Requires a large high quality expression library

Feb 20

Chris Noren

NE Biolabs

Combinatorial Biology

  • Systematic exploration of variations of a heterobiopolymer
  • Example found naturally: immune system diversity
    • Immune system diversity results from combinatorial recombination of VDJ (variable, diversity, joining) segments
    • ~10^7 potential variable regions
  • Other examples from past studies
    • Protein engineering: combinatorial mutagenesis followed by genetic selection
      • Converting penicillin-binding protein (PBP) into a β-lactamase
    • In vitro evolution of combinatorial biopolymer libraries
      • SELEX: systematic evolution of ligands by exponential enrichment
      • The pool is enriched over about 8-15 rounds of selection and amplification, sometimes more
      • Useful for mapping sequence requirements for protein DNA/RNA interactions, discovery of novel ligands for small molecules, surfaces, or proteins, and observing the evolution of novel ribozymes
    • In vitro selection of a ribozyme with polynucleotide kinase activity
    • Coupling catalytic activity to replication
      • Natural selection in a test tube: amplification more rapid from round to round
  • Phage Display Technology
    • Diversity in old methods limited to <10^6 because of spatial addressing
    • New method: Filamentous phage display of combinatorial peptide libraries
      • Libraries of high complexity (>10^9 clones) can be prepared
      • Ligands are selected by iterating through sequential rounds of panning/amplification
    • Applications
      • Peptide Libraries: vaccine development, enzyme inhibitors... basically any study requiring large amounts of different peptides
      • Protein Libraries: creating antibody libraries, screening library of mutants of a single protein to select for altered specificity
    • 3 Peptide libraries: Ph.D-7, -12, -C7C
    • Much faster: epitope mapping of anti-β-endorphin MAb, result was yielded in 6 days, as opposed to weeks of "traditional" epitope mapping

Combinatorial Chemistry

  • Semisynthetic phage display libraries
    • Phage display can easily identify "hits" from complex libraries, but peptides have limited functional diversity (20 "canonical" amino acids)
    • Combinatorial chemistry can incorporate extremely diverse functionalities, but more difficult to identify hits
    • Combine the previous two and use chemically modified phage display
      • Expand functional diversity of displayed molecules by performing reactions on the peptides prior to each round of panning
      • Depends on a unique reactive site for modification
  • Selenocysteine (Sec)
    • Se is 99% deprotonated at neutral pH
    • Incorporated into proteins in all three kingdoms of life; a sort of 21st amino acid
    • Prokaryotic Sec incorporation by translational recoding (context-dependent opal suppression)
      • UGA as a stop codon and Sec codon
    • Phage-displayed selenopeptide library
      • Application: selection of optimal chimeric ligands
    • Application: direct mechanical manipulation of M13 phage
      • Hold streptavidin-coated microsphere in optical trap
      • Wormlike Chain model for polymer stretching
        • Equation to compute contour length and persistence length
      • DNA was found to be very flexible, while the phage is much stiffer than dsDNA of comparable length
    • Application: in vivo selection system for Sec insertion requirements
  • Differential chemical reactivity of Cys and Sec in displayed peptides assayed by biotinylation
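The notes do not say which wormlike-chain equation was used in the lecture. As an assumption, the sketch below uses the widely cited Marko-Siggia interpolation formula, which relates the stretching force to the extension through the contour length L and persistence length P: F = (kT/P) [1/(4(1 - x/L)^2) - 1/4 + x/L]. The numerical values in the usage note are illustrative, not data from the lecture.

```python
K_B = 1.380649e-23  # Boltzmann constant in J/K


def wlc_force(extension, contour_length, persistence_length, temperature=298.0):
    """Marko-Siggia wormlike-chain force in newtons.

    All lengths are in metres. This interpolation formula is an assumed
    stand-in for the equation mentioned in the notes.
    """
    x = extension / contour_length
    if not 0 <= x < 1:
        raise ValueError("extension must be in [0, contour_length)")
    thermal = K_B * temperature / persistence_length
    return thermal * (1 / (4 * (1 - x) ** 2) - 0.25 + x)
```

At the same relative extension, a stiffer polymer (larger persistence length) needs less force, because less entropy has to be removed to straighten it. Fitting measured force-extension curves to this function is one way to estimate both L and P.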

Feb 27

Patrick O’Donoghue

  • Phylogenomics
    • Infers phylogeny from data taken from whole genomes.
    • Important for understanding the tree of life, evolution and properties of pathogens, human evolution...
  • Background
    • Charles Darwin proposed and sketched a theoretical "tree of life" in Origin of Species (1859)—evolution proceeds through diversification, so all life can be traced back to a single source.
    • Early ideas not based on data: three kingdoms (1866), five kingdoms (1968), etc.
    • Carl Woese et al., 1977: three domains: Eukaryota, Bacteria, Archaea.
      • Based on molecular data.
      • Ribosomal RNA (rRNA) is universal among cellular organisms and highly conserved.
    • Variations: unrooted tree (1993); "net of life" (2005).
  • Molecular phylogenetics provides data from comparisons of gene and protein sequences to help deduce the evolutionary relationships between organisms.
    • Supports the rRNA tree of life but complicates and enriches it.
    • Paralogs show a gene history that extends back beyond LUCA.
    • Non-rRNA-based phylogenies.
    • Every gene has its own history. Different genes evolve at different rates.
  • Proteins can be compared by sequence (identity or similarity) and structure. Structural homologies can tell us more about the earliest reaches of the history of life.
    • Examples of aminoacyl-tRNA synthetases (aaRSs).
  • Relationships between branches of the phylogenetic tree show common ancestries but also differences in the rate of evolution of different genes.
  • Methods of building trees:
    • Clustering algorithms.
    • Optimality methods: parsimony, maximum likelihood, heuristic search.
  • Protein domains and superfamilies contribute to phylogenetic understanding: presence/absence and evolutionary distance.
  • Genome conservation
    • Weights the average sequence similarity by number of homologs between genomes.
    • Reinforces canonical rRNA tree.
  • Horizontal Gene Transfer (HGT) complicates tree of life—the "net of life."
  • Pathogen evolution—tracks changes in virulence and geographical spread.
  • Human evolution: phylogenomics may help us better understand how and why we differ from macaques and chimpanzees. Different genes have evolved at different rates; we are closely related, but some genes may have changed rapidly.
  • Horizontal Gene Transfer
    • Evidence that genes are transferred between bacteria and archaea.
    • Something similar in eukaryotes? — Endogenous retroviruses (ERVs) and transposons in humans.
  • Phylogenomics
    • Provides a sensitive measure of phylogenetic relationships, and different rates of gene evolution.
    • Represents explicitly both vertical and horizontal gene transfer.

Mar 5

MG

  • Discussion topic: What is Bioinformatics, what are things you need to know and what are some peripheral topics?
  • Concentrating on sequences and comparisons
  • Basic steps for alignments
    • make a dot plot
    • compute the matrix of sums
    • trace back through the sums to find alignment – there can be alternate tracebacks, which will give different alignments
    • Gaps – want to be able to account for possible gaps in an alignment, so introduce a gap penalty
      • Gap = a + bN
      • a = opening a gap
      • b = extending a gap
      • N= length of the gap
  • Should be able to align 4-mers
  • Dynamic Programming – the idea is to solve each subproblem once and build larger solutions from the stored subproblem solutions, so the same problem never has to be solved again
  • Similarity (substitution) matrix – can be more tricky
    • Can be used to align things like protein structures and alignments
    • Two common matrices are PAM and BLOSUM
    • BLOSUM62 is the default substitution matrix for protein BLAST
  • Two different types of alignments
    • Needleman-Wunsch (global alignment) – best alignment for the entirety of both sequences
    • Smith-Waterman (local alignment) – best alignment of segments without regard to the rest of the sequences
      • Use negative scores for a mismatch
      • Have the min score in the matrix be zero
      • Can find best score anywhere in the matrix, not just the last column or row
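The matrix-of-sums and traceback steps, and the global/local distinction, can be sketched in one small function. This is a minimal illustration under stated assumptions: it uses a simple match/mismatch score instead of PAM or BLOSUM, and a linear gap cost per position (the affine form a + bN with a = 0) to keep the recursion short. The parameter values are arbitrary.

```python
def gap_penalty(length, opening, extension):
    """Affine gap penalty from the notes: a + b*N."""
    return opening + extension * length


def align(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return (score, aligned_a, aligned_b).

    local=False gives a Needleman-Wunsch style global alignment,
    local=True a Smith-Waterman style local alignment.
    """
    def sub(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global: leading gaps are penalised along the borders.
        for i in range(1, n + 1):
            S[i][0] = i * gap
        for j in range(1, m + 1):
            S[0][j] = j * gap
    best, end = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = max(S[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                        S[i - 1][j] + gap,
                        S[i][j - 1] + gap)
            if local:
                score = max(score, 0)  # local: minimum score is zero
                if score > best:
                    best, end = score, (i, j)
            S[i][j] = score
    # Global traceback starts in the corner, local at the best cell.
    i, j = end if local else (n, m)
    final = S[i][j]
    top, bottom = [], []
    while (i > 0 or j > 0) and not (local and S[i][j] == 0):
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return final, "".join(reversed(top)), "".join(reversed(bottom))
```

The traceback prefers the diagonal when several moves tie, which is one of the alternate tracebacks mentioned above; a different tie-breaking order can yield a different alignment with the same score.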

Mar 24

MG

  • multiple sequence alignments
    • attempting to use dynamic programming is too computationally complex
    • instead use progressive multiple alignments
      • do a pairwise alignment for the most closely related sequences
      • replace this pair of sequences with a single sequence that represents their alignment
      • repeat until all sequences are aligned
    • results in a tree of sequences
    • not necessarily optimal, but should be close
  • motifs
    • are regular expression representations of sequences
    • can represent more complex ideas than simple sequence alignments
  • Hidden Markov models
    • comprised of match states, each corresponding to a column in a multiple alignment
    • each state has probabilities for emitting symbols (corresponding to amino acids or nucleotides) and probabilities to transition to other states (including itself)
    • gaps and insertions can easily be modeled
    • Common Algorithms:
      • Forward: Summing over the probabilities of all paths that emit a given sequence.
      • Viterbi: Finds the most probable path through the model for a given sequence.
  • Scores
    • formula for score: S = ΣS(i,j) – nG
    • scores are most meaningful when put in relation to other scores, such as P values
    • scores typically follow extreme value distributions
  • ROC curve (Receiver operating characteristic curve)
    • shows sensitivity vs. specificity

Mar 26

Kei Cheung

The Human Genome Project ushered in a new era in biology, shifting the focus from wet bench biology to computational biology.

Three different “versions” of the web:

  • Web 1.0 – The original WWW
    • Pages designed for a single person/group
    • Read-only
    • Static
    • Document-centric, html-based
    • Human-readable
  • Web 2.0
    • More collaborative (social networking)
    • Added multimedia content
      • Podcasts, videos, etc.
    • Pages now interactive/dynamic
      • Web services
    • Ability to create “data mashups” from distinct sources
      • Yahoo Pipes is one such method to create mashups
      • Examples: Cancer vs. water pollution map overlay, Avian flu outbreak map using Google Earth, etc.
      • Can mix different data types to form one cohesive analysis
    • Community-based
    • XML
  • Web 3.0
    • Semantic web
      • Common framework
      • RDF
      • URI and graph structure
      • Data represented by triples (subject, predicate, object)
      • Focus on machine-readable metadata
    • Ontologies
      • Representation of concepts and their relationships
    • Topic map
      • ISO standard
      • A different triple (topics, associations, and occurrences)
      • Organizes information in a way that can be optimized for navigation
      • XTM and TMQL as the key languages
    • Benefits to Web 3.0
      • Human and machine readable data
      • Social and semantic network
      • Syntactic and semantic mashup
      • Platform for data/tool sharing and integration, scientific collaboration, and e-learning.
    • More use cases needed to convince scientists that it is worth the extra startup effort for these types of projects
  • Problems with Web 1.0 and 2.0
    • Lack of annotation
    • Lack of links
    • Lack of link semantics
    • Lack of data semantics
    • Lack of standards
    • People don’t adhere to existing standards
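The triple representation behind the semantic web can be illustrated without any RDF library. In the sketch below the identifiers (ex:geneA, ex:encodes, and so on) are hypothetical placeholders, not terms from a real ontology, and the query function is a toy stand-in for a real query language.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]


# Hypothetical identifiers, for illustration only.
triples = [
    ("ex:geneA", "ex:encodes", "ex:proteinA"),
    ("ex:proteinA", "ex:interactsWith", "ex:proteinB"),
    ("ex:geneB", "ex:encodes", "ex:proteinB"),
]

encoding = match(triples, predicate="ex:encodes")
```

Because every statement has the same three-part shape, data from distinct sources can be merged by concatenating lists of triples, which is what makes a graph of linked statements convenient for mashups.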

Mar 31

MG

  • Definition of TN,TP,FN,FP
    • TP = Predicted positive and actually positive
    • TN = Predicted negative and actually negative
    • FP = Predicted positive but actually negative
    • FN = Predicted negative but actually positive
    • AUC (area under the curve, under the ROC curve)
    • Specificity = TN/(TN+FP)
    • Sensitivity = TP/(TP+FN)
    • Score becomes less significant with increase in database size
    • Simpler sequence regions have smaller significance
  • Computational complexity
    • NW alignment: O(n^3) time and O(n^2) space
  • FASTA makes hash table of short words
  • BLAST extends hash word hits without any gaps while the extension is favorable
  • BLAST 2 joins words like FASTA
  • PSI BLAST
    • Do one blast search
    • Get results
    • Use to build profile
    • Search with profile as query
    • Iterate
  • Sequence to Structure
    • Secondary structure prediction
    • Predicting transmembrane helices

Apr 2

MG

TM Prediction

  • Window average of the GES hydrophobicity scale employed to predict TM helices
  • Peak --> helix? Verify by transforming graph
  • Probability that a residue has a secondary structure:
    • Scale based on db freq vs. exp freq: log(odds) transformation
    • Freq of particular residue/general freq= desired freq
    • Find gold standard in training set: A in DB/ A in total
  • Cell cytoplasm is more negative than outside of cell
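The window-average idea can be sketched as follows. One substitution is made plainly: the lecture used the GES scale, but the values below are the Kyte-Doolittle hydropathy scale, swapped in here as an assumption, where higher means more hydrophobic. The 19-residue window is likewise an assumed, commonly used choice for spanning a membrane.

```python
# Kyte-Doolittle hydropathy values (swapped in for GES; higher = more hydrophobic).
KD = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def window_average(seq, scale, window=19):
    """Average scale value in a sliding window centred on each residue.

    Only full windows are scored, so the result has
    len(seq) - window + 1 entries (none if the sequence is too short).
    """
    half = window // 2
    return [sum(scale[r] for r in seq[i - half:i + half + 1]) / window
            for i in range(half, len(seq) - half)]


# Synthetic test sequence: a poly-Leu stretch flanked by Lys.
profile = window_average("K" * 10 + "L" * 19 + "K" * 10, KD)
```

A peak in the profile marks a candidate helix; here the maximum falls on the window that contains only leucines.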

GOR (Parametrical Statistical Prediction)

  • Classic method of predicting secondary structure
  • Early method; adding up a bunch of log-odds
  • Measures how much information a particular residue provides about the secondary structure at a particular position
  • Independent events; frequency of residue affects position of helix
  • Pr(helix) relative to not being helix (odds ratio)
    • Issue: going to have to construct lots of probability bins
    • 3 different states * 17 positions * 20 amino acids
    • Training sets is what we observe
    • Asp beta-predictor
    • Pro one down --> low tendency to be in helix
    • N-terminal positive, C-terminal negative --> higher tendency of positive Lysine at the C-terminal
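The "adding up a bunch of log-odds" step can be sketched as below. The log-odds of a residue is the log of its frequency in a state (for example, helix) divided by its overall frequency, and a window is scored by summing the per-position values under the independence assumption. The table layout (one dictionary per window position) and all numbers in the usage note are hypothetical.

```python
import math


def log_odds(count_in_state, total_in_state, count_overall, total_overall):
    """ln of (frequency of a residue in a state / its overall frequency)."""
    observed = count_in_state / total_in_state
    expected = count_overall / total_overall
    return math.log(observed / expected)


def window_score(window, table):
    """Sum per-position log-odds over a window of residues.

    table[k][residue] is the log-odds for seeing that residue at window
    position k; summing assumes the positions are independent.
    """
    return sum(table[k][res] for k, res in enumerate(window))
```

For instance, a residue making up 20% of helix positions but only 10% of all positions gets a log-odds of ln 2, about 0.69, in favour of helix.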

Mini GOR:

  • 2 residue windows, 3 amino acids

Assessment:

  • Helices, strands and coils are not equi-abundant; more helices in proteins than coil
  • Certain types of measurements can be differentially penalized
  • Q3 assessment reveals that simple GOR gets 65% but sometimes under predicts...
  • Over training: number in training sets vs. number in test set... be careful to report only what’s in the test set... Parameterizing choice on test set is not a good practice
  • Nonparametric Predictors; Semi-parametric; single sequence and multiple sequence are other methods of predicting secondary structure
  • GOR Semi-Parametric: filtering GOR
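The Q3 measure is the fraction of residues whose three-state label (H for helix, E for strand, C for coil) is predicted correctly. A minimal sketch, with made-up label strings in the usage note:

```python
def q3(predicted, observed):
    """Fraction of positions where the predicted state equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("label strings must have the same length")
    if not observed:
        raise ValueError("empty label strings")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)
```

For example, predicting HHHEECCC against an observed HHHHECCC gets 7 of 8 positions right, a Q3 of 0.875. Because the three states are not equally abundant, a constant prediction sets a baseline that depends on the class mix of the test set; comparing against it guards against reading too much into the raw number.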

Apr 14

MG

Protein Geometry:

  • Based on X,Y,Z coordinates
  • Derivative concepts:
    • Distance
    • Surface area
    • Volume
    • Axes
    • Angle
  • Relates to energy functions and dynamics

Comparing Structures:

Given a structure A and B compare their structure by performing the following operations:

  • Given an alignment, optimally superimpose A onto B
    • Involves a rotation and a translation step
    • Minimize the root mean square (RMS) distance between A and B (consisting of 6 parameters)
  • Find an alignment between A and B based on their 3D coordinates
    • Use a similarity matrix which depends on the 3D coordinates
    • Threading: An entry in the similarity matrix Sij depends on how well the amino acid at position i in protein 1 fits into the 3D structural environment at position j of protein 2
    • Structural alignment violates the central idea of dynamic programming because the calculation of the alignment of residue i affects the previous optimal alignments
    • This issue can be surmounted by using an iterative approach as outlined below:
      1. Compute similarity matrix
      2. Align via dynamic programming
      3. RMS fit based on alignment
      4. Move Structure B
      5. Re-compute similarity
      6. If changed from #1, GOTO #2
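The RMS fit in step 3 has a translation part and a rotation part. The sketch below covers only the translation: it moves both structures to a common centroid and computes the RMS distance over aligned atom pairs. The optimal rotation is deliberately left out to keep the example free of linear-algebra dependencies, so this is not a full superposition.

```python
import math


def centroid(coords):
    """Mean position of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(c[k] for c in coords) / n for k in range(3))


def center(coords):
    """Translate coordinates so that their centroid sits at the origin."""
    cx, cy, cz = centroid(coords)
    return [(x - cx, y - cy, z - cz) for x, y, z in coords]


def rmsd(a, b):
    """Root mean square distance between paired atoms of a and b."""
    if len(a) != len(b):
        raise ValueError("structures must have the same number of atoms")
    total = sum((p[k] - q[k]) ** 2 for p, q in zip(a, b) for k in range(3))
    return math.sqrt(total / len(a))


# Structure b is structure a shifted by (5, 5, 5).
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
b = [(x + 5.0, y + 5.0, z + 5.0) for x, y, z in a]
```

Before centering, the RMS distance between a and b is the length of the shift, sqrt(75); after centering both, it drops to zero because the two differ only by a translation.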

Apr 16

MG

  • Other Aspects of Structure, Besides just Comparing Atom Positions
    • Atom Position, XYZ triplets
    • Lines, Axes, Angles
    • Surfaces, Volumes
  • Voronoi Volumes
    • Each atom surrounded by a single convex polyhedron and allocated space within it
    • Packing efficiency: V(VDW) / V(Voronoi)
      • Volumes are directly related to packing efficiency
      • Missing atoms give looser packing
    • Problem of protein surface: sometimes not enough atoms to create proper polyhedron
    • Problem: not all atoms same size
      • Solution: type the atoms
    • Small packing changes significant
  • Protein Surface
    • Delaunay triangulation (rough, CS style)
      • Natural way to define packing neighbors
      • Convex hull
    • Richards’ molecular and accessible surfaces (smooth, chem style)
    • Packing defines the “correct definition” of the protein surface
      • Voronoi polyhedra are the “natural” way to study packing
  • Simulation
    • Electrostatics + basic forces
    • Potential functions
    • Maxwell’s equations
      • Relating electric fields with magnetic fields
    • Packing in terms of VDW forces
    • Potential energy in bond length springs
      • Torsion angles
    • Springs and electrical forces contribute to potential energy
    • Conventional macromolecular potential functions are simplified in order to make it easier to create models
      • Ex: bonds as springs

Apr 21

MG

Basic Protein Model

  • Geometry
    • Coordinates
    • Volume
    • Surface area
  • Energetics
    • Charges for electrical forces
    • Force constants for springs
    • Potential Function
  • Dynamics
    • Mass and time
    • Dynamics F = m dv/dt

Methods to move along the energy surface and find best (minimal) structure

  • Steepest descent minimization
    • Can get stuck in local minima
    • Follows gradient of energy downhill
  • Other methods of minimization
    • Conjugate gradient
    • Newton-Raphson
  • Molecular Dynamics
    • Gives atoms velocities and updates their movement
    • Simulates real protein motion
  • Ergodic Assumption
    • Eventually trajectory visits every state in phase space
  • Boltzmann weighting
    • The model will spend more time in low energy states than in higher energy states
  • Markov Chain Monte Carlo
    • Moves randomly through states accepting next moves based on Boltzmann weighting
  • MD is more used for proteins and MCMC is more used for liquids
  • Simulated annealing
    • Begin with large steps through space (high heat)
    • Gradually decrease size of steps (cooling down)

Simplifying simulations

  • Neighbor list for atoms or max distance for neighbors set
  • Divide atoms into types
  • Initially
    • Associate each atom with mass and point charge
    • Give each atom initial velocity
  • Calculate potential
  • Update positions using MD
  • If needed, use mirror (periodic) images for neighboring systems
