Yale Gerstein Lab

Developing a similarity measure in biological function space

Haiyuan Yu*, Ronald Jansen*, and Mark Gerstein

Abstract: Many classifications of protein function such as Gene Ontology (GO) are organized in discrete categories within directed acyclic graph (DAG) structures. The computation of a numerical measure of functional similarity between two arbitrary proteins within such DAG structures is an important problem. Here we develop a simple probabilistic measure for this quantity. Our measure is based on counting the number of protein pairs that share exactly the same set of functional annotations in relation to the total number of classified pairs. We show such a measure is associated with a power-law distribution, allowing the quick assessment of the statistical significance of shared functional annotations. We formally compare it with other quantitative functional similarity measures (such as shortest path, lowest common ancestor, and information-theoretic similarity) and provide concrete metrics to assess differences. Finally, we provide a practical implementation for our probabilistic measure for GO and for the MIPS functional catalog and give two applications of it in specific functional genomics contexts.


Supplementary Scripts and Data
  I. Format GO annotations
 1. Download process.ontology, which describes all biological process terms in GO, and gene_association.sgd, which describes the functions of each gene in the yeast genome.
 2. Parse process.ontology using process_ontology_hy.pl
 3. Parse gene_association.sgd using go_process_orf.pl to produce go_process_orf.txt
 4. Run format_input.pl to produce process_graph_hy.txt and process_terms_for_orf_hy.txt
 5. Run process_orf_graph.pl to produce process_orf_graph.txt
 6. MIPS functional annotations are parsed in a similar fashion (though much more easily) to produce mips_orf_graphs.txt and MIPS.txt, which serve as input for the calculations that follow.
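
As a rough illustration of what the parsing steps above involve, the following Python sketch collects the biological process terms for each gene from lines in the tab-separated GO annotation file layout (comment lines start with "!", the GO ID is in column 5, and the aspect code "P" in column 9 marks biological process). It is not the Perl scripts listed here: the function name, the choice of the gene symbol column as key, the skipping of NOT-qualified annotations, and the example identifiers are all assumptions made for illustration, and the actual layout of go_process_orf.txt is not reproduced.

```python
from collections import defaultdict


def parse_process_annotations(lines):
    """Map each gene symbol to the set of biological process GO IDs annotated to it."""
    terms_for_gene = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # malformed line: the aspect column is missing
        symbol, qualifier, go_id, aspect = cols[2], cols[3], cols[4], cols[8]
        if aspect != "P":
            continue  # keep only biological process annotations
        if "NOT" in qualifier.split("|"):
            continue  # assumption: negative annotations are ignored
        terms_for_gene[symbol].add(go_id)
    return dict(terms_for_gene)


# Made-up lines in the annotation file layout (identifiers are hypothetical).
example_lines = [
    "!comment line",
    "SGD\tS0000001\tGENEA\t\tGO:0000001\tREF:1\tIDA\t\tP\tname\tYAL001C\tgene",
    "SGD\tS0000001\tGENEA\t\tGO:0000002\tREF:1\tIDA\t\tF\tname\tYAL001C\tgene",
    "SGD\tS0000002\tGENEB\tNOT\tGO:0000001\tREF:2\tIDA\t\tP\tname\tYAL002W\tgene",
    "SGD\tS0000002\tGENEB\t\tGO:0000003\tREF:2\tIDA\t\tP\tname\tYAL002W\tgene",
]
annotations = parse_process_annotations(example_lines)
```

With the real file, the same function could be called on an open file handle, for example `parse_process_annotations(open("gene_association.sgd"))`.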
  II. Calculate topological distances for GO
 1. Run adj_matrix_matlab.pl to produce go_process_adj_sparse.m
 2. Run go_topo_dist.m in Matlab to produce go_process_dist.txt, which describes the topological distances between all pairs of GO terms involved in biological process.
 3. Run go_process_orf_topo_dist.pl to produce go_process_orf_topo_mindist.txt (describing the minimal topological distance between yeast gene pairs) using go_process_node.txt
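
The two stages above (term-to-term distances, then the minimal distance for each gene pair) can be sketched in Python as follows. This is a minimal sketch and not the Matlab or Perl code listed here: it assumes that "topological distance" means the shortest path length when parent-child edges are treated as undirected, and the function names and the toy graph are hypothetical.

```python
from collections import deque
from itertools import combinations


def all_pairs_distances(edges):
    """Shortest path lengths between all terms, treating edges as undirected (assumption)."""
    neighbors = {}
    for parent, child in edges:
        neighbors.setdefault(parent, set()).add(child)
        neighbors.setdefault(child, set()).add(parent)
    dist = {}
    for source in neighbors:
        seen = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from this source term
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        dist[source] = seen
    return dist


def min_gene_pair_distances(terms_for_gene, dist):
    """For each gene pair, the smallest distance between any two of their terms."""
    result = {}
    for a, b in combinations(sorted(terms_for_gene), 2):
        candidates = [
            dist[s][t]
            for s in terms_for_gene[a]
            for t in terms_for_gene[b]
            if s in dist and t in dist[s]
        ]
        if candidates:
            result[(a, b)] = min(candidates)
    return result


# Toy DAG given as (parent, child) edges; term D has two parents.
toy_edges = [("root", "A"), ("root", "B"), ("A", "C"), ("A", "D"), ("B", "D")]
toy_terms = {"g1": {"C"}, "g2": {"D"}, "g3": {"B", "C"}}
term_dist = all_pairs_distances(toy_edges)
gene_dist = min_gene_pair_distances(toy_terms, term_dist)
```

In this toy example, g1 and g3 share term C, so their minimal distance is 0, while g1 and g2 are separated by the path C, A, D of length 2.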
  III. Calculate topological distances for MIPS
 1. Run mips_dist.m in Matlab using mips_class_adj.m to produce mips_dist.txt, which describes the topological distances between all pairs of MIPS terms.
 2. Run orf_topo_dist.pl to produce orf_topo_mindist.txt (describing the minimal topological distance between yeast gene pairs) using mips_node.txt
  IV. Calculate probabilistic scores for GO
 1. Run all_intersection_graphs_go.pl
 2. Run pair_probabilities_all3_go.pl to produce GO_FS.txt, which describes the number of gene pairs sharing the same LCA (lowest common ancestor) set as a given pair. The probabilistic score is this number divided by the total number of gene pairs in GO. For comparison purposes, one can simply use the count rather than the actual score.
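
To make the counting concrete, here is a small Python sketch of one reading of the score described above. It is not the Perl implementation: it assumes that each gene's annotation is extended to all ancestor terms, that the LCA set of a pair is the set of shared terms with no shared descendant, and that "the same LCA set" means an identical set. All names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def ancestor_closure(terms, parents):
    """The given terms plus all of their ancestors in the DAG."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


def lca_set(shared, parents):
    """Shared terms that are not a proper ancestor of another shared term."""
    non_lowest = set()
    for term in shared:
        non_lowest |= ancestor_closure(parents.get(term, ()), parents)
    return frozenset(shared - non_lowest)


def pair_scores(terms_for_gene, parents):
    """Return {(gene_a, gene_b): (count, count / total_pairs)} for every gene pair."""
    closures = {g: ancestor_closure(ts, parents) for g, ts in terms_for_gene.items()}
    pair_lca = {
        (a, b): lca_set(closures[a] & closures[b], parents)
        for a, b in combinations(sorted(closures), 2)
    }
    counts = Counter(pair_lca.values())  # number of pairs per distinct LCA set
    total = len(pair_lca)
    return {pair: (counts[key], counts[key] / total) for pair, key in pair_lca.items()}


# Toy DAG as a child -> parents map; term D has two parents.
toy_parents = {"A": ["root"], "B": ["root"], "C": ["A"], "D": ["A", "B"]}
toy_genes = {"g1": {"C"}, "g2": {"D"}, "g3": {"C"}, "g4": {"B"}}
scores = pair_scores(toy_genes, toy_parents)
```

In this toy example there are 6 gene pairs. Only g1 and g3 have the LCA set {C}, giving a count of 1 and a score of 1/6, whereas two pairs have the LCA set {root}, giving a count of 2 and a score of 1/3. Under this reading, a smaller value indicates a more specific shared function.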
  V. Calculate probabilistic scores for MIPS
 1. Run all_intersection_graphs_mips.pl
 2. Run pair_probabilities_all3_mips.pl to produce MIPS_FS.txt, which describes the number of gene pairs sharing the same LCA (lowest common ancestor) set as a given pair. The probabilistic score is this number divided by the total number of gene pairs in MIPS. For comparison purposes, one can simply use the count rather than the actual score.
  VI. Semantic scores for GO
 1. GOS_BP_Semantic.txt, courtesy of Dr. Azuaje and his colleagues (Wang H, et al., 2004, Gene Expression Correlation and Gene Ontology-Based Similarity: An Assessment of Quantitative Relationships. In Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pp 25-31).

*These two authors contributed equally to this work.

Last modified on Sept. 25th, 2006