## Developing a similarity measure in biological function space

Haiyuan Yu*, Ronald Jansen*, and Mark Gerstein

Supplementary Scripts and Data

I. Format GO annotations

1. Download process.ontology, which describes all biological process terms in GO, and gene_association.sgd, which describes the functions of each gene in the yeast genome.

2. Parse process.ontology using process_ontology_hy.pl

3. Parse gene_association.sgd using go_process_orf.pl to produce go_process_orf.txt

4. Run format_input.pl to produce process_graph_hy.txt and process_terms_for_orf_hy.txt

5. Run process_orf_graph.pl to produce process_orf_graph.txt

6. MIPS functional annotations are parsed in a similar (though much simpler) fashion to produce mips_orf_graphs.txt and MIPS.txt as input for the following calculations
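The association-parsing step can be illustrated with a minimal Python sketch. It is not the authors' go_process_orf.pl. It assumes the tab-separated GO annotation file layout, in which comment lines start with `!`, the GO ID is the fifth column and the aspect (`P` for biological process) is the ninth. Which column carries the ORF name is an assumption here (the third column by default), and the function name, gene names and GO IDs below are hypothetical.

```python
from collections import defaultdict

def process_terms_by_gene(lines, gene_col=2, go_col=4, aspect_col=8):
    """Map each gene identifier to the set of its biological-process GO IDs.

    Column indices are 0-based; gene_col is an assumption and may need to
    point at a different column to recover ORF names.
    """
    terms = defaultdict(set)
    for line in lines:
        # Skip blank lines and header/comment lines, which start with "!".
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= aspect_col:
            continue  # malformed or truncated row
        if fields[aspect_col] == "P":
            terms[fields[gene_col]].add(fields[go_col])
    return dict(terms)

# Hypothetical rows, for illustration only.
sample = [
    "!comment line\n",
    "SGD\tS000000001\tGENE1\t\tGO:0000001\tREF\tIDA\t\tP\n",
    "SGD\tS000000001\tGENE1\t\tGO:0000002\tREF\tIDA\t\tF\n",
    "SGD\tS000000002\tGENE2\t\tGO:0000001\tREF\tIEA\t\tP\n",
]
gene_terms = process_terms_by_gene(sample)
```

Only the rows with aspect `P` are kept, so the molecular-function annotation of the hypothetical GENE1 is dropped.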

II. Calculate topological distances for GO

1. Run adj_matrix_matlab.pl to produce go_process_adj_sparse.m

2. Run go_topo_dist.m in Matlab to produce go_process_dist.txt, which describes the topological distances between all pairs of GO biological process terms.

3. Run go_process_orf_topo_dist.pl, using go_process_node.txt, to produce go_process_orf_topo_mindist.txt, which describes the minimal topological distance between yeast gene pairs.
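The two stages above (term-to-term distances, then the minimum over the terms of each gene pair) can be sketched in Python as follows. This is an illustration under stated assumptions, not a port of go_topo_dist.m or go_process_orf_topo_dist.pl: it assumes that topological distance means the shortest path length with edges treated as undirected, and it uses breadth-first search rather than a sparse adjacency matrix. The toy graph and all names are hypothetical.

```python
from collections import deque
from itertools import product

def term_distances(edges):
    """All-pairs shortest path lengths over an undirected term graph (BFS)."""
    neighbours = {}
    for parent, child in edges:
        neighbours.setdefault(parent, set()).add(child)
        neighbours.setdefault(child, set()).add(parent)
    dist = {}
    for source in neighbours:
        seen = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        dist[source] = seen
    return dist

def min_gene_distance(terms_a, terms_b, dist):
    """Smallest term-to-term distance over all term pairs of two genes."""
    candidates = [dist[a][b] for a, b in product(terms_a, terms_b)
                  if a in dist and b in dist[a]]
    return min(candidates) if candidates else None

# Hypothetical toy hierarchy: root -> A, B; A -> A1, A2; B -> B1.
edges = [("root", "A"), ("root", "B"), ("A", "A1"), ("A", "A2"), ("B", "B1")]
dist = term_distances(edges)
closest = min_gene_distance({"A1"}, {"A2", "B1"}, dist)
```

In this toy hierarchy A1 and A2 are two edges apart and A1 and B1 are four edges apart, so the gene pair gets the smaller value. The same sketch applies unchanged to a MIPS term graph.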

III. Calculate topological distances for MIPS

1. Run mips_dist.m in Matlab, using mips_class_adj.m, to produce mips_dist.txt, which describes the topological distances between all pairs of MIPS terms.

2. Run orf_topo_dist.pl, using mips_node.txt, to produce orf_topo_mindist.txt, which describes the minimal topological distance between yeast gene pairs.

IV. Calculate probabilistic scores for GO

1. Run all_intersection_graphs_go.pl

2. Run pair_probabilities_all3_go.pl to produce GO_FS.txt, which gives, for each gene pair, the number of gene pairs sharing the same LCA set as that pair. Dividing this number by the total number of gene pairs in GO gives the probabilistic score. For comparison purposes, the raw count can be used in place of the actual score.
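The count-then-normalize step can be illustrated with a small Python sketch. It is a literal reading of the description above, not the logic of pair_probabilities_all3_go.pl, which may count pairs differently. It reads LCA as lowest common ancestor and takes the LCA set of a gene pair to be the common ancestors of their terms that have no other common ancestor beneath them. The toy hierarchy, gene names and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def ancestors(term, parents):
    """Return the term itself plus every term reachable via parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def lca_set(terms_a, terms_b, parents):
    """Lowest common ancestors of two genes' term sets, as a frozenset."""
    anc_a = set().union(*(ancestors(t, parents) for t in terms_a))
    anc_b = set().union(*(ancestors(t, parents) for t in terms_b))
    common = anc_a & anc_b
    # Keep a common ancestor only if no other common ancestor lies beneath it.
    return frozenset(
        c for c in common
        if not any(d != c and c in ancestors(d, parents) for d in common)
    )

def pair_scores(gene_terms, parents):
    """For each gene pair: (pairs sharing its LCA set, that count / all pairs)."""
    pairs = list(combinations(sorted(gene_terms), 2))
    lcas = {p: lca_set(gene_terms[p[0]], gene_terms[p[1]], parents) for p in pairs}
    counts = Counter(lcas.values())
    total = len(pairs)
    return {p: (counts[lcas[p]], counts[lcas[p]] / total) for p in pairs}

# Hypothetical toy hierarchy (child -> parents) and annotations.
parents = {"A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A"], "B1": ["B"]}
gene_terms = {"g1": {"A1"}, "g2": {"A1"}, "g3": {"A2"}, "g4": {"B1"}}
scores = pair_scores(gene_terms, parents)
```

In this toy data, three of the six gene pairs meet only at the root, so each of them gets a count of 3 and a score of 0.5, while the single pair annotated to the same specific term gets a count of 1. A smaller value therefore indicates a more specific shared function, which is why the raw count ranks pairs the same way as the normalized score.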

V. Calculate probabilistic scores for MIPS

1. Run all_intersection_graphs_mips.pl

2. Run pair_probabilities_all3_mips.pl to produce MIPS_FS.txt, which gives, for each gene pair, the number of gene pairs sharing the same LCA set as that pair. Dividing this number by the total number of gene pairs in MIPS gives the probabilistic score. For comparison purposes, the raw count can be used in place of the actual score.

VI. Semantic scores for GO

1. GOS_BP_Semantic.txt, courtesy of Dr. Azuaje and his colleagues (Wang H, et al., 2004, Gene Expression Correlation and Gene Ontology-Based Similarity: An Assessment of Quantitative Relationships. In Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pp. 25-31).

*These two authors contributed equally to this work

Last modified on Sept. 25th, 2006