2. Details
Complete proteomes of seven real-world organisms covering the biological three-domain system[1] are considered: two bacterial genomes (Geobacter sulfurreducens, Azotobacter vinelandii), two archaeal genomes (Haloarcula marismortui, Pyrococcus furiosus), and three eukaryotic genomes (Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster). For each organism, the complete proteome with manually annotated function was downloaded from the Universal Protein Resource (UniProt) databank[2] (release as of April 2013) by querying the terms {"organism name" AND "reviewed: yes" AND "keyword: Complete proteome"}.
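As a minimal sketch, the seven query strings can be assembled as follows. The function name `build_query` is hypothetical, and the query text simply mirrors the notation quoted above; it is not guaranteed to match the search syntax accepted by the UniProt website.

```python
# The seven organisms named in the text.
ORGANISMS = [
    "Geobacter sulfurreducens",
    "Azotobacter vinelandii",
    "Haloarcula marismortui",
    "Pyrococcus furiosus",
    "Saccharomyces cerevisiae",
    "Caenorhabditis elegans",
    "Drosophila melanogaster",
]


def build_query(organism):
    """Combine the three query terms for one organism (hypothetical helper)."""
    return f'"{organism}" AND "reviewed: yes" AND "keyword: Complete proteome"'


# One query string per organism.
queries = {name: build_query(name) for name in ORGANISMS}
```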
Redundancy among the protein sequences of each organism is removed by clustering with the blastclust program in the BLAST package[3] from NCBI at a sequence identity threshold of 90%, and a non-redundant dataset is created for each organism by retaining only the longest sequence in each cluster[4]. Each non-redundant dataset is then uploaded as a text file to the NCBI Batch CD-Search server[5] to identify conserved domains. Each domain is represented by a 216-dimensional frequency vector in which each element denotes the frequency of one triad type[6]. Protein function can be described in multiple ways; the best known and most widely used is the Gene Ontology (GO)[7], which provides ontologies for three aspects: molecular function, biological process, and cellular component. We focus on the molecular function aspect. The manually annotated GO molecular function terms of each protein are obtained from the downloaded UniProt-format text file. The same strategy as in [8] is then adopted to prepare the label vector of each protein based on the hierarchical directed acyclic graph (DAG) of GO molecular function, with the latest version (December 2006) of the GO function ontology serving as the basis for the functional terms and their relations.
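The two sequence-level steps above (keeping the longest member of each cluster, then encoding a domain as a 216-dimensional triad vector) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the six-class amino acid grouping is hypothetical (the text only implies six classes, since 6^3 = 216), frequencies are assumed to be normalized counts of overlapping triads, and the function names are invented for this example.

```python
from itertools import product

# Hypothetical 6-class grouping of the 20 amino acids, for illustration only.
# The source does not state the grouping; six classes give 6**3 = 216 triads.
ASSUMED_CLASSES = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]
CLASS_OF = {aa: idx for idx, group in enumerate(ASSUMED_CLASSES) for aa in group}
TRIAD_INDEX = {t: i for i, t in enumerate(product(range(6), repeat=3))}


def longest_per_cluster(clusters, sequences):
    """Keep the identifier of the longest sequence in each cluster."""
    return [max(cluster, key=lambda sid: len(sequences[sid])) for cluster in clusters]


def triad_frequency_vector(domain_seq):
    """Return a 216-dimensional vector of relative triad frequencies."""
    counts = [0] * len(TRIAD_INDEX)
    total = 0
    for i in range(len(domain_seq) - 2):
        window = domain_seq[i:i + 3]
        if any(aa not in CLASS_OF for aa in window):
            continue  # skip triads containing non-standard residues
        triad = tuple(CLASS_OF[aa] for aa in window)
        counts[TRIAD_INDEX[triad]] += 1
        total += 1
    # Normalize counts to frequencies; an all-zero vector if no valid triad exists.
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Toy data: clusters are lists of sequence identifiers, one list per cluster.
sequences = {"P1": "MKVLAAGIV", "P2": "MKVLA", "P3": "GDEKRST"}
clusters = [["P1", "P2"], ["P3"]]
kept = longest_per_cluster(clusters, sequences)
vector = triad_frequency_vector(sequences[kept[0]])
```

In practice the clusters would be read from the clustering output and the domain sequences from the conserved domain search results; both are replaced by toy values here.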
In the MIML learning framework, each protein is represented as a bag of instances, where each instance corresponds to a domain, and is labeled with a set of GO molecular function terms (multiple labels). The datasets, i.e., the complete proteomes of the seven organisms, are summarized in Table 1. For example, the Geobacter sulfurreducens dataset contains 379 proteins (examples) with a total of 320 Gene Ontology terms (label classes) on molecular function. On average, it has 3.20±1.21 instances (domains) per bag (protein) and 3.14±3.33 labels (GO terms) per example (protein).
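To make the bag representation concrete, the sketch below stores one protein as a bag of domain vectors with a label set and computes the kind of characteristics reported in Table 1. The class `ProteinBag`, the function `summarize`, and the toy identifiers are hypothetical. The sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class ProteinBag:
    """One MIML example: a bag of domain instances plus a set of GO labels."""
    protein_id: str
    instances: list  # one feature vector per conserved domain
    labels: set      # GO molecular function terms annotated to the protein


def summarize(bags):
    """Compute Table 1 style characteristics (needs at least two bags)."""
    inst_counts = [len(b.instances) for b in bags]
    label_counts = [len(b.labels) for b in bags]
    classes = set().union(*(b.labels for b in bags))
    return {
        "examples": len(bags),
        "classes": len(classes),
        # stdev is the sample standard deviation (an assumption, see lead-in).
        "instances_per_bag": (mean(inst_counts), stdev(inst_counts)),
        "labels_per_example": (mean(label_counts), stdev(label_counts)),
    }


# Toy bags with placeholder label names and all-zero 216-dimensional instances.
bags = [
    ProteinBag("toy1", [[0.0] * 216 for _ in range(2)], {"term_a", "term_b"}),
    ProteinBag("toy2", [[0.0] * 216 for _ in range(4)], {"term_b"}),
]
stats = summarize(bags)
```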
Table 1. Characteristics of the data sets
| Domain | Genome | Examples | Classes | Instances per bag (mean ± std.) | Labels per example (mean ± std.) |
|---|---|---|---|---|---|
| Bacteria | Geobacter sulfurreducens | 379 | 320 | 3.20±1.21 | 3.14±3.33 |
| Bacteria | Azotobacter vinelandii | 407 | 340 | 3.07±1.16 | 4.00±6.97 |
| Archaea | Haloarcula marismortui | 304 | 234 | 3.13±1.09 | 3.25±3.02 |
| Archaea | Pyrococcus furiosus | 425 | 321 | 3.10±1.09 | 4.48±6.33 |
| Eukaryota | Saccharomyces cerevisiae | 3509 | 1566 | 1.86±1.36 | 5.89±11.52 |
| Eukaryota | Caenorhabditis elegans | 2512 | 940 | 3.39±4.20 | 6.07±11.25 |
| Eukaryota | Drosophila melanogaster | 2605 | 1035 | 3.51±3.49 | 6.02±10.24 |
[1] C. R. Woese, O. Kandler, and M. L. Wheelis, "Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya," Proc Natl Acad Sci U S A, vol. 87, pp. 4576-9, Jun 1990.
[2] R. Apweiler, A. Bairoch, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M. J. Martin, D. A. Natale, C. O'Donovan, N. Redaschi, and L. S. Yeh, "UniProt: the Universal Protein knowledgebase," Nucleic Acids Res, vol. 32, pp. D115-9, Jan 1 2004.
[3] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," J Mol Biol, vol. 215, pp. 403-10, Oct 5 1990.
[4] J. Wu, H. Liu, X. Duan, Y. Ding, H. Wu, Y. Bai, and X. Sun, "Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature," Bioinformatics, vol. 25, pp. 30-5, Jan 1 2009.
[5] A. Marchler-Bauer, S. Lu, J. B. Anderson, F. Chitsaz, M. K. Derbyshire, C. DeWeese-Scott, J. H. Fong, L. Y. Geer, R. C. Geer, and N. R. Gonzales, "CDD: a Conserved Domain Database for the functional annotation of proteins," Nucleic Acids Res, vol. 39, pp. D225-9, 2011.
[6] J. Wu, D. Hu, X. Xu, Y. Ding, S. Yan, and X. Sun, "A novel method for quantitatively predicting non-covalent interactions from protein and nucleic acid sequence," J Mol Graph Model, vol. 31, pp. 28-34, Nov 2011.
[7] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock, "Gene ontology: tool for the unification of biology. The Gene Ontology Consortium," Nat Genet, vol. 25, pp. 25-9, May 2000.
[8] O. S. Sarac, V. Atalay, and R. Cetin-Atalay, "GOPred: GO molecular function prediction by combined classifiers," PLoS One, vol. 5, p. e12382, 2010.