Data for Genome-Wide Protein Function Prediction through Multi-instance Multi-label Learning

1. Summary

The data set covers seven real-world organisms spanning the biological three-domain system, i.e., archaea, bacteria, and eukaryotes. Each organism comes with two files: an instance file and a label file.

- The "instances.txt" file describes the proteins used to train the MIML classifiers. Each line that begins with ">" carries the unique 6-character protein identifier used by the UniProt database. The remaining lines describe the domain instances, one instance per line.
- The "labels.txt" file gives the GO labels used to train the MIML classifiers.
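
As a rough illustration of the "instances.txt" layout described above, the following minimal sketch groups instance lines under the protein ID of the preceding ">" line. The function name `parse_instances`, the sample IDs, and the sample values are hypothetical. The sketch also assumes that the ID is the first token after ">" and that the values within an instance line are separated by whitespace or commas; neither detail is stated above, so check both against the actual files.

```python
from typing import Dict, Iterable, List


def parse_instances(lines: Iterable[str]) -> Dict[str, List[List[float]]]:
    """Map each protein ID to its bag of instances (one feature vector per line)."""
    bags: Dict[str, List[List[float]]] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumption: the protein ID is the first token after ">".
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            bags[current_id] = []
        elif current_id is not None:
            # Assumption: values are separated by whitespace and/or commas.
            values = [float(tok) for tok in line.replace(",", " ").split()]
            bags[current_id].append(values)
    return bags


# Made-up sample in the assumed layout (placeholder IDs, shortened vectors).
sample = [">AAAAA1", "0.1 0.0 0.2", "0.0 0.3 0.1", ">BBBBB2", "0.5,0.5,0.0"]
bags = parse_instances(sample)
print({pid: len(bag) for pid, bag in bags.items()})
```

With a real file, the same function could be called as `parse_instances(open(path))`, where `path` points to one organism's instance file.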

The data set has been used in:

  • Jian-Sheng Wu, Sheng-Jun Huang, Zhi-Hua Zhou. Genome-Wide Protein Function Prediction through Multi-instance Multi-label Learning. In: IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), DOI 10.1109/TCBB.2014.2323058, 2014.


ATTN:   You are free to use the package (for academic purposes only) at your own risk. An acknowledgement or citation of the above paper is required. For other purposes, please contact Prof. Zhi-Hua Zhou (

    Download: datafile (3.6Mb)

2. Details

The complete proteomes of seven real-world organisms covering the biological three-domain system[1] are considered: two bacterial genomes (Geobacter sulfurreducens, Azotobacter vinelandii), two archaeal genomes (Haloarcula marismortui, Pyrococcus furiosus) and three eukaryotic genomes (Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster). For each organism, the complete proteome with manually annotated functions was downloaded from the Universal Protein Resource (UniProt) databank[2] (released by April 2013) by querying the terms {"organism name" AND "reviewed: yes" AND "keyword: Complete proteome"}.

Redundancy among the protein sequences of each organism is removed by clustering with the blastclust executable in the BLAST package[3] from NCBI, using a sequence identity threshold of 90%. A non-redundant dataset is then created for each organism by retaining only the longest sequence in each cluster[4]. Next, each non-redundant dataset is uploaded as a txt file to the Batch CD-Search server[5] of NCBI to obtain the conserved domains. Each domain is represented by a 216-dimensional frequency vector in which each element denotes the frequency of one triad type[6].

Protein function can be described in multiple ways; the best known and most widely used is the Gene Ontology[7], which provides ontologies for three aspects: molecular function, biological process and cellular component. We focus on the molecular function aspect. The manually annotated GO molecular function terms of a protein are taken from the downloaded UniProt format text file. Then, the same strategy as in [8] is adopted to prepare the label vector of a protein based on the hierarchical directed acyclic graph (DAG) of GO molecular function, and the latest version (December 2006) of the GO function ontology is used as the basis for the functional terms and their relations.
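
To make the 216-dimensional triad representation more concrete, here is a minimal sketch under stated assumptions. Since 216 = 6^3, it assumes that the 20 amino acids are grouped into 6 classes and that a triad is a run of three consecutive residues described by their classes. The grouping in `GROUPS` is a hypothetical placeholder chosen only for illustration (the grouping actually used is defined in [6]), and normalizing the counts so that they sum to 1 is also an assumption.

```python
from itertools import product

# HYPOTHETICAL grouping of the 20 amino acids into 6 classes, for illustration
# only; the grouping actually used for this data set is defined in reference [6].
GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]
CLASS_OF = {aa: idx for idx, group in enumerate(GROUPS) for aa in group}

# One index per ordered triple of classes: 6 * 6 * 6 = 216 triad types.
TRIAD_INDEX = {triad: i for i, triad in enumerate(product(range(6), repeat=3))}


def triad_frequencies(sequence):
    """Return a 216-element list with the relative frequency of each triad type."""
    classes = [CLASS_OF.get(aa) for aa in sequence.upper()]
    counts = [0] * len(TRIAD_INDEX)
    total = 0
    for i in range(len(classes) - 2):
        triad = tuple(classes[i:i + 3])
        if None in triad:
            continue  # skip windows containing a non-standard residue
        counts[TRIAD_INDEX[triad]] += 1
        total += 1
    if total == 0:
        return [0.0] * len(counts)
    # Assumption: "frequency" means the count divided by the number of triads.
    return [c / total for c in counts]


vec = triad_frequencies("AKDAKD")  # toy sequence, not taken from the data set
print(len(vec), sum(vec))
```

Each conserved domain reported by Batch CD-Search would be passed through such a function to obtain one instance of the protein's bag.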

In the MIML learning framework, each protein is represented as a bag of instances, where each instance corresponds to a domain, and the bag is labeled with a group of GO molecular function terms (multiple labels). Detailed descriptions of the datasets, i.e., the complete proteomes of the seven real-world organisms, are summarized in Table 1. For example, the Geobacter sulfurreducens dataset contains 379 proteins (examples) with a total of 320 gene ontology terms (label classes) on molecular function (Table 1). Its average number of instances (domains) per bag (protein) is 3.20±1.21, and its average number of labels (GO terms) per example (protein) is 3.14±3.33 (Table 1).
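
The bag-of-instances view and the per-dataset statistics quoted above can be sketched as follows. The protein IDs, feature values and label names are made-up placeholders, the function name `summarize` is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

# Toy MIML data: protein ID -> (bag of domain instances, set of GO term labels).
# All IDs, feature values and label names are placeholders.
dataset = {
    "PROT_A": ([[0.1, 0.2], [0.3, 0.4]], {"TERM_1", "TERM_2", "TERM_3"}),
    "PROT_B": ([[0.5, 0.6], [0.7, 0.8], [0.9, 1.0], [0.2, 0.1]], {"TERM_1"}),
    "PROT_C": ([[0.4, 0.4], [0.6, 0.6], [0.8, 0.8]], {"TERM_2", "TERM_3"}),
}


def summarize(data):
    """Compute the number of examples and classes plus mean and std of bag and label sizes."""
    bag_sizes = [len(bag) for bag, _ in data.values()]
    label_counts = [len(labels) for _, labels in data.values()]
    classes = set().union(*(labels for _, labels in data.values()))
    return {
        "examples": len(data),
        "classes": len(classes),
        # Assumption: sample standard deviation (needs at least two examples).
        "instances_per_bag": (mean(bag_sizes), stdev(bag_sizes)),
        "labels_per_example": (mean(label_counts), stdev(label_counts)),
    }


stats = summarize(dataset)
print(stats)
```

Applied to a parsed organism, the four entries correspond to the number of examples, the number of label classes, and the two "Mean ± std." quantities reported for each data set.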

Table 1.   Characteristics of the data sets

Genome                      Instances per bag (Mean ± std.)   Labels per example (Mean ± std.)
Geobacter sulfurreducens    3.20 ± 1.21                       3.14 ± 3.33
Azotobacter vinelandii
Haloarcula marismortui
Pyrococcus furiosus
Saccharomyces cerevisiae
Caenorhabditis elegans
Drosophila melanogaster

[1] C. R. Woese, O. Kandler, and M. L. Wheelis, "Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya," Proc Natl Acad Sci U S A, vol. 87, pp. 4576-9, Jun 1990.

[2] R. Apweiler, A. Bairoch, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M. J. Martin, D. A. Natale, C. O'Donovan, N. Redaschi, and L. S. Yeh, "UniProt: the Universal Protein knowledgebase," Nucleic Acids Res, vol. 32, pp. D115-9, Jan 1 2004.

[3] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," J Mol Biol, vol. 215, pp. 403-10, Oct 5 1990.

[4] J. Wu, H. Liu, X. Duan, Y. Ding, H. Wu, Y. Bai, and X. Sun, "Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature," Bioinformatics, vol. 25, pp. 30-5, Jan 1 2009.

[5] A. Marchler-Bauer, S. Lu, J. B. Anderson, F. Chitsaz, M. K. Derbyshire, C. DeWeese-Scott, J. H. Fong, L. Y. Geer, R. C. Geer, and N. R. Gonzales, "CDD: a Conserved Domain Database for the functional annotation of proteins," Nucleic Acids Res, vol. 39, pp. D225-9, 2011.

[6] J. Wu, D. Hu, X. Xu, Y. Ding, S. Yan, and X. Sun, "A novel method for quantitatively predicting non-covalent interactions from protein and nucleic acid sequence," J Mol Graph Model, vol. 31, pp. 28-34, Nov 2011.

[7] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock, "Gene ontology: tool for the unification of biology. The Gene Ontology Consortium," Nat Genet, vol. 25, pp. 25-9, May 2000.

[8] O. S. Sarac, V. Atalay, and R. Cetin-Atalay, "GOPred: GO molecular function prediction by combined classifiers," PLoS One, vol. 5, p. e12382, 2010.


PoweredBy © LAMDA, 2022