1. Summary
This package contains the text data (Reuters-21578) for multi-instance multi-label learning, which has been used in:
ATTN: You may use the package (for academic purposes only) at your own risk. An acknowledgment or citation of the above paper is required. For other purposes, please contact Prof. Zhi-Hua Zhou (zhouzh@nju.edu.cn).
Download: datafile (141Kb)
2. Details
The text data is derived from the widely studied Reuters-21578 collection [1]. The seven most frequent categories are considered. After removing documents whose label sets or main texts are empty, 8,866 documents are retained, of which only 3.37% are associated with more than one class label. After randomly removing documents with only one label, a text categorization data set containing 2,000 documents is obtained. Around 15% of the documents in the resultant data set have multiple labels, and the average number of labels per document is 1.15 ± 0.37.
Each document is represented as a bag of instances using the sliding window technique [2], where each instance corresponds to a text segment enclosed in one sliding window of size 50 (overlapping by 25 words). "Function words" on the SMART stop-list [3] are removed from the vocabulary and the remaining words are stemmed. Instances in the bags adopt the "bag-of-words" representation based on term frequency [1]. Without loss of effectiveness, dimensionality reduction is performed by retaining the top 2% of words with the highest document frequency [4]. Thereafter, each instance is represented as a 243-dimensional feature vector.
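The sliding-window segmentation above can be sketched as follows. This is an illustrative example only, not the authors' code; in particular, the assumption that the window advances by 25 words (window size minus overlap) and the handling of the final, possibly shorter segment are guesses about the original pipeline.

```python
def split_into_instances(words, window=50, overlap=25):
    """Split a word list into overlapping segments, one per instance.

    Each segment holds up to `window` words; consecutive segments
    overlap by `overlap` words (i.e. the window advances by
    window - overlap positions). The last segment may be shorter.
    """
    step = window - overlap
    instances = []
    for start in range(0, len(words), step):
        instances.append(words[start:start + window])
        if start + window >= len(words):
            break
    return instances

# A toy 120-word document: yields windows starting at 0, 25, 50, 75.
doc = ["w%d" % i for i in range(120)]
bag = split_into_instances(doc)
print(len(bag))      # 4 instances in the bag
print(len(bag[0]))   # first window holds 50 words
```

Each resulting segment would then be converted to a term-frequency vector over the reduced 243-word vocabulary to form one instance of the bag.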
Specific characteristics of the resultant text data:

Number of examples: 2,000
Number of classes: 7
Number of features: 243
Instances per bag:
  min: 2
  max: 26
  mean ± std. deviation: 3.56 ± 2.71
Labels per example (k):
  k=1: 1,701
  k=2: 290
  k=3: 9
After reading the data into the MATLAB environment, the i-th text document (in bag representation) is stored in bags{i,1}, with its associated labels in target(:,i). For illustration purposes, suppose target(:,i)' equals [1 -1 -1 1 -1 -1 1]; this means that the i-th text document belongs to the 1st, 4th, and 7th classes but not to the remaining classes.
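The ±1 label encoding described above can be decoded as in the following sketch. The data itself ships as MATLAB cell arrays; the example below only illustrates the encoding convention, and the function name is made up for illustration.

```python
def decode_labels(target_column):
    """Return the 1-based indices of the classes marked +1.

    `target_column` is one column of the `target` matrix described
    above: a length-7 vector with +1 for classes the document
    belongs to and -1 for the rest.
    """
    return [idx + 1 for idx, v in enumerate(target_column) if v == 1]

# The example column from the text: classes 1, 4, and 7 are positive.
example = [1, -1, -1, 1, -1, -1, 1]
print(decode_labels(example))  # → [1, 4, 7]
```

In MATLAB itself, the equivalent one-liner would be find(target(:,i) == 1).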
[1] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1): 1-47, 2002.
[2] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In: Advances in Neural Information Processing Systems, pp. 561-568. MIT Press, Cambridge, MA, 2003.
[3] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA, 1989.
[4] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In: Proceedings of the 14th International Conference on Machine Learning, pp. 412-420, Nashville, TN, 1997.