1. Summary
This package contains the text data for multi-instance
learning, which has been used in:
ATTN: You are free to use this package (for academic purposes only)
at your own risk. For other purposes, please contact
Prof. Zhi-Hua Zhou (zhouzh@nju.edu.cn).
Download:
[datafile] (1.49Mb)
2. Details
The twenty text categorization data sets were derived from the
20 Newsgroups corpus, which is widely used in text categorization.
For each of the 20 news categories, fifty positive and fifty negative
bags were generated. In each positive bag, 3% of the posts are randomly
drawn from the target category, and the remaining instances (as well as
all instances in negative bags) are drawn uniformly at random from the
other categories. Each instance is a post represented by its top 200
TFIDF features.
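The bag construction described above can be sketched in Python as follows. This is only an illustration, not the script used to build the package: the function name make_bag, its parameters, the 1/0 label encoding, sampling without replacement, pooling all non-target posts into one list, and forcing at least one target post into every positive bag are all assumptions made for the example.

```python
import random


def make_bag(target_posts, other_posts, bag_size, positive,
             target_fraction=0.03, rng=None):
    """Return (instances, instance_labels, bag_label) for one bag."""
    rng = rng or random.Random()
    # Assumption: a positive bag always holds at least one target post,
    # because 3% of a small bag would otherwise round down to zero.
    n_target = max(1, round(target_fraction * bag_size)) if positive else 0
    n_other = bag_size - n_target
    # Assumption: posts are sampled without replacement.
    chosen = [(post, 1) for post in rng.sample(target_posts, n_target)]
    chosen += [(post, 0) for post in rng.sample(other_posts, n_other)]
    rng.shuffle(chosen)  # hide the position of the target posts
    instances = [post for post, _ in chosen]
    instance_labels = [label for _, label in chosen]
    bag_label = 1 if positive else 0
    return instances, instance_labels, bag_label
```

Calling make_bag(target, other, 40, True) would give a positive bag of 40 posts, and passing False as the last positional argument gives a negative bag drawn only from other_posts.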
The resulting text data has the following characteristics:
Number of examples: 2,000
Number of classes: 20
Number of features: 200
Instances per bag:
  min: 8
  max: 84
  mean±std.: 40.07±15.27
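Summary figures of this kind can be recomputed from a list of bag sizes. The sketch below is a minimal example under stated assumptions: the helper name bag_size_summary is hypothetical, the sample sizes are made up rather than taken from the package, and the population standard deviation is used because the text does not say which estimator produced the reported value.

```python
import statistics


def bag_size_summary(bag_sizes):
    """Return (min, max, mean, std) of the number of instances per bag."""
    mean = statistics.fmean(bag_sizes)
    # Assumption: population standard deviation; use statistics.stdev
    # instead if the sample estimate is wanted.
    std = statistics.pstdev(bag_sizes)
    return min(bag_sizes), max(bag_sizes), mean, std


# Hypothetical bag sizes, not taken from the package.
sizes = [8, 25, 40, 55, 84]
lo, hi, mean, std = bag_size_summary(sizes)
print(f"min: {lo}")
print(f"max: {hi}")
print(f"mean±std.: {mean:.2f}±{std:.2f}")
```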
After the data is loaded into the MATLAB environment, the i-th example
(in bag representation) is stored in bags{i,1}, its associated label in
bags{i,2}, and the labels of its instances in bags{i,3}.
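For readers who work outside MATLAB, the three-column layout can be mirrored with a small record type. The sketch below is illustrative only and does not read the package file: the names Bag and unpack_bags, the toy data, the 1/0 label encoding, and the convention of one feature vector per instance are assumptions. Note that MATLAB indices start at 1 while Python indices start at 0, so bags{i,1} corresponds to the first element of row i-1 here.

```python
from typing import NamedTuple, Sequence


class Bag(NamedTuple):
    instances: Sequence[Sequence[float]]  # one feature vector per post
    label: int                            # bag-level label
    instance_labels: Sequence[int]        # one label per instance


def unpack_bags(cell_rows):
    """Convert rows shaped like (bags{i,1}, bags{i,2}, bags{i,3}) to Bag records."""
    bags = []
    for i, (instances, label, instance_labels) in enumerate(cell_rows):
        # Each instance should have exactly one instance-level label.
        if len(instances) != len(instance_labels):
            raise ValueError(
                f"bag {i}: {len(instances)} instances but "
                f"{len(instance_labels)} instance labels"
            )
        bags.append(Bag(instances, label, instance_labels))
    return bags


# Toy stand-in for the loaded cell array: 2 bags with 3 features each
# (the real data has 200 features per instance).
toy = [
    ([[0.1, 0.0, 0.3], [0.0, 0.2, 0.0]], 1, [1, 0]),
    ([[0.0, 0.5, 0.0]], 0, [0]),
]
for bag in unpack_bags(toy):
    print(len(bag.instances), bag.label, list(bag.instance_labels))
```

Keeping the length check in unpack_bags makes a mismatch between instances and instance labels fail early instead of surfacing later during training.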
The package also includes detailed results of the miGraph approach
(see our ICML'09 paper) on this data.
ATTN2: This package was developed by Ms. Yu-Yin Sun (sunyy@lamda.nju.edu.cn).
For any problems concerning the package, please feel free to contact Ms. Sun.