1. Summary
This package contains two parts:
- The "original" part contains 113 web index pages and their links. Since every web index page has many links, this part is quite big: about 126Mb (30.9Mb after compression).
- The "processed" part contains 9 data sets for multi-instance learning. This part is small: about 5.67Mb (1.36Mb after compression).
The web index pages are mainly from:
1) http://www.yahoo.com
2) http://www.cnn.com
3) http://www.foxnews.com
The data set has been used in: Z.-H. Zhou, K. Jiang, and M. Li. Multi-instance learning based web mining. Applied Intelligence, 2005, 22(2): 135-147.
ATTN: You are free to use the package (for academic purposes only) at your own risk. An acknowledgement or citation of the above paper is required. For other purposes, please contact Prof. Zhi-Hua Zhou (zhouzh@nju.edu.cn).
Download:
[datafile] (30.2Mb)
2. Details
The 113 web index pages were labeled by 9 volunteers according to their interests, which yields 9 data sets. If a volunteer is interested in at least one page linked from an index page, that index page is labeled as positive; otherwise it is labeled as negative. The linked pages themselves carry no labels.
For each of the 9 data sets, 75 web index pages are randomly selected as training examples, and the remaining 38 pages are used as test examples. The training and test sets are named v1.train, v1.test, ...
The class distributions of the data sets are:
---------------------------------------------------------------------
data    positive in    negative in    positive in    negative in
set     train set      train set      test set       test set
---------------------------------------------------------------------
v1      17             58             4              34
v2      18             57             3              35
v3      14             61             7              31
v4      56             19             33             5
v5      62             13             27             11
v6      60             15             29             9
v7      39             36             16             22
v8      35             40             20             18
v9      37             38             18             20
---------------------------------------------------------------------
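As a quick sanity check on the table above, the following sketch confirms that every row adds up to 75 training and 38 test pages and computes the overall share of positive bags for a data set. The names CLASS_COUNTS, check_split_sizes and positive_fraction are hypothetical and are not part of the package.

```python
# Class counts copied from the table above, per data set:
# (positive in train, negative in train, positive in test, negative in test)
CLASS_COUNTS = {
    "v1": (17, 58, 4, 34),
    "v2": (18, 57, 3, 35),
    "v3": (14, 61, 7, 31),
    "v4": (56, 19, 33, 5),
    "v5": (62, 13, 27, 11),
    "v6": (60, 15, 29, 9),
    "v7": (39, 36, 16, 22),
    "v8": (35, 40, 20, 18),
    "v9": (37, 38, 18, 20),
}


def check_split_sizes(counts, n_train=75, n_test=38):
    """Return the names of data sets whose counts do not match the split sizes."""
    bad = []
    for name, (pos_tr, neg_tr, pos_te, neg_te) in counts.items():
        if pos_tr + neg_tr != n_train or pos_te + neg_te != n_test:
            bad.append(name)
    return bad


def positive_fraction(counts, name):
    """Fraction of all 113 index pages labeled positive in one data set."""
    pos_tr, neg_tr, pos_te, neg_te = counts[name]
    return (pos_tr + pos_te) / (pos_tr + neg_tr + pos_te + neg_te)
```

The positive fraction varies widely between volunteers, so it may be worth checking before choosing an evaluation measure.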
In the "original" part of the package, there is a page that shows the IDs of the web index pages. In the "processed" part of the package, the 1st line of each .train or .test file shows the IDs of the web index pages that are included in the file.
In the .train and .test files, the web index pages are represented in multi-instance form. For example:
e01 {i11,i12,...,i1n}, ..., {im1,im2,...,imn},1.
where "e01" means that this is the 1st example (or bag) of the file, "{i11,i12,...,i1n}" is the 1st instance of e01, "i1j" is the value of the 1st instance of e01 on the jth attribute, and the final '1' means that this is a positive example.
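A minimal sketch of how one such line might be parsed is given below. It relies only on the layout illustrated above and makes several assumptions: the bag ID is separated from the instances by whitespace, each instance is enclosed in braces, and the label is an integer following the last closing brace. How negative examples are encoded is not stated here, so the label is returned as read. The function name parse_bag_line is hypothetical, and the 1st line of each file (the page IDs) would have to be skipped separately.

```python
import re

# Assumed layout, taken from the illustration above:
#   <bag id> {v1,v2,...}, ..., {v1,v2,...},<label>.
_INSTANCE_RE = re.compile(r"\{([^{}]*)\}")


def parse_bag_line(line):
    """Parse one bag line into (bag_id, instances, label).

    instances is a list of instances, each a list of raw attribute strings.
    """
    bag_id, rest = line.strip().split(None, 1)
    instances = [
        [value.strip() for value in body.split(",")]
        for body in _INSTANCE_RE.findall(rest)
    ]
    # Whatever follows the last closing brace is treated as the label,
    # after removing separators and the trailing period.
    tail = rest[rest.rfind("}") + 1:]
    label = int(tail.strip(" ,."))
    return bag_id, instances, label
```

Attribute values are kept as raw strings, so any term or frequency decoding can be added on top once the exact layout of the files has been checked.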
The examples contain different numbers of instances. The biggest example is the 18th, which comprises 200 instances. The smallest example is the 90th, which comprises only 4 instances. On average, each example contains 30.29 instances (3423/113).
Each instance is described by 20 attributes, which are the 1st to 20th most frequent terms appearing in the corresponding linked page. Note that it is not necessary to use all of these attributes. The frequency of each term is given in the brackets following the term.
In counting the occurrences of the frequent terms, 77 trivial terms are ignored (stoplist):
{', a, about, also, am, an, and, are, as, at, b, be, been, but, by,
can, com, could, didn't, do, doesn't, don't, during, for, from, had,
has, have, he, her, here, him, his, i, if, in, is, it, just, m, me,
might, no, not, of, on, or, our, out, over, she, so, still, td, that,
the, their, them, there, they, this, to, too, us, was, we, were, what,
where, when, who, whose, will, with, would, you, your}
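The term counting described above can be sketched as follows. The stoplist is copied from the list above; the tokenization (lowercasing, then taking runs of letters and apostrophes) is an assumption, since the exact preprocessing used to build the data sets is not described here. The names STOPLIST and top_terms are hypothetical.

```python
import re
from collections import Counter

# The 77 stoplist terms listed above.
STOPLIST = set("""
' a about also am an and are as at b be been but by
can com could didn't do doesn't don't during for from had
has have he her here him his i if in is it just m me
might no not of on or our out over she so still td that
the their them there they this to too us was we were what
where when who whose will with would you your
""".split())


def top_terms(text, k=20, stoplist=STOPLIST):
    """Return the k most frequent non-stoplist terms as (term, count) pairs."""
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stoplist)
    return counts.most_common(k)
```

With k=20 this yields one candidate instance per linked page; using a smaller k corresponds to using only some of the attributes.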
Moreover, to get rid of links to advertisements or other index pages, a linked page is considered an instance of an example only if its corresponding link in the index page contains at least four terms.
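The link constraint above can be illustrated with a small filter. This is only a sketch: it assumes that the "terms" of a link are the whitespace-separated words of its link text and that all words count, neither of which is specified here. The names has_enough_terms and select_instance_links, and the example URLs in the usage note, are hypothetical.

```python
def has_enough_terms(link_text, min_terms=4):
    """True if the link text has at least min_terms whitespace-separated words."""
    return len(link_text.split()) >= min_terms


def select_instance_links(links, min_terms=4):
    """Keep the (url, link_text) pairs whose link text is long enough.

    links is an iterable of (url, link_text) pairs taken from one index page.
    """
    return [(url, text) for url, text in links
            if has_enough_terms(text, min_terms)]
```

For example, a link whose text is "Sports" would be dropped, while one whose text is "Storm closes schools across region" would be kept as an instance.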