Data for Multi-Instance Learning Based Web Index Recommendation

1. Summary

This package contains two parts:

- The "original" part contains 113 web index pages and their links. Since every web index page has lots of links, this part is quite big, about 126Mb (30.9Mb after compression).

- The "processed" part contains 9 data sets for multi-instance learning. This part is not big, about 5.67Mb (1.36Mb after compression).

The web index pages are mainly from:

1) http://www.yahoo.com

2) http://www.cnn.com

3) http://www.foxnews.com

The data set has been used in: Z.-H. Zhou, K. Jiang, and M. Li. Multi-instance learning based web mining. Applied Intelligence, 2005, 22(2): 135-147.

ATTN: You are free to use the package (for academic purposes only) at your own risk. An acknowledgement or citation of the above paper is required. For other purposes, please contact Prof. Zhi-Hua Zhou (zhouzh@nju.edu.cn).

Download: [datafile] (30.2Mb)

2. Details

The 113 web index pages are labeled by 9 volunteers according to their interests. Therefore there are 9 data sets. If the volunteer is interested in at least one linked page of the index, then the web index page is labeled as positive. Otherwise the index page is labeled as negative. There is no label for the linked pages.
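The labeling rule above is the standard multi-instance assumption: a bag is positive if at least one of its instances is positive. A minimal Python sketch of that rule follows. The function name and the per-link interest flags are hypothetical; the package itself provides only the bag-level labels, not labels for the linked pages.

```python
from typing import Iterable


def label_index_page(linked_page_interest: Iterable[bool]) -> bool:
    """Return True (positive) if the volunteer is interested in at least
    one linked page of the index page, otherwise False (negative)."""
    return any(linked_page_interest)


# Hypothetical per-link judgements for two index pages.
print(label_index_page([False, True, False]))  # True
print(label_index_page([False, False, False]))  # False
```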

For each of the 9 data sets, 75 web index pages are randomly selected as training examples while the remaining 38 pages are used as test examples.
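The 75/38 split can be illustrated with a short sketch. This is not the procedure that produced the shipped files, whose splits are fixed; the function name, the seed, and the use of IDs 1 to 113 are assumptions made for illustration only.

```python
import random
from typing import Iterable, List, Tuple


def split_pages(page_ids: Iterable[int], n_train: int = 75,
                seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly pick n_train pages for training; the rest form the test set."""
    rng = random.Random(seed)  # local generator, so the split is reproducible
    shuffled = list(page_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


train_ids, test_ids = split_pages(range(1, 114))
print(len(train_ids), len(test_ids))  # 75 38
```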

The training and test sets are named v1.train, v1.test, ...

The class distributions of the data sets are:
--------------------------------------------------------------------------
data set   train positive   train negative   test positive   test negative
--------------------------------------------------------------------------
v1                     17               58               4              34
v2                     18               57               3              35
v3                     14               61               7              31
v4                     56               19              33               5
v5                     62               13              27              11
v6                     60               15              29               9
v7                     39               36              16              22
v8                     35               40              20              18
v9                     37               38              18              20
--------------------------------------------------------------------------
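As a quick consistency check, the counts in the table can be verified against the 75/38 split, and the overall share of positive pages per volunteer can be derived. The sketch below copies the numbers from the table; the dictionary and function names are hypothetical.

```python
# (train positive, train negative, test positive, test negative), from the table.
CLASS_COUNTS = {
    "v1": (17, 58, 4, 34),
    "v2": (18, 57, 3, 35),
    "v3": (14, 61, 7, 31),
    "v4": (56, 19, 33, 5),
    "v5": (62, 13, 27, 11),
    "v6": (60, 15, 29, 9),
    "v7": (39, 36, 16, 22),
    "v8": (35, 40, 20, 18),
    "v9": (37, 38, 18, 20),
}


def positive_fraction(name: str) -> float:
    """Share of the 113 index pages labeled positive by this volunteer."""
    tr_pos, tr_neg, te_pos, te_neg = CLASS_COUNTS[name]
    return (tr_pos + te_pos) / (tr_pos + tr_neg + te_pos + te_neg)


for name, (tr_pos, tr_neg, te_pos, te_neg) in CLASS_COUNTS.items():
    assert tr_pos + tr_neg == 75  # every training set has 75 pages
    assert te_pos + te_neg == 38  # every test set has 38 pages
    print(f"{name}: {positive_fraction(name):.2f} positive overall")
```

The three groups of volunteers are visible here: v1 to v3 are mostly negative, v4 to v6 mostly positive, and v7 to v9 roughly balanced.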

In the "original" part of the package, there is a page that shows the IDs of the web index pages. In the "processed" part of the package, the 1st line of each .train or .test file shows the IDs of the web index pages that are included in the file.

In the .train and .test files, the web index pages are represented in multi-instance form. For example:
e01 {i11,i12,...,i1n}, ..., {im1,im2,...,imn},1.
where "e01" means that this is the 1st example (or bag) of the file, "{i11,i12,...,i1n}" is the 1st instance of e01, "i1j" is the value of the 1st instance of e01 on the jth attribute, and the final "1" means that this is a positive example.
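A bag line of this shape could be read with a small parser such as the sketch below. It is based only on the example line above, not on an inspection of the actual files, so several points are assumptions: that each bag sits on one line, that attribute values contain no commas or braces, and that the label is the last comma-separated field, optionally followed by a period. The function name is hypothetical, and attribute values and the label are kept as raw strings. The 1st line of each file (the list of page IDs) would need to be skipped before calling it.

```python
import re
from typing import List, Tuple

# bag id, then everything up to the last "}", then ",label" with optional "."
BAG_RE = re.compile(r"^\s*(\S+)\s+(.*\})\s*,\s*(\S+?)\.?\s*$")
INSTANCE_RE = re.compile(r"\{([^{}]*)\}")


def parse_bag(line: str) -> Tuple[str, List[List[str]], str]:
    """Split one bag line into (bag id, list of instances, label string)."""
    match = BAG_RE.match(line)
    if match is None:
        raise ValueError(f"not a bag line: {line!r}")
    bag_id, body, label = match.groups()
    instances = [
        [value.strip() for value in instance.split(",")]
        for instance in INSTANCE_RE.findall(body)
    ]
    return bag_id, instances, label


# Toy line in the documented shape; the attribute values are placeholders.
print(parse_bag("e01 {a,b,c}, {d,e,f},1."))
# ('e01', [['a', 'b', 'c'], ['d', 'e', 'f']], '1')
```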

The examples contain different numbers of instances. The biggest example is the 18th, which comprises 200 instances. The smallest example is the 90th, which comprises only 4 instances. On average, each example contains 30.29 instances (3423/113).
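The average quoted above is simply the total number of instances divided by the number of bags. A minimal sketch of how such bag-size statistics could be computed is shown below; the helper name and the toy sizes are hypothetical, and only the figures 3423 and 113 come from the text.

```python
from typing import Sequence, Tuple


def bag_size_stats(bag_sizes: Sequence[int]) -> Tuple[int, int, float]:
    """Return (smallest, largest, mean) number of instances per bag."""
    return min(bag_sizes), max(bag_sizes), sum(bag_sizes) / len(bag_sizes)


# Reproduce the reported average from the totals given in the text.
print(round(3423 / 113, 2))  # 30.29

# Toy sizes for illustration only, not the real distribution.
print(bag_size_stats([200, 4, 30]))  # (4, 200, 78.0)
```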

Each instance is described by 20 attributes that are the 1st to 20th most frequent terms appearing in the corresponding linked page. Note that it is not necessary to use all these attributes. The frequencies of the terms are included in the brackets following the terms.

When counting term occurrences, 77 trivial terms (the stoplist) are ignored:

   {', a, about, also, am, an, and, are, as, at, b, be, been, but, by,
    can, com, could, didn't, do, doesn't, don't, during, for, from, had,
    has, have, he, her, here, him, his, i, if, in, is, it, just, m, me,
    might, no, not, of, on, or, our, out, over, she, so, still, td, that,
    the, their, them, there, they, this, to, too, us, was, we, were, what,
    where, when, who, whose, will, with, would, you, your}
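The attribute construction described above (count terms in a linked page, ignore the stoplist, keep the 20 most frequent) can be sketched as follows. The stoplist is copied from the list above; the tokenizer (lowercasing and keeping runs of letters and apostrophes) and the function name are assumptions, since the original tokenization is not described here.

```python
import re
from collections import Counter
from typing import List, Tuple

# The 77 stoplist terms, copied from the list above.
STOPLIST = set("""' a about also am an and are as at b be been but by
can com could didn't do doesn't don't during for from had
has have he her here him his i if in is it just m me
might no not of on or our out over she so still td that
the their them there they this to too us was we were what
where when who whose will with would you your""".split())


def top_terms(text: str, k: int = 20) -> List[Tuple[str, int]]:
    """Return the k most frequent non-stoplist terms with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())  # assumed tokenizer
    counts = Counter(t for t in tokens if t not in STOPLIST)
    return counts.most_common(k)


print(len(STOPLIST))  # 77
print(top_terms("news news news sports sports the and weather", k=3))
# [('news', 3), ('sports', 2), ('weather', 1)]
```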

Moreover, to exclude links to advertisements or to other index pages, a linked page is treated as an instance of an example only if its corresponding link in the index page contains at least four terms.
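This link filter can be sketched in a few lines. The sketch assumes that "terms" are whitespace-separated words of the link text and that the count is taken before any stoplist filtering; neither detail is stated above, and the function name and sample link texts are hypothetical.

```python
def keep_link(link_text: str, min_terms: int = 4) -> bool:
    """Keep a link only if its text has at least min_terms terms."""
    return len(link_text.split()) >= min_terms


# Hypothetical link texts from an index page.
links = ["Sports", "Click here", "Senate passes new budget bill"]
print([text for text in links if keep_link(text)])
# ['Senate passes new budget bill']
```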

(for FireFox 3+ and IE 7+)	Contact LAMDA: (email) contact@lamda.nju.edu.cn (tel) +86-025-89681608	© LAMDA, 2023

