Assignment 2 Requirements & Suggestions

Modified: 2012/04/25 14:36 by admin - Uncategorized

Task Overview

Location-based social networks are a recently emerged kind of mobile service, offered by providers such as Foursquare, Gowalla, Jiepang (in China), and many others. In such services, a user can typically add other users as friends and check in at places.

In this task, you are provided with a data set from Gowalla.com, including a friend-relationship network and check-in records. You are going to:
  1. decide what you want to do with the data and which data mining problem that poses
  2. find and read papers related to your problem
  3. design a data mining approach to solve your problem
  4. implement your approach
  5. carry out experiments on the data set to assess your approach
  6. write a report

For details, please read the rest of this page.

The author holds the copyright of the report. The report will not be disclosed to any third party without the author's consent.

Data

Download: the friend-relationship network data (4.6MB) and the check-in records (85.5MB). (The data is distributed under the BSD license.)
In Gowalla_edges.txt, each line indicates a friendship between two users. In Gowalla_totalCheckins.txt, each line is a check-in record. The following image explains the content of the two files:

[Image: content of the two data files]
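
As a starting point, the two files can be read with a few lines of Python. This is only a sketch under an assumed layout, since the layout is described by the image rather than in the text: each line of Gowalla_edges.txt is taken to hold two whitespace-separated user IDs, and each line of Gowalla_totalCheckins.txt is taken to hold a user ID, a UTC timestamp such as 2010-10-19T23:55:27Z, a latitude, a longitude, and a location ID. The names CheckIn, parse_edges, and parse_checkins are illustrative, not part of the data set.

```python
from datetime import datetime, timezone
from typing import Iterable, Iterator, NamedTuple, Tuple


class CheckIn(NamedTuple):
    user: int
    time: datetime
    lat: float
    lon: float
    location: int


def parse_edges(lines: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Yield (user_a, user_b) for every line with exactly two fields."""
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            yield int(fields[0]), int(fields[1])


def parse_checkins(lines: Iterable[str]) -> Iterator[CheckIn]:
    """Yield one CheckIn per line that has the five assumed fields."""
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        user, stamp, lat, lon, location = fields
        # Assumed timestamp format: ISO 8601 in UTC with a trailing "Z".
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        when = when.replace(tzinfo=timezone.utc)
        yield CheckIn(int(user), when, float(lat), float(lon), int(location))


# Made-up sample lines in the assumed layout (not taken from the files).
sample_edges = ["0\t1\n", "0\t2\n"]
sample_checkins = ["0\t2010-10-19T23:55:27Z\t30.2359091167\t-97.7951395833\t22847\n"]
edges = list(parse_edges(sample_edges))
checkins = list(parse_checkins(sample_checkins))
```

Both functions accept any iterable of lines, so an open file object can be passed directly, e.g. `with open("Gowalla_edges.txt") as f: edges = list(parse_edges(f))`. Iterating lazily may be preferable for the larger check-in file.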

The task

What are you going to do with the data? It is totally up to you.

Here are some frequently asked questions. The answers are suggestions, NOT requirements:

Q: Are the latitude and longitude records real?
A: Yes, they are real. You can locate a place by searching for its coordinates, say 30.2359091167,-97.7951395833, directly in Google Maps.

Q: What programming language should I use to implement my approach?
A: Any programming language or environment is acceptable. However, some are particularly convenient for data mining. You are encouraged to check out:
  • R Project, an open-source statistical environment with many data mining packages, based on the R language
  • Weka, an open-source set of data mining/machine learning packages for Java
  • MATLAB, for which many data mining resources exist; unfortunately, it is neither open-source nor free software.
  • Scilab and Octave, which are good open-source replacements for MATLAB. MATLAB code can be converted to Scilab code and to Octave code.

Q: In the textbook, the input to many learning algorithms is a set of feature vectors, but the data does not seem to contain any obvious feature vectors. What should I do?
A: You need to extract feature vectors from the given data yourself.
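
To make this concrete, here is a minimal sketch of one possible feature extraction, not a recommended design. It assumes check-ins are available as (user, latitude, longitude, location ID) tuples and friendships as (user, user) pairs. The function name user_features and the five chosen features are illustrative assumptions; your own problem will likely call for different ones.

```python
from collections import defaultdict


def user_features(checkins, edges):
    """Map each user ID to a feature vector:
    [n_checkins, n_distinct_places, mean_lat, mean_lon, n_friends].

    checkins: iterable of (user, lat, lon, location_id) tuples.
    edges:    iterable of (user_a, user_b) friendship pairs.
    Users without any check-in are left out.
    """
    lats = defaultdict(list)
    lons = defaultdict(list)
    places = defaultdict(set)
    friends = defaultdict(set)

    for user, lat, lon, location in checkins:
        lats[user].append(lat)
        lons[user].append(lon)
        places[user].add(location)

    for a, b in edges:
        # Store both directions in sets, so the count is the same whether
        # or not the input lists each friendship twice.
        friends[a].add(b)
        friends[b].add(a)

    features = {}
    for user in lats:
        n = len(lats[user])
        features[user] = [
            n,
            len(places[user]),
            sum(lats[user]) / n,
            # Plain average; it ignores the wrap-around at +/-180 degrees.
            sum(lons[user]) / n,
            len(friends[user]),
        ]
    return features


# Toy input: two check-ins by user 0, one by user 1.
toy = user_features(
    [(0, 30.0, -97.0, 10), (0, 32.0, -99.0, 10), (1, 40.0, -74.0, 11)],
    [(0, 1), (0, 2)],
)
```

The resulting vectors can be fed to any algorithm that expects fixed-length numeric input. Features on very different scales (counts versus degrees) usually need normalization first.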

(More FAQ entries will be added over time.)

Assess your approach

After designing and implementing your data mining approach, you are required to assess its performance through experiments on the data set. The goal of the assessment is to verify that your approach does what it was designed to do and, in particular, to show that your results are meaningful.

The criteria for the assessment include, but are not limited to: prediction error, time consumption, visualization, and interpretation.
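
As an illustration of the first two criteria, the sketch below measures a prediction error and the running time of a call. Mean absolute error is just one of many possible error measures and is chosen here as an assumption; the helper names mean_absolute_error and timed are hypothetical.

```python
import time


def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def timed(func, *args, **kwargs):
    """Call func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Example: time the error computation itself on toy numbers.
error, seconds = timed(mean_absolute_error, [1.0, 2.0, 3.0], [1.0, 3.0, 5.0])
```

A single timing is noisy, so repeating the measurement several times and reporting the mean or median is usually more trustworthy.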

In the assessment, please pay attention to the validity of your experiments. For example, if you draw a conclusion from data set A, you need to verify that the conclusion holds on data set B, where A and B have no overlap. Cross-validation is a common method for this. You can also learn how to design experiments from the papers you have read. Invalid experiments will not receive a high score.
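
The non-overlap requirement can be sketched with a plain k-fold split. The function below is a minimal illustration, assuming samples are addressed by index; the name k_fold_indices and the seed parameter are assumptions made for this example.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Every index appears in exactly one test fold, and the train and test
    indices of each pair never overlap.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps runs repeatable
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Example: 5 folds over 10 samples.
splits = list(k_fold_indices(10, 5))
```

For each pair, fit the model on the training indices only and evaluate it on the test indices, then average the k scores. Note that splitting check-in records at random may still let information about the same user appear on both sides; splitting by user or by time may be more appropriate, depending on your problem.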

You are encouraged to compare your approach with existing approaches, though this is not a requirement.

Write a report

You are required to write a report that should:
  • Explain what you are going to do with the data and why.
  • Introduce your data mining problem.
  • Present your approach, from the concepts to the details. Use pseudocode where necessary. Do not list all of your raw source code.
  • Describe your experiments, from the settings to the results.
  • Discuss the limitations of your approach, and list potential solutions.

Please use the template to write your report in Chinese. You should follow the outline of the template strictly, and note:
  • Do NOT modify the black text in the template; only replace the red text with your own words.
  • Save your finished DOC file as a PDF file for submission.

You must NOT plagiarize:
  • Whenever you use the words, tables, figures, or any other work of others, cite the source clearly.
  • A continuous verbatim copy of more than 50 English or Chinese words counts as plagiarism, regardless of citations.

Submit your work

Pack your source code and your PDF report into a compressed file (e.g., a ZIP file) named with your student ID (e.g., '091221154.ZIP').
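
If you prefer to script the packaging, the following sketch builds such an archive with Python's standard zipfile module. The function name pack_submission and its parameters are assumptions for illustration; any archiving tool that produces the same file is equally fine.

```python
import zipfile
from pathlib import Path


def pack_submission(student_id, report_pdf, source_dir, out_dir="."):
    """Create <student_id>.ZIP with the PDF report and all source files.

    Write the archive outside source_dir, otherwise it would try to
    include itself.
    """
    report_pdf = Path(report_pdf)
    source_dir = Path(source_dir)
    archive = Path(out_dir) / f"{student_id}.ZIP"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        # The report goes to the top level of the archive.
        zf.write(report_pdf, arcname=report_pdf.name)
        # Source files keep their layout under the source folder's name.
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                inner = Path(source_dir.name) / path.relative_to(source_dir)
                zf.write(path, arcname=inner.as_posix())
    return archive
```

For example, `pack_submission("091221154", "report.pdf", "src")` would write 091221154.ZIP to the current directory, assuming report.pdf and a src folder exist there.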

Upload your file via FTP (using an FTP client is recommended rather than Windows Explorer or IE):
     ftp://lamda.nju.edu.cn/dm/assignment2/
     username: dm12
     password: dm12

Reports in other file formats will not be accepted.
Submissions after the deadline will not be accepted.
Email submissions will not be accepted.

How your work is evaluated

A report with a fully explained idea, technical details, and valid experiments will receive a good score.

No score will be given for plagiarized work or fabricated experiments.
(You should submit your complete source code. If anything in the report looks suspicious, I will run your code.)

Presentation

Five randomly selected students will give presentations on their work, addressing:
  • What do you want to mine from the data?
  • Why is it interesting to you?
  • What is your approach to mining it?
  • How is the approach evaluated?
  • What is the evaluation result?

License

The data is distributed under the BSD license:
* Copyright (c) 2007-2010, Jure Leskovec
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*     * Redistributions of source code must retain the above copyright
*       notice, this list of conditions and the following disclaimer.
*     * Redistributions in binary form must reproduce the above copyright
*       notice, this list of conditions and the following disclaimer in the
*       documentation and/or other materials provided with the distribution.
*     * Neither the name of Stanford University nor the
*       names of its contributors may be used to endorse or promote products
*       derived from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.