Assignment 1

Modified: 2014/09/28 12:27 by admin - Uncategorized

We have created a Google Group for discussion. If you have any questions, please post them there so that everybody can see our replies, and please do not email your questions to us directly.

The Task: Implement the Naive Bayes algorithm

Description:

Implement the Naive Bayes method for discrete attributes (named NBD).
Implement the Naive Bayes method for continuous attributes in two ways: 1. discretize the continuous attributes and then apply NBD (named NBCD); 2. use a Gaussian distribution to estimate the conditional probabilities (named NBCG).
Conduct 10-fold cross-validation on the benchmark data with TF features using NBD, and report the mean and standard deviation of the accuracy.
Conduct 10-fold cross-validation on the benchmark data with TF-IDF features using NBCD and NBCG, and report the mean and standard deviation of the accuracy.
Write a brief report showing your results, and consider the question: among the three methods for the same task, which one is better? Can you give an analysis?
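
As a rough illustration of the pieces listed above, the following is a minimal sketch, not a reference solution. It assumes each sample is a tuple of discrete attribute values, uses Laplace smoothing with `alpha=1.0` (an assumed choice, not prescribed by the assignment), and reports the population standard deviation across folds. All function names (`train_nbd`, `predict_nbd`, `gaussian_log_pdf`, `k_fold_accuracy`) are hypothetical.

```python
import math
import random
from collections import Counter
from statistics import mean, pstdev


def train_nbd(X, y, alpha=1.0):
    """Count class and per-attribute value frequencies for discrete data."""
    class_counts = Counter(y)
    n_attrs = len(X[0])
    # value_counts[c][j][v]: number of class-c samples with attribute j == v
    value_counts = {c: [Counter() for _ in range(n_attrs)] for c in class_counts}
    # values[j]: distinct values of attribute j seen during training
    values = [set() for _ in range(n_attrs)]
    for row, label in zip(X, y):
        for j, v in enumerate(row):
            value_counts[label][j][v] += 1
            values[j].add(v)
    return class_counts, value_counts, values, alpha


def predict_nbd(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, value_counts, values, alpha = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)  # log prior
        for j, v in enumerate(row):
            # Laplace-smoothed estimate of P(attribute j = v | class c)
            num = value_counts[c][j][v] + alpha
            den = n_c + alpha * len(values[j])
            score += math.log(num / den)
        if score > best_score:
            best_label, best_score = c, score
    return best_label


def gaussian_log_pdf(x, mu, var):
    """Log density of N(mu, var) at x; the per-attribute term NBCG would use."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def k_fold_accuracy(X, y, train, predict, k=10, seed=0):
    """Shuffle once, split into k folds, and return (mean, std) of accuracy."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in idx if i not in held_out]
        model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in fold)
        accuracies.append(correct / len(fold))
    # pstdev is the population standard deviation; statistics.stdev is the
    # sample version. Which one to report is a choice, not given by the task.
    return mean(accuracies), pstdev(accuracies)
```

Working in log space avoids numerical underflow when many attribute probabilities are multiplied. For NBCG, the smoothed frequency term would be replaced by `gaussian_log_pdf` with a per-class mean and variance estimated from the training folds. A small variance floor is likely needed there, which is an assumption not covered by this sketch.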

Benchmark Dataset

Data: Forum Classification based on the posts (from the Lily BBS) [Download]

Forum Name	Chinese Name	Number of posts
D_Computer	计算机系	138
V_Suggestions	校长信箱	171
Mobile	手机天地	192
Basketball	篮球	178
JobExpress	就业特快	113
Stock	股市风云	185
WarAndPeace	百年好合	156
Girls	女生天地	170
FleaMarket	跳蚤市场	159
WorldFootball	世界足球	167

Format: The zip file contains 10 txt files. Each file contains more than 100 lines, and each line represents a raw post, where the post is kept in the form of "title\tcontent", for example
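
A minimal sketch of splitting lines in this "title\tcontent" form, assuming the lines have already been read and decoded (the file encoding is not stated here). The helper names `parse_post` and `parse_posts` are hypothetical.

```python
def parse_post(line):
    """Split one raw line into (title, content) at the first tab."""
    title, _, content = line.rstrip("\r\n").partition("\t")
    return title, content


def parse_posts(lines):
    """Parse an iterable of raw lines, skipping lines that are entirely blank."""
    return [parse_post(line) for line in lines if line.strip()]
```

Splitting only at the first tab keeps any further tabs inside the content, and a line with no tab at all yields an empty content string rather than raising an error, which is one simple way to tolerate noisy lines.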

Preprocess: (You may use any language or tool for this step.) You need to extract the TF or TF-IDF features yourself from the raw text data. A quick introduction can be found here. In our case, you also need to tokenize the Chinese sentences to obtain the words. We recommend using an existing tool for the tokenization, such as jieba for Python or IKAnalyzer for Java. When you encode the words as features, you may also want to consider stop words. We provide a Chinese stop word list here, though you may find others that work better. The data may contain some noise. Real data mining tasks always involve noise, so try to deal with it yourself, and DO NOT complain about it. Please do the preprocessing carefully, since it will also be used in the following assignments.
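
The feature extraction described above could be sketched as follows. This assumes each post has already been tokenized into a list of words (for example with `jieba.lcut`, which is not called here), and it uses the plain `log(N / df)` variant of IDF, which is only one of several common definitions. The function names `term_frequencies` and `tf_idf` are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(tokens, stop_words=frozenset()):
    """Raw term counts for one post, ignoring stop words and blank tokens."""
    return Counter(t for t in tokens if t.strip() and t not in stop_words)


def tf_idf(docs_tokens, stop_words=frozenset()):
    """Return one {term: tf * idf} dict per post, with idf = log(N / df)."""
    tfs = [term_frequencies(tokens, stop_words) for tokens in docs_tokens]
    n_docs = len(tfs)
    # df[term]: number of posts that contain the term at least once
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return [{term: count * idf[term] for term, count in tf.items()} for tf in tfs]
```

With this IDF variant, a term that appears in every post gets weight zero. In a cross-validation setting, the document frequencies would normally be computed on the training folds only, so that nothing from the held-out fold leaks into the features.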

Programming Language

Python / Java / MATLAB
Once you pick one of them, all following assignments MUST use the SAME language; otherwise you will NOT receive the corresponding grades.

Submission

Please use this MSWord template to report your results.
Do NOT plagiarize; plagiarism will be seriously penalized. Be careful when writing your report: whenever you use the words or work of others, cite them clearly so that one can tell which part is actually yours. Details on how plagiarism is identified can be found in "Introduction to the Guidelines for Handling Plagiarism Complaints".
Do NOT falsify results; data fraud will be penalized even more seriously. Record your results honestly in the report, and NEVER EVER modify the performance results manually.
Pack your report and code into a zip file named with your student ID, e.g., 'MG1433001.zip'. If you submit more than once, append '_' and a number, such as 'MG1433001_1.zip'. We will use the version with the largest number as your final submission.
- The file format must be zip; no other format is acceptable!
- NO submission will be accepted after the deadline!
- NO email submission will be accepted!

Upload your file via FTP (please use FTP client software; do not use Windows Explorer or IE):
ftp://lamda.nju.edu.cn/mg_dm14/assignment1/
username/password: will be announced in the first class

Evaluation

We will evaluate your submission according to your implementation and report.

For implementation:

Efficiency
Performance
Code style

For report:

Technique: clearly explain all the components you used in your implementation
Language: concise, precise, and logical.

If plagiarism is identified, no score will be given for the report.

Contact TA

Mr. Qing Da and Mr. Yue Zhu
