assignments5

Modified: 2014/09/23 02:42 by admin - Uncategorized

We have created a Google Group for discussions. If you have any questions, please post them there so that everybody can see our replies, and please do not email your questions to us directly.

The Task: Implement Stacking with Decision Trees

Description:

Implement a decision tree (named DT) that: 1. supports both classification and regression problems; 2. handles both numerical and discrete attributes; 3. uses a pruning technique to avoid over-fitting (optional).
Implement the Stacking method with DT (named SDT). Stacking trains a learning algorithm to combine the predictions of base classifiers. First, all of the base classifiers are trained on the available data; then a meta classifier is trained to make the final prediction, using the predictions of all the base classifiers as its inputs. Here is a simple example of the training procedure:
- Split data X into K groups {D1,...,DK}, whose labels are {Y1,...,YK}
- Initialize N decision trees with different parameters (maximum depth, minimum number of samples in a leaf, ...)
- Construct the meta-level dataset M={}
- for i=1 to K
    - S=[]
    - for j=1 to N
        - Train a decision tree fj on data X-Di (X with group Di removed)
        - Predict Di as Pij using fj
        - S=[S Pij]
    - M = M U {(S, Yi)}
- Train a meta classifier H (you can also use DT or the logistic regression you have implemented) on M
- for j=1 to N
    - Train a decision tree fj on full data X
Then, to predict on a test set T:
- Construct the meta-level feature set F=[]
- for j=1 to N
    - Predict T as Pj using fj
    - F=[F Pj]
- Predict on the meta-level features F using the meta classifier H
Here K is usually set to 10, and N is the number of base classifiers.
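The training and prediction steps above can be sketched in Python as follows. This is only an illustrative sketch, not a reference solution: `Stump` (a depth-1 tree on one numeric feature) is a hypothetical stand-in for your own DT, `TableMeta` (a lookup table over base predictions) is a hypothetical stand-in for the meta classifier H, and the function names and the toy data are assumptions made for the example. Folds are taken by striding over the indices; on real data you would shuffle first.

```python
def majority(labels):
    """Most frequent label in a non-empty list."""
    return max(set(labels), key=labels.count)


class Stump:
    """Hypothetical stand-in for DT: a depth-1 tree on one numeric feature."""

    def __init__(self, feature):
        self.feature = feature  # the parameter that makes each base learner differ

    def fit(self, X, y):
        col = [row[self.feature] for row in X]
        vals = sorted(set(col))
        # Candidate thresholds: midpoints between consecutive distinct values.
        cands = [(a + b) / 2 for a, b in zip(vals, vals[1:])] or vals
        best = None
        for thr in cands:
            left = [t for v, t in zip(col, y) if v <= thr]
            right = [t for v, t in zip(col, y) if v > thr]
            l_lab, r_lab = majority(left), majority(right or left)
            errors = sum(t != l_lab for t in left) + sum(t != r_lab for t in right)
            if best is None or errors < best[0]:
                best = (errors, thr, l_lab, r_lab)
        _, self.thr, self.left, self.right = best
        return self

    def predict(self, X):
        return [self.left if row[self.feature] <= self.thr else self.right for row in X]


class TableMeta:
    """Hypothetical meta classifier H: majority label per tuple of base predictions."""

    def fit(self, X, y):
        self.default = majority(y)  # used for prediction tuples never seen in M
        groups = {}
        for row, t in zip(X, y):
            groups.setdefault(tuple(row), []).append(t)
        self.table = {key: majority(v) for key, v in groups.items()}
        return self

    def predict(self, X):
        return [self.table.get(tuple(row), self.default) for row in X]


def build_meta_dataset(X, y, make_bases, k=10):
    """Build M from out-of-fold predictions of freshly created base learners."""
    n = len(X)
    meta_X, meta_y = [], []
    for f in range(k):
        fold = list(range(f, n, k))  # indices of D_i
        held = set(fold)
        train = [i for i in range(n) if i not in held]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        X_ho = [X[i] for i in fold]
        preds = [m.fit(X_tr, y_tr).predict(X_ho) for m in make_bases()]  # P_i1..P_iN
        meta_X.extend(list(row) for row in zip(*preds))  # one row of N predictions per sample
        meta_y.extend(y[i] for i in fold)
    return meta_X, meta_y


def fit_stacking(X, y, make_bases, make_meta, k=10):
    meta_X, meta_y = build_meta_dataset(X, y, make_bases, k)
    meta = make_meta().fit(meta_X, meta_y)       # train H on M
    bases = [m.fit(X, y) for m in make_bases()]  # retrain every f_j on the full data
    return bases, meta


def predict_stacking(bases, meta, T):
    preds = [m.predict(T) for m in bases]        # P_1..P_N
    return meta.predict([list(row) for row in zip(*preds)])


# Toy, well-separated two-class data (an assumption made for the demo).
X = [[i, 50 - i] for i in range(10)] + [[100 + i, -i] for i in range(10)]
y = [0] * 10 + [1] * 10
make_bases = lambda: [Stump(0), Stump(1)]
bases, meta = fit_stacking(X, y, make_bases, TableMeta, k=5)
predictions = predict_stacking(bases, meta, X)
```

Note that the meta classifier only ever sees predictions made on data the base learners were not trained on; training H on in-sample predictions would let it over-trust base learners that over-fit.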

Report the performance (accuracy for classification and root mean square error for regression, both measured by 10-fold cross-validation) of both the DT and SDT methods on the benchmark datasets given below.
Write a brief report presenting your results. Has the Stacking technique improved the performance? If not, what might be the reason?
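As a rough illustration of the two metrics and of 10-fold cross-validation, here is a minimal Python sketch. The function names and the `fit_predict` callback signature are assumptions made for this example, and the score is averaged over folds, which is one common convention (pooling all held-out predictions before computing RMSE is another and can give a slightly different number).

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted targets."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def cross_validate(X, y, fit_predict, metric, k=10):
    """Average `metric` over k folds.

    `fit_predict(X_train, y_train, X_test)` is a hypothetical callback that
    trains a model (DT or SDT) and returns its predictions for X_test.
    Folds are taken by striding; shuffle the data beforehand on real datasets.
    """
    n = len(X)
    scores = []
    for f in range(k):
        test = list(range(f, n, k))
        held = set(test)
        train = [i for i in range(n) if i not in held]
        y_pred = fit_predict([X[i] for i in train], [y[i] for i in train],
                             [X[i] for i in test])
        scores.append(metric([y[i] for i in test], y_pred))
    return sum(scores) / k


def mean_baseline(X_train, y_train, X_test):
    """Trivial regressor used only to exercise the harness."""
    mean = sum(y_train) / len(y_train)
    return [mean] * len(X_test)
```

For example, `cross_validate(X, y, mean_baseline, rmse)` gives a baseline RMSE that any reasonable regression tree should beat.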

Benchmark Dataset

Classification: The same as Assignment 1, breast-cancer.data, segment.data
Regression: housing.data, meta.data

The last column stores the target attribute (label), while the first row indicates the type (discrete or numeric) of each feature.
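A loader for this layout might look like the sketch below. It rests on assumptions that the page does not confirm: fields are taken to be comma-separated, and a type entry is treated as numeric when it starts with "num" (case-insensitive). The name `parse_dataset` and its parameters are hypothetical, so check the actual files and adjust accordingly.

```python
def parse_dataset(lines, delimiter=",", numeric_target=False):
    """Parse rows where the first row gives feature types and the last column is the target.

    Assumptions (not confirmed by the assignment page): `delimiter` separates
    fields, and a type entry beginning with "num" marks a numeric feature.
    """
    rows = [[c.strip() for c in line.split(delimiter)] for line in lines if line.strip()]
    types, records = rows[0], rows[1:]
    n_feat = len(records[0]) - 1  # every column except the last is a feature
    is_numeric = [t.lower().startswith("num") for t in types[:n_feat]]
    X = [[float(v) if num else v for v, num in zip(rec[:n_feat], is_numeric)]
         for rec in records]
    # Regression targets are converted to float; class labels stay as strings.
    y = [float(rec[-1]) if numeric_target else rec[-1] for rec in records]
    return is_numeric, X, y


sample = ["numeric,discrete,discrete", "1.5,red,yes", "2,blue,no"]
is_numeric, X_demo, y_demo = parse_dataset(sample)
```

Keeping the `is_numeric` flags next to the data lets the tree choose threshold splits for numeric features and per-value splits for discrete ones.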

Programming Language

Use the same language you chose for the first assignment.

Submission

Please use this MSWord template to report your results.
Do NOT plagiarize; plagiarism will be seriously penalized. Be careful when writing your report: whenever you use the words or work of others, cite them clearly so that one can tell which parts are actually yours. Details on how plagiarism is identified can be found in "Introduction to the Guidelines for Handling Plagiarism Complaints".
Do NOT falsify results; data fraud will be penalized even more seriously. Record your results honestly in the report, and NEVER EVER modify the performance results manually.
Pack your report and code into a zip file named with your student ID, e.g., 'MG1433001.zip'. If you submit more than once, append '_' and a number, such as 'MG1433001_1.zip'. We will use the version with the largest number as your final submission.
- The file format should be zip, no other format is acceptable!
- NO submission after the deadline is acceptable!
- NO email submission will be accepted!

Upload your file to FTP (please use FTP software to upload; do not use Windows Explorer or IE):
ftp://lamda.nju.edu.cn/mg_dm14/assignment5/
username/password: You will be informed in the first class

Evaluation

We will evaluate your submission according to your implementation and report.

For implementation:

Efficiency
Performance
Code style

For report:

Technique: clearly explain all the components you used in your implementation
Language: concise, precise, and logical

If plagiarism is identified, no score will be given for the report.

Contact TA

Mr. Qing Da and Mr. Yue Zhu
