Data Mining (Fall, 2013)

Modified: 2015/01/21 16:54 by admin



  • Course Number: 081202B3
  • For: M.Sc. students of the Department of Computer Science and Technology, Nanjing University.
  • Classroom: 207 Xian-II, Xianlin Campus
  • Time: 14:00 - 15:50, Tuesday
  • Office Hour: 16:00 - 17:00, Tuesday (Rm 919, Computer Science Building)
  • Main Reference Books:
    • D. Hand, H. Mannila, P. Smyth. Principles of Data Mining. Cambridge, MA: MIT Press, 2001.
    • J. Han, M. Kamber. Data Mining: Concepts and Techniques, 2nd edition. Morgan Kaufmann Publishers, 2006.
    • I. H. Witten, E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition. Morgan Kaufmann Publishers, 2011.
    • P.-N. Tan, M. Steinbach, V. Kumar. Introduction to Data Mining, Addison-Wesley, 2006.
    • E. Alpaydin. Introduction to Machine Learning, 2nd edition. MIT Press, 2010.
    • C. M. Bishop. Pattern Recognition and Machine Learning, Springer, 2007.
  • Grading: Final exam (30%) + assignments (70%)
  • TA: Mr. Sheng-Jun Huang and Mr. Qing Da
  • Final Exam: 10:00 - 12:00, Jan. 3, 2014, 206 and 207 Xian-I



Please follow the instructions on the TA page:

Assignment 1 (10%): Write a report on data mining applications. Due on 23:59:59, Sep. 24, 2013

Assignment 2 (5%): Implement a Decision Tree learning algorithm. Due on 23:59:59, Oct. 8, 2013

Assignment 3 (5%): Implement a Naive Bayes classification algorithm. Due on 23:59:59, Oct. 15, 2013

Assignment 4 (5%): Implement the Random Forest and AdaBoost algorithms. Due on 23:59:59, Oct. 22, 2013

Assignment 5 (5%): Implement a k-Means clustering algorithm. Due on 23:59:59, Oct. 29, 2013

Assignment 6 (20%): Mining from a real-world data set (1). Due on 23:59:59, Nov. 19, 2013

Assignment 7 (20%): Mining from a real-world data set (2). Due on 23:59:59, Dec. 17, 2013


Schedule and Lecture slides

Sep. 17: Introduction (Download PDF) Reading material:
Z.-H. Zhou. Three perspectives of data mining. Artificial Intelligence, 2003, 143(1): 139-146.
H.-P. Kriegel, et al. Future trends in data mining. Data Mining and Knowledge Discovery, 2007, 15(1): 87-97.
Q. Yang and X. Wu. 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making, 2006, 5(4): 597-604.
Sep. 24: Data, Measurements, and Visualization (Download PDF) Reading material:
M. C. F. de Oliveira and H. Levkowitz. From visual data exploration to visual data mining: A survey. IEEE TVCG, 2003, 9(3): 378-394.
H. Liu, F. Hussain, C. L. Tan, and M. Dash. Discretization: An enabling technique. DMKD, 2002, 6(4): 393-423.
J. Dougherty, R. Kohavi, M. Sahami. Supervised and unsupervised discretization of continuous features. In Proceedings of ICML'95, 194-202, Tahoe City, CA.
X. Zhu and X. Wu. Class noise vs. attribute noise: A qualitative study of their impacts. AI Review, 2004, 22(3-4): 177-210.
Link: A JavaScript script for simple data visualization
Machine Learning I: Decision Tree (Download PDF) Reading material:
Chapter 9 of Introduction to Machine Learning (E. Alpaydin, MIT Press, 2010).
R. Quinlan. Induction of decision trees. MLJ, 1:81-106, 1986.
Oct. 8: Machine Learning II: Principle of Learning (Download PDF) Reading material:
Chapter 2 of Introduction to Machine Learning (E. Alpaydin, MIT Press, 2010).
L. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.
Machine Learning III: Bayesian Classifiers (Download PDF) Reading material:
D. Heckerman. Bayesian networks for data mining. DMKD, 1997, 1(1): 79-119.
H. Zhang. The Optimality of Naive Bayes. FLAIRS Conference 2004.
F. Zheng and G. I. Webb. A Comparative Study of Semi-naive Bayes Methods in Classification Learning. In AusDM'05, 141-156.
Oct. 15: Machine Learning IV: Ensemble Methods (Download PDF) Reading material:
L. Breiman. Random forests. Machine Learning, 2001, 45(1): 5-32.
Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms, Boca Raton, FL: Chapman & Hall/CRC, 2012. (Chapter 2: Boosting).
E. Bauer and R. Kohavi. An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine Learning, 1999, 36(1):105-139.
Oct. 22: Machine Learning V: Unsupervised learning (Download PDF) Reading material:
Chapters 7 and 8 of Introduction to Machine Learning (E. Alpaydin, MIT Press, 2010).
V. Estivill-Castro. Why so many clustering algorithms - a position paper. SIGKDD Explorations, 2002, 4(1): 65-75.
R. Xu and D. Wunsch II. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 2005, 16(3): 645-678.
C. Elkan. Using the Triangle Inequality to Accelerate k-Means. ICML'03, 147-153.
Oct. 29: Machine Learning VI: Linear Models (Download PDF) Reading material:
Chapters 3, 4, 6, and 7 of Pattern Recognition and Machine Learning (C. M. Bishop, Springer, 2007)
C. J. C. Burges. A tutorial on support vector machines for pattern recognition. DMKD, 1998, 2(2): 121-167.
K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE TNN, 2001, 12(2): 181-201.
Nov. 5: Machine Learning VII: Neural Networks and Nearest Neighbors (Download PDF) Reading material:
A. Roy. Artificial neural networks - A science in trouble. SIGKDD Explorations, 2000, 1(2): 33-38.
G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313:504-507, 2006.
A. Andoni and P. Indyk. Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. CACM, 2008, 51(1): 117-121.
Experiment Design (Download PDF)
Nov. 12: Data Mining I: Feature Processing (Download PDF) Reading material:
A. L. Blum and P. Langley. Selection of relevant features and examples in machine learning. AIJ, 1997, 97(1-2): 245-271.
I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 2003, 3: 1157-1182.
Chapter 6 of Introduction to Machine Learning (E. Alpaydin, MIT Press, 2010).
J. B. Tenenbaum, V. de Silva and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 2000, 290:2319-2322.
Nov. 19: Data Mining II: Handling Large Scale Data (Download PDF) Reading material:
M. Banko and E. Brill. Scaling to very very large corpora for natural language disambiguation. ACL'01.
J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. OSDI'04.
B. Panda, et al. PLANET: Massively parallel learning of tree ensembles with MapReduce. VLDB'09.
J. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 2002, 38(4):367-378.
J. Lin and A. Kolcz. Large-Scale Machine Learning at Twitter. SIGMOD'12.
Nov. 26: Assignment Discussion
Dec. 3: Data Mining III: Mining Linkage Data (Download PDF) Reading material:
L. Getoor and C. Diehl. Link mining: A survey. SIGKDD Explorations, 7(2):3-12, 2005.
L. Page, et al. The PageRank citation ranking: Bringing order to the web. Technical report, 1997.
Dec. 10: Data Mining IV: Information Retrieval Systems (Download PDF) Reading material:
Chapter 14 of the textbook (Principles of Data Mining).
M. Mitra, B. Chaudhuri. Information retrieval from documents: A survey. Information Retrieval 2000.
M. Lew, N. Sebe, C. Djeraba, R. Jain. Content-based multimedia information retrieval: State of the art and challenges. TOMCCAP 2006.
D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV 2004.
Y. Freund, R. Iyer, R. E. Schapire, Y. Singer. An Efficient Boosting Algorithm for Combining Preferences. JMLR.
Dec. 17: Guest Lecture by Nan Li
Dec. 24: Assignment Discussion & Data Mining V: In Computer Vision Systems (Download PDF) Reading material:
P. Viola and M. Jones. Robust Real-time Object Detection, IJCV 2001.



  • Scikit-Learn: An open-source machine learning package for Python.

  • KDnuggets: A website for data mining resources.


Major academic venues
