Data Mining Practice   

Requirement

Welcome to the real world. In this assignment, you will practice your data mining skills on a real-world task. You are expected to carry out the mining task on the provided data set, and then submit your mining results and a technical report describing how you accomplished the task.

  • Deadline for uploading the results: 23:59, Jun. 15  (Saturday), 2019

  • Deadline for uploading the report: 23:59, Jun. 18 (Tuesday), 2019

  • Scoring: This assignment contributes 40% of your final score.

 
The Mining Task

This is a real-world software mining task. You are expected to recommend suitable emojis for the messages. You are free to use any data mining methods (existing ones or novel ones that you propose) to make the recommendations as accurate as possible.

  • The story of the data

Messages are widely used in modern human communication, and plenty of emojis are available to express emotions better. However, it is not easy for a person to find a suitable emoji if (s)he is unfamiliar with emojis (think of how your parents or grandparents use them). Therefore, it is useful to design a system that automatically recommends a suitable emoji for a message. This dataset is collected from the real world and contains many messages with their corresponding emojis. By mining the relationship between messages and emojis, one may be able to construct a model that recommends emojis automatically.

  • The data to be mined

The data is split into two disjoint datasets, namely the 'train dataset' and the 'test dataset'. In this challenge you are given both the messages and the emojis in the train dataset, and only the messages in the test dataset. See the ReadMe file in the dataset zip file on Kaggle for more details.

Please download this file to see how to join the mining challenge.

 

 
The Evaluation of the Assignments

1. Performance Evaluation

The evaluation service is provided by Kaggle. See the challenge page for details of evaluation and submission. You need to arrange your prediction results in the format described in sample_submission.csv and submit them through a web page. Your predictions will be evaluated by mean F1-score.
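To check a model locally before submitting, the metric can be approximated offline. The sketch below assumes that "mean F1-score" means the unweighted average of the per-class F1 scores; the exact definition used by the challenge is the one on the Kaggle page. The function name macro_f1 and the toy labels are illustrative only.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example (hypothetical emoji ids): always predicting the majority class
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f}  macro F1={score:.3f}")
```

In this toy case the accuracy looks reasonable while the averaged F1 is low, because the two rare classes each contribute an F1 of zero.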

In the report, you should introduce your method (if you use an existing method, explain why and how you use it) and present your results, analyzing them if possible. You are required to provide a runnable, trained model that can reproduce the final result you submitted on Kaggle (that is, the submission with the highest score on the private data), together with a brief manual on how to run the model (including the data format, the environment, and anything else necessary). We will reproduce your result with your model.

 

2. Technical Report Evaluation

All submitted reports will be evaluated on their content (the description of the method used), insight (why you decided to do so), novelty (whether the method is novel or a direct application), organization (whether the content is presented in a logical way), language (whether the language is precise, concise, and easy to follow), and format (whether the format meets the requirements).

 

Of course, a high mean F1-score may lead to a high score. However, the score of your assignment does not depend only on the value of the evaluation metric. A good report on your work is also important. In short, you should put effort into both the performance of your model and the writing of your report.

Please do NOT ask me for any further information about the test data; it is a secret!

If PLAGIARISM is identified, NO SCORE will be given to the report.
 

Guidance and Suggestions

Here are some guidelines and suggestions that may help you complete this assignment.

 

What are the difficulties of this task?

  • The provided data is in the form of characters: The data includes Chinese characters, digits, English letters, and many other kinds of characters. You may try the utf_8 or utf_8_sig encoding to decode it. After decoding, you need to convert each message into a feature vector before constructing a model. You need to design carefully how to extract features that are useful for the target of the mining task.

  • The training data is multi-class: There are up to 72 emojis in total. You cannot focus only on the most frequent classes, because the evaluation score is the mean F1-score, which means all classes are equally important.
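The two difficulties above can be sketched as follows. This is only one possible starting point, not the required approach: it decodes raw bytes with utf_8_sig, counts character n-grams as simple features (which avoids the need for a Chinese word segmenter), and computes inverse-frequency class weights as one common heuristic for giving rare classes more influence. The function names, the sample message, and the integer labels are all hypothetical.

```python
from collections import Counter

def decode_message(raw: bytes) -> str:
    # utf_8_sig decodes like utf_8 but also drops a leading byte-order mark
    return raw.decode("utf_8_sig")

def char_ngrams(text: str, n_min: int = 1, n_max: int = 2) -> Counter:
    """Count every character n-gram with n_min <= n <= n_max."""
    feats = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            feats[text[i:i + n]] += 1
    return feats

def class_weights(labels) -> dict:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# Hypothetical message that starts with a UTF-8 byte-order mark
raw = b"\xef\xbb\xbf" + "今天好开心 haha".encode("utf_8")
msg = decode_message(raw)
features = char_ngrams(msg)
weights = class_weights([0, 0, 0, 1])  # class 1 is rare, so it weighs more

print(msg)
print(features.most_common(3))
print(weights)
```

The sparse counts in features would still need to be mapped to a fixed vocabulary (or hashed) to obtain vectors of equal length for every message.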

 

How to determine the method to use?

  • Using an existing method. You may try existing techniques for text mining, such as the vector space model, kNN, SVM, Naive Bayes, deep models, etc. You may also use the methods introduced in the lectures.

  • Proposing a novel method. You may propose your own method, building on several existing methods, to take the characteristics of the provided data into account. You are highly encouraged to do so!
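As an illustration of the first option, the sketch below implements a small multinomial Naive Bayes classifier over character bigrams using only the standard library. It is meant as a possible baseline to compare against, not as a recommended solution; the class name, the smoothing value alpha=1.0, the toy messages, and the integer emoji ids are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def char_bigrams(text):
    # Fall back to the whole text when it is shorter than two characters
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]

class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.feat_counts = defaultdict(Counter)  # bigram counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = char_bigrams(text)
            self.feat_counts[label].update(grams)
            self.vocab.update(grams)
        self.totals = {c: sum(cnt.values()) for c, cnt in self.feat_counts.items()}
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        v = len(self.vocab)
        best, best_score = None, -math.inf
        for c, n_c in self.class_counts.items():
            score = math.log(n_c / n_docs)       # log prior
            denom = self.totals[c] + self.alpha * v
            for g in char_bigrams(text):
                if g in self.vocab:              # ignore unseen bigrams
                    score += math.log((self.feat_counts[c][g] + self.alpha) / denom)
            if score > best_score:
                best, best_score = c, score
        return best

# Toy training set: label 0 = a laughing emoji, 1 = a crying emoji (hypothetical)
train_texts = ["太好笑了哈哈", "哈哈哈笑死", "好难过想哭", "难过得哭了"]
train_labels = [0, 0, 1, 1]
model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("哈哈好笑"), model.predict("我想哭"))
```

A novel method could start from a baseline like this and change one component at a time (features, class weighting, or the classifier itself), which also makes the comparative study in the report easier to set up.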

How to Write the Report?

  • The content of your report should include, but need not be limited to, the following points.
    (1) Brief description of your understanding of this mining task.
    (2) The motivation for selecting this specific method.
    (3) The description of the technical details of this method.
    (4) A comparative study against two other baseline methods to show that your method is appropriate for this task.
    (5) Discussions on the method and/or results, and what conclusions can be drawn.
  • You should organize your manuscript in the style of a technical paper, consisting of a title, abstract, keywords, main body, and references. Please keep the style and format the same as those of the Chinese Journal of Software.
  • Plagiarism will be severely penalized. You may refer to IEEE's "Five Levels of Plagiarism" for its definition.

 

How to Submit?

  • Please submit your report via the web before the report deadline. [submission link]

  • Please submit your results through the Result Submission Site on Kaggle. The submitted result file must be a CSV file. No other format is acceptable.

  • The content of your submitted result file should follow the format of the sample submission file. Submissions with an incorrect format will not be accepted.

  • NO submission after the deadline is acceptable!

  • NO email submission will be accepted!
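One way to reduce the risk of a format error in the result file is to copy the header and row order from the sample submission file instead of typing them by hand. The sketch below assumes that the sample file has two columns, with an identifier in the first and the prediction in the second; this layout, the function name write_submission, and the header "ID,Expected" used in the demo are hypothetical, so check the real sample_submission.csv before relying on it.

```python
import csv
import os
import tempfile

def write_submission(sample_path, out_path, predictions):
    """Copy header and row order from the sample file, filling in predictions.

    predictions maps each identifier (as a string) to the predicted label.
    """
    with open(sample_path, newline="", encoding="utf_8_sig") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    with open(out_path, "w", newline="", encoding="utf_8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in body:
            # A missing identifier raises KeyError, so gaps are caught early
            writer.writerow([row[0], predictions[row[0]]])

# Demo with a stand-in sample file (header and values are hypothetical)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_submission.csv")
out_path = os.path.join(tmp_dir, "submission.csv")
with open(sample_path, "w", newline="", encoding="utf_8") as f:
    f.write("ID,Expected\n0,0\n1,0\n")
write_submission(sample_path, out_path, {"0": 3, "1": 17})
with open(out_path, newline="", encoding="utf_8") as f:
    written = f.read()
print(written)
```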


For any questions or inquiries related to this assignment, please contact me or the TA directly.