Data Mining Practice |
Requirement |
Welcome to the real world. In this assignment, you will practice your data mining skills on a real-world task. You are expected to carry out the mining task on the provided dataset, and then submit your mining results together with a technical report describing how you accomplished the task. |
|
The Mining Task |
This is a real-world software mining task. You are expected to recommend suitable emojis for messages. You are free to use any data mining methods (either existing methods or novel ones that you propose) to make the recommendations as accurate as possible. |
|
Messages are widely used in modern human communication, and plenty of emojis are available to help express emotions. However, it is not easy for a person to find a suitable emoji if (s)he is unfamiliar with emojis (think of how your parents or grandparents use them). Therefore, it is useful to design a system that automatically recommends a suitable emoji for a message. This dataset is collected from the real world and contains many messages with their corresponding emojis. By mining the relationship between messages and emojis, one may be able to construct a model that recommends emojis automatically. |
The data is split into two disjoint datasets, namely the 'train dataset' and the 'test dataset'. In this challenge you will get both the messages and the emojis in the train dataset, and only the messages in the test dataset. See the ReadMe file in the dataset zip file on Kaggle for more details. |
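As a starting point, the idea of mining the relationship between messages and emojis can be illustrated with a very small baseline. The sketch below is a multinomial Naive Bayes classifier with add-one smoothing, written with the Python standard library only. It is not a required or recommended method for this assignment; the toy messages and the emoji label names (`cake`, `heart`, `cry`) are hypothetical, and the real data format is described in the ReadMe file.

```python
import math
from collections import Counter, defaultdict


def tokenize(message):
    # Lowercase whitespace tokenization; a real pipeline would likely do more.
    return message.lower().split()


def train_naive_bayes(messages, emojis):
    """Count how often each token co-occurs with each emoji label."""
    label_counts = Counter(emojis)
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for message, emoji in zip(messages, emojis):
        for token in tokenize(message):
            token_counts[emoji][token] += 1
            vocabulary.add(token)
    return label_counts, token_counts, vocabulary


def predict(message, model):
    """Return the emoji label with the highest smoothed log-probability."""
    label_counts, token_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior of the label
        denom = sum(token_counts[label].values()) + len(vocabulary)
        for token in tokenize(message):
            if token in vocabulary:  # tokens never seen in training are skipped
                score += math.log((token_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data standing in for the train dataset (hypothetical examples).
train_messages = ["happy birthday to you", "i love you so much",
                  "so sad today", "birthday party tonight"]
train_emojis = ["cake", "heart", "cry", "cake"]

model = train_naive_bayes(train_messages, train_emojis)
prediction = predict("happy birthday", model)
```

A baseline like this mainly gives a reference score to compare stronger methods against; the choice of features and model is left to you.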
Please download this file to see how to join the mining challenge. |
|
The Evaluation of the Assignments |
1. Performance Evaluation |
The evaluation service is provided by Kaggle. See the challenge page for details of evaluation and submission. You need to arrange your prediction results in the format described in sample_submission.csv and submit them through a web page. Your predictions will be evaluated by mean F1-score. In the report you should introduce your method (if you use an existing method, you should explain why and how you use it) and present (and, if possible, analyze) your results. You are required to provide a runnable, fully trained model that can generate the final result you submit on Kaggle (that is, the submission with your highest score on the private data), together with a brief manual on how to run your model (including the data format, the environment, and anything else that is necessary). We will reproduce your result with your model. |
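To check a model locally before submitting, it can help to compute an F1-based score on a held-out part of the train dataset. The sketch below assumes that "mean F1-score" means the unweighted average of the per-emoji F1 scores (macro averaging) over the labels present in the ground truth; this is an assumption, and the exact definition on the challenge page takes precedence. The label names are hypothetical.

```python
def f1_per_label(y_true, y_pred, label):
    """F1 score for one label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct prediction for this label
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over labels that occur in y_true."""
    labels = sorted(set(y_true))
    return sum(f1_per_label(y_true, y_pred, label) for label in labels) / len(labels)


# Hypothetical held-out labels and predictions.
y_true = ["cake", "cake", "heart", "cry"]
y_pred = ["cake", "heart", "heart", "cry"]
score = mean_f1(y_true, y_pred)
```

Because every label counts equally under this averaging, rare emojis influence the score as much as frequent ones, which is worth keeping in mind when choosing a method.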
|
2. Technical Report Evaluation |
All submitted reports will be evaluated on their content (the description of the method used), insight (why you decided to do so), novelty (whether the method is novel or a direct application of an existing one), organization (whether the contents are presented in a logical way), language (whether the language is precise, concise, and easy to follow), and format (whether the format meets the requirements). |
|
Of course, a high mean F1-score may lead to a high score for your assignment. However, the score of your assignment does not depend only on the value of the evaluation metric. A good report on your work is also important for your score. In short, you should put effort into both the performance of your model and the writing of your report. |
Please do NOT ask me for any other information about the test data; it is a secret! |
If PLAGIARISM is identified, NO SCORE will be given for the report. |
Guidance and Suggestions |
Here are some guidelines and suggestions that may help you complete this assignment. |
What are the difficulties of this task? |
|
How to determine the method to use? |
|
How to Write the Report? |
|
How to Submit? |
|
For any questions or inquiries related to this assignment, please contact me or the TA directly. |