The Rakuten Multi-modal Product Data Classification and Retrieval challenge is organized by the Rakuten Institute of Technology, the research and innovation department of the Rakuten group. This challenge focuses on two topics: large-scale multi-modal (text and image) classification and cross-modal retrieval. The goal of the multi-modal classification task is to predict each product’s 'type code' as defined in the catalog of Rakuten France. In the cross-modal retrieval task, given the text of the products, the goal is to retrieve the images corresponding to those products.
Cataloging product listings through text or image categorization is a fundamental problem for any e-commerce marketplace, with applications ranging from personalized search and recommendations to query understanding. Manual and rule-based approaches to categorization do not scale, since commercial products are organized into many, sometimes thousands of, classes. When actual users categorize product data, it has often been observed that not only the title and description text of the product is useful but also its associated images.
Challenges
In the taxonomy of Rakuten France, products sharing the same product type code share the exact same set of attribute fields and possible values. Product type codes are numbers that correspond to a generic product name: for example, 1500 is Watches, 120 is Laptops, and so on. In that sense, the type code of a product is its category label.
In the product catalog of Rakuten France, a product with the French title "Klarstein Présentoir 2 Montres Optique Fibre" is associated with an image and sometimes with an additional description. This product is categorized with a product type code of 1500, signifying watches. Other products with different titles, images, and possibly descriptions fall under the same product type code.
Given this information on the products, as in the example above, this challenge asks participating teams to build and submit systems that classify previously unseen products into their corresponding product type codes.
The main tasks for this challenge are the two described above: multi-modal classification and cross-modal retrieval.
For this challenge, Rakuten France has released approximately 99K product listings in tsv format, split into a training set (84,916 listings) and a test set (13,812 listings). The dataset consists of product titles, product descriptions, product images and their corresponding product type codes. The test set will be released towards the end of the data challenge. Participants can assume that the test set has been generated from the same data distribution as the training set.
A detailed description of the data is included in this pdf file.
Participation and Submission
Please register at this link to participate. Submission is team-based: only the team leader can submit the prediction file, and the same person cannot be the team leader of multiple teams. There is no limit on team size. We will send you the details about downloading the data file after we receive your sign-up.
Participants need to provide a prediction output file in the same format as the training output file (associating each Integer_id, Image_id and Product_id tuple with the predicted Prdtypecode). The first line of this test output file should contain the header. Please DO NOT change the order of the test titles in your submission file.
Detailed instructions on the submission format are included in this pdf file.
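As an illustration of the submission layout described above, here is a minimal Python sketch that writes a prediction file with a header line and one row per test item, in the original order. The four column names come from the description above; the tab delimiter (inferred from the tsv format of the released data), the helper name `write_predictions`, and all identifier values are assumptions for illustration only. The pdf instructions remain the authoritative reference.

```python
import csv
import io

# Column names taken from the submission description above; the tab
# delimiter is an assumption based on the tsv format of the released data.
HEADER = ["Integer_id", "Image_id", "Product_id", "Prdtypecode"]


def write_predictions(rows, predictions, out):
    """Write one line per test item, preserving the original item order.

    rows: list of (Integer_id, Image_id, Product_id) tuples in test-set order.
    predictions: list of predicted type codes, aligned with rows.
    out: a writable text file object.
    """
    if len(rows) != len(predictions):
        raise ValueError("need exactly one prediction per test item")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)  # the first line must contain the header
    for (integer_id, image_id, product_id), code in zip(rows, predictions):
        writer.writerow([integer_id, image_id, product_id, code])


# Hypothetical identifiers and predictions, for illustration only.
test_rows = [(0, 1001, 5001), (1, 1002, 5002)]
predicted_codes = [1500, 120]

buffer = io.StringIO()
write_predictions(test_rows, predicted_codes, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

Writing rows by iterating over the test items in their given order, rather than sorting or grouping by predicted class, is one simple way to respect the requirement not to reorder the test titles.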
Accepted contributions will be presented during the eCom Workshop in SIGIR 2020.
Evaluation Metric
This challenge involves many classes with highly asymmetric numbers of samples, so an item-weighted metric used to rank participants would not reveal the deficiencies of the classification algorithms. The classification leaderboards therefore report the macro-F1 score, and the retrieval leaderboards report Recall@1. The evaluation script is included in the downloadable data file.
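To make the point about asymmetric classes concrete, the following Python sketch contrasts an item-weighted score (plain accuracy) with a class-averaged score (macro-F1) on toy data. This is not the official evaluation script, whose details may differ; the function names and the toy labels are assumptions chosen for illustration.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {class: F1} for every class present in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # pred was wrongly assigned to this item
            fn[truth] += 1  # the true class was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def accuracy(y_true, y_pred):
    """Item-weighted score: every item counts equally."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: 8 items of a frequent class (1500) and 2 of a rare class (120).
# The classifier always predicts the frequent class.
y_true = [1500] * 8 + [120] * 2
y_pred = [1500] * 10

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.4f}")  # accuracy = 0.8000
print(f"macro-F1 = {mf1:.4f}")  # macro-F1 = 0.4444
```

In this toy case the classifier never recognizes the rare class, yet accuracy still looks respectable at 0.80. Macro-F1 averages the per-class scores (about 0.89 and 0.00) and drops to about 0.44, which exposes the failure.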
Stage 1 (April 20 - July 15)
Participants build and test models on the training data. The leaderboard only shows the model performance on a SUBSET of the test set according to your LATEST submission. Each team can submit at most 4 times per day in this stage. The leaderboard freezes on July 15 at 5 PM PDT, i.e. 00:00:00 UTC the next day.
Stage 2 - Model Evaluation (July 15 - July 23)
The final leaderboard will freeze on July 23 at 5 PM PDT, i.e. 00:00:00 UTC the next day, and will show the model performance on the remaining held-out test set according to your LATEST submission. In this stage each team can submit at most 8 times while the evaluation is open, and there must be a period of at least 24 hours between two submissions.
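The freeze times and the 24-hour spacing rule can be checked with a short Python sketch. It assumes PDT is a fixed UTC-7 offset and takes the year 2020 from the workshop date; the helper `can_submit` and the sample timestamps are hypothetical and only illustrate the arithmetic, not how the leaderboard actually enforces the rule.

```python
from datetime import datetime, timedelta, timezone

# PDT is UTC-7; a fixed offset is used instead of a time zone database lookup.
PDT = timezone(timedelta(hours=-7), "PDT")

stage1_freeze = datetime(2020, 7, 15, 17, 0, tzinfo=PDT)
stage2_freeze = datetime(2020, 7, 23, 17, 0, tzinfo=PDT)

print(stage1_freeze.astimezone(timezone.utc))  # 2020-07-16 00:00:00+00:00
print(stage2_freeze.astimezone(timezone.utc))  # 2020-07-24 00:00:00+00:00

# Stage 2 rule: at least 24 hours between two submissions of the same team.
MIN_GAP = timedelta(hours=24)


def can_submit(last_submission, now):
    """Hypothetical check: True if no earlier submission or the gap is >= 24 h."""
    return last_submission is None or now - last_submission >= MIN_GAP


# Hypothetical submission times, for illustration only.
previous = datetime(2020, 7, 20, 12, 0, tzinfo=timezone.utc)
print(can_submit(previous, datetime(2020, 7, 21, 11, 59, tzinfo=timezone.utc)))  # False
print(can_submit(previous, datetime(2020, 7, 21, 12, 0, tzinfo=timezone.utc)))   # True
```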
System Description Paper
System description papers will be peer reviewed (single-blind) by the program committee. All submissions must be formatted according to the latest ACM SIG proceedings template. There is no specific constraint on the content, but it should cover the implementation details: data preprocessing, including token normalization and feature extraction; any additional data used from external sources; model descriptions, including specific implementations and parameter tuning; and error analysis, if any. The suggested paper length is 2-4 pages, and parameter tuning settings or similar information can be moved to an Appendix section.
Instructions for submitting the system description paper should be available soon. The deadline for paper submission is July 17, 2020 (11:59 P.M. UTC).
Timeline

Date | Event |
---|---|
April 20 | Registration Opens, Evaluation Stage 1 Starts! |
July 15 | Leaderboard for Stage 1 Closes (Registration closes) |
July 15 | Evaluation Stage 2 Starts |
July 23 | Evaluation Stage 2 Closes (Final Leaderboard) |
July 30 | eCom Full Day Workshop |
July 17 | System Description Paper Submission (Suggested paper length 2-4 pages with separate Appendix) |
July 23 | Paper Acceptance Notification |
July 27 | Camera Ready Version of Papers Due (Updated with your final methodology and results) |
If you have any questions, please contact Hesam Amoualian (hesam.amoualian@rakuten.com) or Parantapa Goswami (parantapa.goswami@rakuten.com).

Stage 1 leaderboard: multi-modal classification

Rank | Team Name | Last Submission | Macro-F1 score |
---|---|---|---|
1 | Transformers | 2020 Jul 13 06:53:43 | 91.94 |
2 | zenit84 | 2020 Jun 27 17:33:34 | 91.63 |
3 | Alto | 2020 Jul 14 15:40:57 | 91.63 |
4 | Beantown | 2020 Jul 15 23:47:03 | 90.89 |
5 | Synerise AI | 2020 Jul 07 13:45:38 | 89.72 |
6 | pa_curis | 2020 Jul 15 17:38:26 | 89.65 |
7 | RIT-Paris Baseline | 2020 Jul 15 10:48:14 | 87.05 |
8 | tester | 2020 Jun 24 17:53:50 | 86.94 |
9 | testers | 2020 Jul 08 16:19:24 | 85.87 |
10 | MMG_AI_TEAM | 2020 Jul 15 11:27:08 | 84.81 |
11 | DeepData | 2020 Jun 18 09:11:58 | 84.32 |
12 | overfiTTers | 2020 Jul 12 12:57:20 | 81.9 |
13 | Team MLG | 2020 Jul 15 05:01:11 | 65.8 |
14 | qrudraksh | 2020 May 27 20:03:26 | 58.0 |
15 | 7ate9 | 2020 Jun 10 04:39:36 | 53.29 |

Stage 2 (final) leaderboard: multi-modal classification

Rank | Team Name | Last Submission | Macro-F1 score |
---|---|---|---|
1 | pa_curis | 2020 Jul 21 10:25:24 | 91.44 |
2 | Alto | 2020 Jul 23 21:35:59 | 90.87 |
3 | Transformers | 2020 Jul 23 16:23:52 | 90.53 |
4 | zenit84 | 2020 Jul 23 22:22:36 | 90.39 |
5 | Beantown | 2020 Jul 22 03:58:44 | 90.22 |
6 | Synerise AI | 2020 Jul 17 04:42:21 | 89.78 |
7 | MMG_AI_TEAM | 2020 Jul 22 13:22:00 | 86.94 |
8 | RIT-Paris Baseline | 2020 Jul 18 09:19:53 | 85.36 |
9 | Team MLG | 2020 Jul 17 01:22:23 | 64.48 |

Stage 1 leaderboard: cross-modal retrieval

Rank | Team Name | Last Submission | Recall@1 score |
---|---|---|---|
1 | Synerise AI | 2020 Jul 01 12:30:48 | 50.23 |
2 | changer | 2020 Jul 12 11:14:59 | 46.85 |
3 | pa_curis | 2020 May 27 09:34:52 | 41.89 |
4 | Beantown | 2020 Jul 15 20:38:29 | 38.96 |
5 | Alto | 2020 Jul 03 01:49:35 | 38.29 |
6 | MMG_AI_TEAM | 2020 Jul 15 11:29:07 | 27.25 |
7 | kenneth | 2020 Jun 16 20:12:10 | 1.35 |

Stage 2 (final) leaderboard: cross-modal retrieval

Rank | Team Name | Last Submission | Recall@1 score |
---|---|---|---|
1 | Synerise AI | 2020 Jul 17 19:04:22 | 34.28 |
2 | changer | 2020 Jul 21 16:59:52 | 31.93 |
3 | Beantown | 2020 Jul 22 17:05:28 | 23.3 |
4 | Alto | 2020 Jul 20 23:40:02 | 19.99 |
5 | pa_curis | 2020 Jul 23 13:19:58 | 19.74 |
6 | MMG_AI_TEAM | 2020 Jul 20 12:39:02 | 15.77 |
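
For readers unfamiliar with the Recall@1 column in the retrieval leaderboards, the Python sketch below shows one common way to compute it. It assumes that each product text has exactly one correct image and that a system returns a single top-ranked image per text; the official evaluation script may define the metric differently. The function name and all identifiers are hypothetical.

```python
def recall_at_1(top_ranked, relevant):
    """Fraction of text queries whose top-ranked image is the correct one.

    top_ranked: {query_id: image_id ranked first by the system}
    relevant:   {query_id: the single correct image_id} (assumed one per query)
    Queries with no prediction count as misses.
    """
    hits = sum(top_ranked.get(query) == image for query, image in relevant.items())
    return hits / len(relevant)


# Hypothetical ground truth and system output, for illustration only.
ground_truth = {"text_0": "img_a", "text_1": "img_b", "text_2": "img_c", "text_3": "img_d"}
system_top1 = {"text_0": "img_a", "text_1": "img_c", "text_2": "img_c"}

score = recall_at_1(system_top1, ground_truth)
print(f"Recall@1 = {100 * score:.2f}")  # Recall@1 = 50.00
```

Here two of the four text queries retrieve the correct image first, so Recall@1 is 50.00 when expressed as a percentage, the same scale the leaderboards appear to use.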