Rakuten Data Challenge

Multimodal Product Classification and Retrieval

Data Challenge Checklist

  • Registration is now closed!
  • Join the data challenge slack channel: DC Slack

  • Click here for the final Scoreboards


    The Rakuten Multi-modal Product Data Classification and Retrieval challenge is organized by Rakuten Institute of Technology, the research and innovation department of the Rakuten Group. This challenge focuses on two topics: large-scale multi-modal (text and image) classification and cross-modal retrieval. The goal of the multi-modal classification task is to predict each product’s 'type code' as defined in the catalog of Rakuten France. In the cross-modal retrieval task, given the text of the products, the goal is to retrieve the images corresponding to those products.

    Cataloging product listings through text or image categorization is a fundamental problem for any e-commerce marketplace, with applications ranging from personalized search and recommendations to query understanding. Manual and rule-based approaches to categorization do not scale, since commercial products are organized into many, sometimes thousands of, classes. It has often been observed that when users categorize product data, not only the title and description text of a product is useful but also its associated images.


    In the taxonomy of Rakuten France, products sharing the same product type code share exactly the same array of attribute fields and possible values. Product type codes are numbers that map to a generic product name: for example, 1500 is Watches, 120 is Laptops, and so on. In that sense, the type code of a product is its category label.

    In the product catalog of Rakuten France, a product with the French title Klarstein Présentoir 2 Montres Optique Fibre is associated with an image and sometimes with an additional description. This product is categorized under product type code 1500, signifying watches. Other products, with different titles, images, and possibly descriptions, fall under the same product type code.

    Given this information about products, as in the example above, this challenge invites participating teams to build and submit systems that classify previously unseen products into their corresponding product type codes.

    The main tasks for this challenge are as follows:

    1. Multi-modal classification. Given a training set of products and their product type codes, predict the corresponding product type codes for an unseen, held-out test set of products. Systems are free to use the textual titles and/or descriptions whenever available, and additionally the images, to allow for true multi-modal learning.
    2. Cross-modal retrieval. Given a held-out test set of product items with their titles and (possibly empty) descriptions, predict the best image, from among a set of test images, corresponding to each product in the test set.


    For this challenge, Rakuten France has released approximately 99K product listings in TSV format, split into a training set (84,916 listings) and a test set (13,812 listings). The dataset consists of product titles, product descriptions, product images, and their corresponding product type codes. The test set will be released towards the end of the data challenge. Participants can assume the test set has been generated from the same data distribution as the training set.

    A detailed description of the data is included in this pdf file.

    Participation and Submission

    Please register at this link to participate. Submission is team-based: only the team leader can submit the prediction file, and the same person cannot be the team leader of multiple teams. There is no limit on team size. We will send you the details about downloading the data file after we receive your sign-up.

    Participants need to provide a prediction output file in the same format as the training output file, associating each (Integer_id, Image_id, Product_id) tuple with the predicted Prdtypecode. The first line of this test output file should contain the header. Please DO NOT change the order of the test titles in your submission file.
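    As a rough sketch, a submission file matching the description above could be assembled as follows. The output file name is a placeholder and the exact column order is an assumption inferred from the tuple described here, so check the detailed format instructions in the accompanying pdf before submitting.

```python
import csv

# Hypothetical output path; the required file name may differ.
PRED_FILE = "predictions.tsv"

def write_predictions(test_rows, predict):
    """Write one predicted Prdtypecode per test row, preserving test-set order."""
    with open(PRED_FILE, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        # Header line, required as the first line of the submission file.
        writer.writerow(["Integer_id", "Image_id", "Product_id", "Prdtypecode"])
        for row in test_rows:  # iterate in the original test-set order
            writer.writerow([row["Integer_id"], row["Image_id"],
                             row["Product_id"], predict(row)])
```

    Here `predict` stands in for a trained classifier; the essential points are the header line and the unchanged row order.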

    Detailed instructions on the submission format are included in this pdf file.

    Accepted contributions will be presented during the eCom Workshop at SIGIR 2020.

    Evaluation Metric

    Since this challenge involves many classes with highly imbalanced numbers of samples, an item-weighted metric would not reveal the deficiencies of the classification algorithms, so participants are ranked using class-averaged and per-sample metrics instead. The evaluation script is included in the downloadable data file.

    • Task 1: We will use the macro-F1 score to evaluate product type code classification on held-out test samples. The score is the arithmetic average of the per-product-type-code F1 scores.
    • Task 2: For the cross-modal retrieval task, systems will be evaluated on recall at 1 (R@1) on held-out test samples. The score is the average over samples of a per-sample score of 1 if the returned image matches the title and 0 otherwise.
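    The official evaluation script ships with the data; purely as an illustration of the two definitions above, the metrics can be sketched in plain Python:

```python
def macro_f1(y_true, y_pred):
    """Arithmetic mean of per-class F1 over all classes present in y_true."""
    f1_scores = []
    for c in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall.
        f1_scores.append(2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return sum(f1_scores) / len(f1_scores)

def recall_at_1(true_images, retrieved_images):
    """Fraction of samples whose top-ranked image is the correct one."""
    hits = sum(1 for t, r in zip(true_images, retrieved_images) if t == r)
    return hits / len(true_images)
```

    Note that macro-F1 weights every product type code equally regardless of its sample count, which is why it is preferred here over an item-weighted score.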
    Stage 1 - Model Building (April 20 - July 15)

    Participants build and test models on the training data. The leaderboard shows model performance on a SUBSET of the test set according to your LATEST submission. Each team can submit at most 4 times per day in this stage. The leaderboard freezes on July 15 at 5 PM PDT, i.e. UTC 00:00:00 the next day.

    Stage 2 - Model Evaluation (July 15 - July 23)

    The final leaderboard will freeze on July 23 at 5 PM PDT, i.e. UTC 00:00:00 the next day, and will show model performance on the remaining held-out test set according to your LATEST submission. In this stage each team can submit at most 8 times while the evaluation is open, with at least 24 hours between consecutive submissions.

    System Description Paper

    System description papers will be peer reviewed (single-blind) by the program committee. All submissions must be formatted according to the latest ACM SIG proceedings template. There is no specific constraint on the content, but it should cover implementation details such as data preprocessing (including token normalization, feature extraction, and any additional data used from external sources), model descriptions (including specific implementations, parameter tuning, etc.), and error analysis, if any. The suggested paper length is 2-4 pages; parameter tuning settings and similar information can be moved to an appendix.

    Submissions to SIGIR eCom should be made at https://easychair.org/my/conference?conf=sigirecom20dc

    Instructions for submitting the system description paper will be available soon. The deadline for paper submission is July 17, 2020 (11:59 P.M. UTC).

    April 20 Registration Opens, Evaluation Stage 1 Starts!
    July 15 Leaderboard for Stage 1 Closes (Registration closes)
    July 15 Evaluation Stage 2 Starts
    July 23 Evaluation Stage 2 Closes (Final Leaderboard)
    July 30 eCom Full Day Workshop

    Data Challenge Paper Submission Timeline
    July 17 System Description Paper Submission (Suggested paper length 2-4 pages with separate Appendix)
    July 23 Paper Acceptance Notification
    July 27 Camera Ready Version of Papers Due (Updated with your final methodology and results)

    If you have any question, please contact Hesam Amoualian (hesam.amoualian@rakuten.com) or Parantapa Goswami (parantapa.goswami@rakuten.com).

    Phase 1 Scoreboard
    Task 1: Multimodal Classification
    Rank Team Name Last Submission Macro-F1 score
    1 Transformers 2020 Jul 13 06:53:43 91.94
    2 zenit84 2020 Jun 27 17:33:34 91.63
    3 Alto 2020 Jul 14 15:40:57 91.63
    4 Beantown 2020 Jul 15 23:47:03 90.89
    5 Synerise AI 2020 Jul 07 13:45:38 89.72
    6 pa_curis 2020 Jul 15 17:38:26 89.65
    7 RIT-Paris Baseline 2020 Jul 15 10:48:14 87.05
    8 tester 2020 Jun 24 17:53:50 86.94
    9 testers 2020 Jul 08 16:19:24 85.87
    10 MMG_AI_TEAM 2020 Jul 15 11:27:08 84.81
    11 DeepData 2020 Jun 18 09:11:58 84.32
    12 overfiTTers 2020 Jul 12 12:57:20 81.9
    13 Team MLG 2020 Jul 15 05:01:11 65.8
    14 qrudraksh 2020 May 27 20:03:26 58.0
    15 7ate9 2020 Jun 10 04:39:36 53.29
    Task 2: Cross-modal Retrieval
    Rank Team Name Last Submission Recall@1 score
    1 Synerise AI 2020 Jul 01 12:30:48 50.23
    2 changer 2020 Jul 12 11:14:59 46.85
    3 pa_curis 2020 May 27 09:34:52 41.89
    4 Beantown 2020 Jul 15 20:38:29 38.96
    5 Alto 2020 Jul 03 01:49:35 38.29
    6 MMG_AI_TEAM 2020 Jul 15 11:29:07 27.25
    7 kenneth 2020 Jun 16 20:12:10 1.35
    Phase 2 Scoreboard
    Task 1: Multimodal Classification
    Rank Team Name Last Submission Macro-F1 score
    1 pa_curis 2020 Jul 21 10:25:24 91.44
    2 Alto 2020 Jul 23 21:35:59 90.87
    3 Transformers 2020 Jul 23 16:23:52 90.53
    4 zenit84 2020 Jul 23 22:22:36 90.39
    5 Beantown 2020 Jul 22 03:58:44 90.22
    6 Synerise AI 2020 Jul 17 04:42:21 89.78
    7 MMG_AI_TEAM 2020 Jul 22 13:22:00 86.94
    8 RIT-Paris Baseline 2020 Jul 18 09:19:53 85.36
    9 Team MLG 2020 Jul 17 01:22:23 64.48
    Task 2: Cross-modal Retrieval
    Rank Team Name Last Submission Recall@1 score
    1 Synerise AI 2020 Jul 17 19:04:22 34.28
    2 changer 2020 Jul 21 16:59:52 31.93
    3 Beantown 2020 Jul 22 17:05:28 23.3
    4 Alto 2020 Jul 20 23:40:02 19.99
    5 pa_curis 2020 Jul 23 13:19:58 19.74
    6 MMG_AI_TEAM 2020 Jul 20 12:39:02 15.77
    Data Challenge Organizers

    Hesam Amoualian      Rakuten Institute of Technology, Paris
    Parantapa Goswami    Rakuten Institute of Technology, Paris
    Laurent Ach          Rakuten Institute of Technology, Paris
    Pradipto Das         Rakuten Institute of Technology, Americas
    Pablo Montalvo       Rakuten Institute of Technology, Paris