Rakuten Data Challenge

Taxonomy Classification for eCommerce-scale Product Catalogs


About


The SIGIR eCom Data Challenge is organized by Rakuten Institute of Technology Boston (RIT-Boston), a dedicated R&D organization for the Rakuten group. This challenge focuses on large-scale taxonomy classification, where the goal is to predict each product’s category, as defined in the taxonomy tree, given the product's title. The cataloging of product listings through taxonomy categorization is a fundamental problem for any e-commerce marketplace, with applications ranging from personalized search recommendations to query understanding. For example, in the Rakuten.com catalog, “Dr. Martens Air Wair 1460 Mens Leather Ankle Boots” is categorized under the “Clothing, Shoes & Accessories -> Shoes -> Men -> Boots” leaf. However, manual and rule-based approaches to categorization are not scalable, since commercial product taxonomies are organized in tree structures with three to ten levels of depth and thousands of leaf nodes. Advances in this area of research have been limited due to the lack of real data from actual commercial catalogs. The challenge presents several interesting research aspects due to the intrinsically noisy nature of the product labels, the size of modern eCommerce catalogs, and the typically unbalanced data distribution.

Participation and Data

As part of this challenge, Rakuten will be releasing 1M product listings in tsv format, including the train (0.8M) and test set (0.2M), consisting of product titles and their corresponding category ID paths. The following are some examples from the training set:
Title                                                             CategoryIdPath
Replacement Viewsonic VG710 LCD Monitor 48Watt AC Adapter 12V 4A  3292>114>1231
Ka-Bar Desert MULE Serrated Folding Knife                         4238>321>753>3121
5.11 TACTICAL 74280 Taclite TDU Pants, R/M, Dark Navy             4015>3285>1443>20
Skechers 4lb S Grip Jogging Weight set of 2- Black                2075>945>2183>3863

The test set contains only the title field, and the goal is to predict the CategoryIdPath for each title. To participate, please fill out the sign-up form (sign up here). Accepted contributions will be presented during the eCom Workshop in SIGIR 2018.
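
The tsv layout described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function name `read_listings` is hypothetical, and it assumes each line holds exactly one tab between title and CategoryIdPath, that the file has no header row, and that quote characters in titles are literal text.

```python
import csv
import io
from typing import Iterator, List, Tuple


def read_listings(handle) -> Iterator[Tuple[str, List[int]]]:
    """Yield (title, category_ids) pairs from a tab-separated file object.

    Assumption: two columns per line, title first, CategoryIdPath second.
    """
    # QUOTE_NONE keeps quote characters in titles (e.g. inch marks) as plain text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        title, path = row
        # "4238>321>753>3121" -> [4238, 321, 753, 3121], root first, leaf last
        yield title, [int(part) for part in path.split(">")]


# Two rows taken from the examples above, joined with a tab as assumed.
sample = io.StringIO(
    "Ka-Bar Desert MULE Serrated Folding Knife\t4238>321>753>3121\n"
    "Replacement Viewsonic VG710 LCD Monitor 48Watt AC Adapter 12V 4A\t3292>114>1231\n"
)
listings = list(read_listings(sample))
```

For a file on disk, the same function can be given a handle opened with `open(path, newline="", encoding="utf-8")`; the encoding is also an assumption.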

Submission

The submission is team-based, so only the team leader can submit the prediction file. We will send you the details about file submission after we receive your sign-up. There is no limit on maximum team size. The prediction file must be in the same tsv format as the training file, where the first column is the product title from the test set and the second column is your predicted CategoryIdPath. Please DO NOT change the order of the test titles in your submission file.
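
One way to keep the test titles in their original order is to write predictions in a single pass over the test titles. The sketch below is illustrative only: `write_predictions` and `dummy_predict` are hypothetical names, and `dummy_predict` is a placeholder standing in for a real model.

```python
import io
from typing import Callable, Iterable


def write_predictions(titles: Iterable[str], predict: Callable[[str], str], out) -> int:
    """Write one 'title<TAB>CategoryIdPath' line per title, in input order."""
    count = 0
    for title in titles:
        out.write(f"{title}\t{predict(title)}\n")
        count += 1
    return count


def dummy_predict(title: str) -> str:
    # Placeholder rule, not a real classifier.
    return "4238>321>753>3121" if "Knife" in title else "3292>114>1231"


test_titles = [
    "Replacement Viewsonic VG710 LCD Monitor 48Watt AC Adapter 12V 4A",
    "Ka-Bar Desert MULE Serrated Folding Knife",
]
buffer = io.StringIO()
written = write_predictions(test_titles, dummy_predict, buffer)
```

Because the titles are never sorted, shuffled, or deduplicated, the output keeps the row order of the test file.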

Evaluation (Updated!)

The evaluation metrics will be weighted-{precision, recall, F1} (reference) on the test set, computed on EXACT CategoryIdPath match. Since the product distribution over the taxonomy tree is highly imbalanced, weighted-{precision, recall, F1} are more appropriate than macro- or micro-{precision, recall, F1}. Please note that a partial path match does not count as a correct prediction. An evaluation script is provided here and its usage is shown below. Both PREDICTION_FILE and GOLD_FILE must be in the same tsv format as the training file, where the first column is the product title and the second column is the CategoryIdPath. The title order must be the same in both files.

$ python eval.py -pred $PREDICTION_FILE -gold $GOLD_FILE

Stage 1 - Model Building (April 9 - June 23)
Participants build and test models on the training data. The leaderboard only shows the model performance on a SUBSET of the test set according to your LATEST submission. Each team can submit at most 3 times per day (UTC time) in this stage, and the leaderboard will update every 15 minutes.

Stage 2 - Model Evaluation (June 24)
The final leaderboard will freeze on June 24 and show the model performance on the ENTIRE test set according to your LATEST submission.
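
To make the weighted metrics from the Evaluation section concrete, here is a minimal sketch of support-weighted precision, recall, and F1 over exact CategoryIdPath strings. It is not the official eval.py; the function name `weighted_prf` is hypothetical, and the sketch assumes that each class is weighted by its number of gold instances and that a class that is never predicted gets precision 0.

```python
from collections import Counter
from typing import Sequence, Tuple


def weighted_prf(gold: Sequence[str], pred: Sequence[str]) -> Tuple[float, float, float]:
    """Support-weighted precision, recall and F1 on exact path match."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length and order")
    support = Counter(gold)    # gold instances per path
    predicted = Counter(pred)  # predictions per path
    # A prediction is correct only if the full path string matches.
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    total = len(gold)
    precision = recall = f1 = 0.0
    for label, n in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n / total  # the class's share of the gold labels
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1


gold = ["1>2>3", "1>2>3", "1>2>4", "5>6"]
pred = ["1>2>3", "1>2>4", "1>2>4", "5>7"]  # "5>7" is a partial match: counted as wrong
p, r, f = weighted_prf(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Under this weighting, recall reduces to the fraction of titles whose full path is predicted exactly, which is why frequent categories dominate the score.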

System Description Paper (New!)

System description papers will be peer reviewed (single-blind) by the program committee. All submissions must be formatted according to the latest ACM SIG proceedings template available at http://www.acm.org/publications/proceedings-template (LaTeX users use sample-sigconf.tex as a template). There is no specific constraint on the content, but it should cover the implementation details: data preprocessing (including token normalization and feature extraction), any additional data used from external sources, model descriptions (including specific implementations, parameter tuning, etc.), and error analysis, if any.

System description papers should be submitted at https://easychair.org/conferences/?conf=ecom18dc. The deadline for paper submission is June 8, 2018 (11:59 P.M. UTC).

Timeline (Updated!)

When?     What?
April 09  Evaluation Stage 1 Starts!
May 15    Data Challenge Registration Deadline
June 08   System Description Paper Submission (Suggested paper length 4-8 pages)
June 15   Paper Acceptance Notification
June 24   Evaluation Stage 2 (Final Leaderboard)
July 06   Camera Ready Version of Papers Due
July 12   eCom Full Day Workshop
Note: The timeline is subject to slight modifications.
If you have any questions, please contact Yiu-Chang Lin (yiuchang.lin@rakuten.com).

Data Challenge Checklist

  • Register for the data challenge: Registration now open!

  • Get download links for the dataset - 1 link each for training and testing data (You will receive an email with these links within 24 hrs of registration)

  • Send team details to Yiu-Chang Lin - yiuchang.lin@rakuten.com (this is necessary to receive a team specific submission link)
  • Receive the test set submission link for your team (You will receive an email with the link within 24 hrs of sending team details)

  • Join the data challenge slack channel (email with details will be sent after you register)