Mercari Price Suggestion Challenge: A Machine Learning Case Study

Suggesting a reasonable price for a product is an important task for any online marketplace. Mercari is one such online selling app, powered by one of the biggest communities in Japan, where users can sell used products across multiple categories. It is similar to Quikr in India.

The community wants to offer price suggestions to sellers so that they know what a product is worth before they sell it. This is not an easy task, because sellers are allowed to list any category of product, or any bundle of items, on the app, and the price depends on many factors such as the brand name, how much the product has been used, its condition, and its age.

Contents

  1. Business problem
  2. Source of data
  3. ML problem
  4. Data cleaning
  5. Existing approaches
  6. Improvements
  7. EDA
  8. Featurization
  9. First cut solution
  10. Model explanation
  11. Conclusion
  12. Future work
  13. References

1. Business problem

The Mercari community wants to build a machine learning model that suggests the right price for a product based on the attributes provided by the seller.

The attributes are product name, product description, brand name, category, item condition, and shipping information (whether the shipping charge is paid by the seller or the buyer).

2. Source of data

The data is available on Kaggle as part of the Mercari Price Suggestion Challenge.
The dataset consists of the following features.

  • train_id — the id of the listing
  • name — the title of the listing
  • item_condition_id — the condition of the item as provided by the seller, from 1 (‘new’) to 5 (‘poor’)
  • category_name — category of the listing
  • brand_name
  • price — the price that the item was sold for. This is the target variable that we will predict. The unit is USD.
  • shipping — 1 if the shipping fee is paid by the seller, 0 if it is paid by the buyer
  • item_description — the full description of the item.

3. ML problem

Because the target variable, price, is a real-valued number, this is a regression problem.

The metric used for evaluating a model is Root Mean Squared Logarithmic Error (RMSLE).
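
As a rough illustration of the metric, RMSLE is the square root of the mean squared difference between log(prediction + 1) and log(actual + 1). The sketch below is a minimal NumPy version; the function name rmsle and the sample prices are illustrative assumptions and are not taken from the original write-up.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error between actual and predicted prices.

    Illustrative helper (the name is an assumption, not from the article).
    Both inputs must be non-negative so that log(1 + x) is defined.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(1 + x), which keeps a price of 0 well defined.
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))


# Made-up prices in USD, purely for demonstration.
actual = [10.0, 25.0, 100.0]
predicted = [12.0, 20.0, 90.0]
score = rmsle(actual, predicted)
print(f"RMSLE: {score:.4f}")
```

Because the error is computed on a log scale, the metric penalizes relative rather than absolute differences, which suits prices that span a wide range.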

4. Data cleaning

Handling missing values

It is necessary to check whether there are any null values in the dataset. In this dataset, the brand_name, category_name, and item_description features have null values, and these have to be replaced with an appropriate substitute. Here I have replaced nulls in category_name and brand_name with the word ‘missing’, and nulls in item_description with ‘No description yet’. There are other ways to replace null values as well, such as substituting the most frequently occurring value in that feature.
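
The null check and the substitutions described above can be sketched with pandas roughly as follows. This is a minimal sketch, not the author's exact code: the helper name fill_missing and the three-row sample DataFrame are hypothetical and stand in for the real training data.

```python
import pandas as pd


def fill_missing(df):
    """Return a copy of df with nulls replaced as described in the text.

    Hypothetical helper: the column names match the dataset, the function
    name is an assumption made for this example.
    """
    df = df.copy()
    df["category_name"] = df["category_name"].fillna("missing")
    df["brand_name"] = df["brand_name"].fillna("missing")
    df["item_description"] = df["item_description"].fillna("No description yet")
    return df


# Tiny made-up sample standing in for the real training data.
sample = pd.DataFrame(
    {
        "category_name": ["Men/Tops/T-shirts", None, "Electronics/Phones"],
        "brand_name": [None, "Nike", None],
        "item_description": ["Worn once", None, "Brand new in box"],
    }
)

# Count nulls per column before cleaning.
null_counts = sample.isnull().sum()
print(null_counts)

cleaned = fill_missing(sample)
print(cleaned)
```

For the alternative mentioned above, the most frequent value of a column can be obtained with `df[col].mode()` and passed to `fillna` instead of a fixed placeholder string.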