Suggesting a reasonable price for a product is an important task for any online shopping platform. Mercari is one such online selling app, powered by one of the biggest communities in Japan, where users can sell used products across multiple categories. It is similar to Quikr in India.
Mercari wants to offer price suggestions to sellers so that they know what a product is worth before they sell it. This is not an easy task, because sellers can list products of any category, or any bundle of items, and the price depends on many factors such as the brand name, how much the product has been used, its condition, and its age.
- Business problem
- Source of data
- ML problem
- Data cleaning
- Existing approaches
- First cut solution
- Model explanation
- Future work
1. Business problem
Mercari wants to build a machine learning model that suggests the right price for a product based on the attributes the seller provides.
The attributes are the product name, product description, brand name, category, item condition, and shipping information (whether the shipping charge is paid by the seller or the buyer).
2. Source of data
The data is available on Kaggle as part of the Mercari Price Suggestion Challenge.
The dataset consists of the following features:
- train_id — the id of the listing
- name — the title of the listing
- item_condition_id — the condition of the item as provided by the seller, on a scale from 1 to 5, where 1 is ‘new’ and 5 is ‘poor’
- category_name — category of the listing
- brand_name — the brand of the item
- price — the price that the item was sold for. This is the target variable that we will predict. The unit is USD.
- shipping — 1 if the shipping fee is paid by the seller and 0 if it is paid by the buyer
- item_description — the full description of the item.
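As a rough sketch of how data with these columns could be loaded and inspected with pandas, the snippet below reads a tab-separated file into a DataFrame. The two sample rows and the helper name load_listings are made up for illustration and are not taken from the real dataset. I am assuming the real file is tab-separated, in which case a file path could be passed in place of the in-memory sample.

```python
import io

import pandas as pd

# Hypothetical two-row sample laid out with the columns described above.
# The values are invented for illustration only.
SAMPLE_TSV = (
    "train_id\tname\titem_condition_id\tcategory_name\tbrand_name"
    "\tprice\tshipping\titem_description\n"
    "0\tBlue cotton t-shirt\t3\tMen/Tops/T-shirts\t\t10.0\t1\tNo description yet\n"
    "1\tMechanical keyboard\t2\tElectronics/Computers/Keyboards\tAcme"
    "\t52.0\t0\tWorks well, light wear\n"
)


def load_listings(source):
    """Read a tab-separated listings file (path or file-like object)."""
    return pd.read_csv(source, sep="\t")


df = load_listings(io.StringIO(SAMPLE_TSV))
print(df.dtypes)          # column types inferred by pandas
print(df.isnull().sum())  # number of null values per column
```

Empty fields are read as nulls by default, which is why the first sample row shows up with a missing brand_name.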
3. ML problem
As the target variable is a real-valued number, the given problem is a regression problem.
The metric used for evaluating the model is the Root Mean Squared Logarithmic Error (RMSLE).
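RMSLE compares the logarithms of the predicted and actual prices, so it measures relative error rather than absolute error: RMSLE = sqrt((1/n) * sum((log(p_i + 1) - log(a_i + 1))^2)), where p_i is the predicted price and a_i is the actual price. A minimal sketch of the metric in plain Python is shown below. The function name rmsle and the example prices are my own and are only for illustration.

```python
import math


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # log1p(x) computes log(1 + x), which keeps a price of 0 well defined.
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Made-up prices in USD, for illustration only.
actual = [10.0, 25.0, 100.0]
predicted = [12.0, 20.0, 90.0]
print(rmsle(actual, predicted))
```

Because of the logarithm, predicting 20 for an item that sold for 10 is penalized about as much as predicting 200 for an item that sold for 100.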
4. Data cleaning
Handling missing values
It is necessary to check whether there are any null values in the dataset. Checking the dataset shows that the brand_name, category_name, and item_description features have null values. These null values have to be replaced with suitable substitutes. Here I have replaced nulls in category_name and brand_name with the word ‘missing’, and nulls in item_description with ‘No description yet’. There are other ways to fill these null values, such as using the most frequently occurring value in that feature.
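The replacement described above can be sketched with pandas roughly as follows. This is a minimal example on a made-up DataFrame, not the exact code used for this post, and the helper name fill_missing is hypothetical.

```python
import pandas as pd


def fill_missing(df):
    """Return a copy with nulls in the three text columns replaced."""
    placeholders = {
        "category_name": "missing",
        "brand_name": "missing",
        "item_description": "No description yet",
    }
    # Passing a dict to fillna applies each placeholder to its own column.
    return df.fillna(value=placeholders)


# Made-up rows, for illustration only.
sample = pd.DataFrame({
    "name": ["Blue t-shirt", "Keyboard"],
    "category_name": ["Men/Tops/T-shirts", None],
    "brand_name": [None, "Acme"],
    "item_description": [None, "Works well"],
})

print(sample.isnull().sum())   # null counts before cleaning
cleaned = fill_missing(sample)
print(cleaned.isnull().sum())  # null counts after cleaning
```

To fill with the most frequent value instead, the placeholder for a column could be swapped for something like sample["brand_name"].mode()[0], assuming the column has at least one non-null value.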