
Grocery Sales Prediction Using Time Series Forecasting ML Model

Forecasts aren’t just for meteorologists. Governments forecast economic growth. Scientists attempt to predict the future population. And businesses forecast product demand — a common task of professional data scientists. Forecasts are especially relevant to brick-and-mortar grocery stores, which must dance delicately with how much inventory to buy. Predict a little over, and grocers are stuck with overstocked, perishable goods. Guess a little under, and popular items quickly sell out, leading to lost revenue and upset customers. More accurate forecasting, thanks to machine learning, could help ensure retailers please customers by having just enough of the right products at the right time.

For grocery stores, more accurate forecasting can decrease food waste related to overstocking and improve customer satisfaction. The results of this case study, over time, might even ensure your local store has exactly what you need the next time you shop.


  1. Business problem
  2. Source of data
  3. ML problem
  4. EDA
  5. Featurization
  6. Model training
  7. Conclusion
  8. Future scope
  9. References

1. Business Problem

  • To use time-series forecasting to forecast store sales on data from Corporacion Favorita, a large Ecuadorian-based grocery retailer.
  • To build a model that more accurately predicts the unit sales for thousands of items sold at different Favorita stores.

2. Source of data

The data is available from the Kaggle competition Store Sales – Time Series Forecasting.

3. ML Problem

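This is a regression problem on time-series data: predict the sales of each product family at each store. The evaluation metric for this case study is Root Mean Squared Logarithmic Error (RMSLE).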
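For reference, RMSLE is the square root of the mean squared difference between log(1 + predicted) and log(1 + actual), so it penalizes relative rather than absolute errors. A minimal sketch of the standard definition follows; the function name `rmsle` and the sample values are illustrative and not taken from the original notebook.

```python
import math


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # log1p(x) = log(1 + x), which keeps zero sales well defined.
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Toy values, not taken from the competition data.
print(rmsle([0.0, 10.0, 100.0], [0.0, 12.0, 90.0]))
```

Because the error is computed on logarithms, being off by 10 units on an item that sells 20 costs far more than being off by 10 units on an item that sells 2,000.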

4. Exploratory Data Analysis (EDA)

EDA is an essential first step in any statistical or machine learning modelling task. It provides a better understanding of the variables in each dataset and the relationships between them.

Holiday events

  • date — date of the holiday
  • type — type of the holiday
  • locale — whether it’s a Local, Regional, or National holiday
  • locale_name — name of the locale (city, state, or country) where the holiday applies
  • description — description of the holiday
  • transferred — whether the holiday was transferred to another date

The distribution of locale shows that most holidays are National or Local; Regional holidays are comparatively rare.


Oil prices

  • date — date of the oil price record
  • dcoilwtico — oil price for the given date

The distribution of oil prices is bimodal: it resembles two roughly normal curves, with means of approximately 45 and 100.


Stores

  • store_nbr — the number of a store
  • city — city in which the store is located
  • state — state in which the store is located
  • type — type of store
  • cluster — cluster is a grouping of similar stores

The stores data is considerably imbalanced with respect to city, type, and cluster.


Transactions

  • date — date of the transactions
  • store_nbr — store at which the transactions were made
  • transactions — number of transactions

Train data

  • id — train id
  • date — date of the record
  • store_nbr — store number for which the sales are recorded
  • family — product family to which the sold items belong
  • sales — total sales of the product family at that store on that date
  • onpromotion — gives the total number of items in a product family that were being promoted at a store at a given date

The PDF of sales peaks at around 350 and is skewed to the right, with a long tail toward higher values.

The product families are well balanced across stores: each family has records for a considerable number of stores.

The sales volume, by contrast, is highly imbalanced across families.

5. Featurization


First, the start and end dates of the train and test datasets are checked.


The train and test datasets are concatenated and then merged with orig_stores to attach the store information to each row.
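The concatenate-then-merge step can be sketched with pandas as below. The tiny dataframes are made-up stand-ins for the competition files, and the variable names `train`, `test`, and `final` are assumptions; only `orig_stores` and the column names come from the text above.

```python
import pandas as pd

# Toy stand-ins for the competition files; all values are made up.
train = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-14", "2017-08-15"]),
    "store_nbr": [1, 2],
    "family": ["BEVERAGES", "BEVERAGES"],
    "sales": [120.0, 95.0],
})
test = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-16"]),
    "store_nbr": [1],
    "family": ["BEVERAGES"],
})
orig_stores = pd.DataFrame({
    "store_nbr": [1, 2],
    "city": ["Quito", "Guayaquil"],
    "state": ["Pichincha", "Guayas"],
    "type": ["D", "A"],
    "cluster": [13, 6],
})

# Stack train and test; test rows get NaN in the sales column.
final = pd.concat([train, test], ignore_index=True)

# A left join keeps every sales row and attaches its store attributes.
final = final.merge(orig_stores, on="store_nbr", how="left")
print(final)
```

Stacking train and test before featurization means every feature is computed once, in the same way, for both sets; the rows with a missing sales value can be split off again as the test set later.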

Holiday events

The final dataset is merged with orig_holidays_events using the join columns appropriate to each type of holiday, and unnecessary columns are then removed. Easter holidays and store closure days are added to the final dataset from hard-coded dates.
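The article does not say which columns are used for each holiday type. One plausible reading, sketched below, is that National holidays are joined on the date alone while Local holidays are joined on the date and the city. The join keys, the flag column names, and the toy rows are all assumptions, and `locale_name` is spelled as in the Kaggle file.

```python
import pandas as pd

# Toy rows; dates and holiday assignments are made up for illustration.
final = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-10", "2017-08-10", "2017-08-11"]),
    "store_nbr": [1, 2, 1],
    "city": ["Quito", "Guayaquil", "Quito"],
    "state": ["Pichincha", "Guayas", "Pichincha"],
})
orig_holidays_events = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-10", "2017-08-11"]),
    "locale": ["Local", "National"],
    "locale_name": ["Quito", "Ecuador"],
})

# Assumed: National holidays apply to every store, so the date is the only key.
is_national = orig_holidays_events["locale"] == "National"
national = (
    orig_holidays_events.loc[is_national, ["date"]]
    .drop_duplicates()
    .assign(national_holiday=1)
)
final = final.merge(national, on="date", how="left")

# Assumed: Local holidays apply only to stores in the named city.
is_local = orig_holidays_events["locale"] == "Local"
local = (
    orig_holidays_events.loc[is_local, ["date", "locale_name"]]
    .rename(columns={"locale_name": "city"})
    .drop_duplicates()
    .assign(local_holiday=1)
)
final = final.merge(local, on=["date", "city"], how="left")

# Rows without a match are not holidays.
flags = ["national_holiday", "local_holiday"]
final[flags] = final[flags].fillna(0).astype(int)
print(final)
```

Regional holidays could be handled the same way with `state` as the second key.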


Rolling means of oil prices over the last week, the last two weeks, and the last month are calculated for each date on which an oil price is available. These averages are then merged into the final dataset.
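A rough sketch of the rolling averages with pandas is shown below. The window lengths of 7, 14, and 30 days, the feature names, and the price values are assumptions; only the `dcoilwtico` column name comes from the dataset.

```python
import pandas as pd

# Made-up prices on 40 consecutive calendar days.
oil = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=40, freq="D"),
    "dcoilwtico": [50.0 + 0.5 * i for i in range(40)],
})

# Time-based windows need a sorted DatetimeIndex.
oil = oil.sort_values("date").set_index("date")

# Each window averages the observations in the trailing period,
# including the current date. Window lengths are assumed.
windows = {"oil_avg_7d": "7D", "oil_avg_14d": "14D", "oil_avg_30d": "30D"}
for column, window in windows.items():
    oil[column] = oil["dcoilwtico"].rolling(window).mean()

oil_features = oil.drop(columns="dcoilwtico").reset_index()

# Attach the averages to the main dataset by date.
final = pd.DataFrame({"date": pd.to_datetime(["2017-02-09"]), "store_nbr": [1]})
final = final.merge(oil_features, on="date", how="left")
print(final)
```

A time-based window such as "7D" is used here rather than a fixed row count because a price series with missing days would otherwise stretch a "week" over more than seven calendar days.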


The transactions dataframe is grouped by date, keeping the first transaction record for each date. Lagged transaction features are then calculated from 16, 21, 30, and 60 days before each date, so that the model can pick up the trend when forecasting future sales.
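The lag construction can be sketched as follows, assuming the lags are calendar-day offsets. The feature names and the synthetic transaction counts are illustrative; the lag values 16, 21, 30, and 60 are the ones named above.

```python
import pandas as pd

# Synthetic daily transaction counts for a single store.
transactions = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=90, freq="D"),
    "store_nbr": 1,
    "transactions": range(1000, 1090),
})

# One value per date, keeping the first record as the text describes.
base = transactions.groupby("date", as_index=False)["transactions"].first()

features = base.copy()
for lag in (16, 21, 30, 60):
    # Moving each date forward by `lag` days lines a row up with the
    # value observed `lag` days earlier.
    lagged = base.rename(columns={"transactions": f"transactions_lag_{lag}"})
    lagged = lagged.assign(date=lagged["date"] + pd.Timedelta(days=lag))
    features = features.merge(lagged, on="date", how="left")

print(features.tail())
```

The earliest rows have no history to look back on, so their lag columns are empty. Shifting the date and merging, rather than shifting by row count, keeps the lags correct even if some dates are missing from the data.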

Time features

Several time-related features are derived from the date column.
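The article does not list the exact features, so the set below is only an assumed example of what calendar features derived from a date column typically look like in pandas.

```python
import pandas as pd

final = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-15", "2017-08-19", "2017-12-31"]),
})

# Calendar components read directly from the datetime accessor.
final["year"] = final["date"].dt.year
final["month"] = final["date"].dt.month
final["day_of_month"] = final["date"].dt.day
final["day_of_week"] = final["date"].dt.dayofweek  # Monday=0 ... Sunday=6

# Binary flags for patterns that often matter in retail demand.
final["is_weekend"] = (final["day_of_week"] >= 5).astype(int)
final["is_month_end"] = final["date"].dt.is_month_end.astype(int)
print(final)
```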

Resetting the index and filling some null values

The index has been reset to store number, family, and date. This is necessary so that a separate ML model can be trained for each store number and its family, and predictions can be made more accurately.
Some null values have been replaced with 0.
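A minimal sketch of the indexing and null-filling step is shown below. Which columns were filled with 0 is not stated in the article, so the two columns used here are assumptions, as are the toy values.

```python
import numpy as np
import pandas as pd

final = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-14", "2017-08-15", "2017-08-14"]),
    "store_nbr": [1, 1, 2],
    "family": ["BEVERAGES", "BEVERAGES", "DAIRY"],
    "onpromotion": [3.0, np.nan, 1.0],
    "national_holiday": [np.nan, 1.0, np.nan],
})

# Index by store, family, and date so one group can be selected directly.
final = final.set_index(["store_nbr", "family", "date"]).sort_index()

# Assumed columns: a missing promotion count or holiday flag is read as "none".
fill_cols = ["onpromotion", "national_holiday"]
final[fill_cols] = final[fill_cols].fillna(0)

# All rows for store 1 and the BEVERAGES family, indexed by date.
subset = final.loc[(1, "BEVERAGES")]
print(subset)
```

Filling with 0 is only reasonable for columns where a missing value really means "nothing happened"; a missing oil price, for example, would be better interpolated than zeroed.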

6. Model training

The hyperparameters were selected by trial and error until the results were good, because tuning hyperparameters separately for each model is not feasible in a Colab notebook.

A Random Forest Regressor is trained for each product family with these hyperparameters and appropriate weights, and predictions are made on the test data.
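The per-family training loop might look roughly like the sketch below, which uses scikit-learn. The data is synthetic, the hyperparameter values are placeholders (the article's actual values are not given), and training on log1p(sales) is an assumption chosen because it lines up with the RMSLE metric.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the featurized dataset.
n = 200
data = pd.DataFrame({
    "family": rng.choice(["BEVERAGES", "DAIRY"], size=n),
    "onpromotion": rng.integers(0, 5, size=n),
    "day_of_week": rng.integers(0, 7, size=n),
})
data["sales"] = 20.0 * data["onpromotion"] + 5.0 * data["day_of_week"] + rng.random(n)

features = ["onpromotion", "day_of_week"]
train, test = data.iloc[:160], data.iloc[160:]

models = {}
predictions = pd.Series(np.nan, index=test.index, name="sales_pred")

for family, train_part in train.groupby("family"):
    # Placeholder hyperparameters, not the ones used in the article.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    # Fitting on log1p(sales) makes squared error on the target
    # correspond to the squared log error that RMSLE measures.
    model.fit(train_part[features], np.log1p(train_part["sales"]))
    models[family] = model

    test_part = test[test["family"] == family]
    if not test_part.empty:
        # expm1 undoes log1p to return predictions in sales units.
        predictions.loc[test_part.index] = np.expm1(model.predict(test_part[features]))

print(predictions.head())
```

The weights mentioned above are not described in the article; in scikit-learn they could be passed through the `sample_weight` argument of `fit`, for example to emphasize recent dates.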

7. Conclusion

The RMSLE obtained from the above model was 0.487, which is good enough to earn a place on the Kaggle leaderboard.

8. Future scope

Other ML models, such as XGBoost Regressor, SVM, Linear Regression, and Decision Tree, can be tried to check whether they improve the score.

9. References

Store Sales – Time Series Forecasting

Use machine learning to predict grocery sales