Pricing with Machine Learning

Apr 12, 2023


I’ve had the opportunity to work with Credit, Collection, and Pricing. Among these areas, I found that Pricing has the fewest resources that leverage data-driven strategies. In this article, I want to share how Machine Learning can help your company improve its pricing models, along with some additional data-driven ideas that can optimize your strategies even further.

Let’s dive in and see how we can take your pricing game to the next level!


To get started, you need to download the Olist E-commerce dataset from Kaggle, which you can find here. As a Data Science student, I love how comprehensive this dataset is and the many things you can do with it, including geospatial analysis, clustering, data cleaning, and more. We will be using it to price products, so I will begin by loading the necessary libraries and the dataset.
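A minimal sketch of the loading step. The olist_*_dataset.csv filenames and column names (including Kaggle's own misspelling, product_description_lenght) come from the Kaggle download, but the tiny stand-in frames below are invented so the snippet runs without the files:

```python
import pandas as pd

# In practice you would load the Kaggle CSVs, e.g.:
# orders = pd.read_csv("olist_orders_dataset.csv")
# items = pd.read_csv("olist_order_items_dataset.csv")
# products = pd.read_csv("olist_products_dataset.csv")

# Tiny stand-in frames with the same key columns, so the sketch runs anywhere
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "customer_id": ["c1", "c2", "c3"],
})
items = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "product_id": ["p1", "p2", "p1"],
    "price": [120.0, 45.5, 118.0],
    "freight_value": [12.0, 8.5, 11.0],
})
products = pd.DataFrame({
    "product_id": ["p1", "p2"],
    "product_category_name": ["beleza_saude", "esporte_lazer"],
    "product_description_lenght": [540, 210],  # sic: Kaggle's column name
    "product_weight_g": [300, 1200],
})
print(orders.shape, items.shape, products.shape)
```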

We can consolidate our various sources to create a dataset that includes orders and their relevant information:
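A sketch of that consolidation, again on tiny invented stand-ins for the Olist tables; the join keys (order_id, product_id) are the real ones from the dataset:

```python
import pandas as pd

# Invented stand-in frames mirroring the Olist tables
orders = pd.DataFrame({"order_id": ["o1", "o2", "o3"],
                       "customer_id": ["c1", "c2", "c3"]})
items = pd.DataFrame({"order_id": ["o1", "o2", "o3"],
                      "product_id": ["p1", "p2", "p1"],
                      "price": [120.0, 45.5, 118.0]})
products = pd.DataFrame({"product_id": ["p1", "p2"],
                         "product_category_name": ["beleza_saude",
                                                   "esporte_lazer"]})

# Consolidate orders, items, and product attributes into one modeling table
df = (orders
      .merge(items, on="order_id", how="inner")
      .merge(products, on="product_id", how="left"))
print(df.shape)
```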

You will see what the data looks like in the output. I won’t print it here because we wouldn’t be able to see much, as the many ID fields make the cells very large.

Let’s also check the shape of the dataset:

And the columns:
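Both checks are one-liners; on a stand-in frame:

```python
import pandas as pd

# A stand-in for the consolidated Olist table
df = pd.DataFrame({"order_id": ["o1", "o2"],
                   "product_id": ["p1", "p2"],
                   "price": [120.0, 45.5]})

print(df.shape)          # (rows, columns)
print(list(df.columns))
```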

When predicting the price of a product, many types of information may seem irrelevant, such as product IDs, while others may be debatable, like freight costs. Although customers may consider freight costs when buying a product, we can’t know for sure how much they will affect the price. As the focus of this article is only to explore how ML can enhance pricing models, I won’t be considering freight costs as a significant factor.

When selecting features for a model, it’s important to note that linear correlation isn’t the only factor to consider. A combination of variables can carry information about the target even when no single pair of them shows a high linear correlation. Moreover, nonlinear relationships won’t be captured in this type of analysis. Given these nuances, it’s still worth exploring whether any features have a high correlation with the target variable.
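A quick way to inspect linear correlations with the target, shown here on invented numbers (the column names mirror Olist’s real ones, including product_description_lenght and product_photos_qty):

```python
import pandas as pd

# Synthetic numeric features standing in for the Olist columns
df = pd.DataFrame({
    "price": [10.0, 25.0, 40.0, 55.0, 70.0, 85.0],
    "product_description_lenght": [100, 250, 380, 500, 640, 800],
    "product_weight_g": [900, 300, 1200, 500, 700, 1100],
    "product_photos_qty": [1, 2, 1, 3, 2, 2],
})

# Linear correlation of each numeric feature with the target
corr_with_price = (df.corr()["price"]
                     .drop("price")
                     .sort_values(ascending=False))
print(corr_with_price)
# A heatmap of the full matrix works too, e.g. seaborn.heatmap(df.corr())
```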

As evident from the plot, the correlations appear to be generally low, with limited insights to glean. However, it is intriguing to note that the description length exhibits a relatively higher correlation compared to other features. It is important to acknowledge that there may be several other relevant features that could potentially impact the price, such as the material used in product manufacturing, to name just one example.

I won’t go into much detail about feature engineering in this article, but there are a few strategies worth exploring in practice. For instance, transforming skewed features can improve pattern recognition. Also, there are many outliers in our data. Removing them is a highly debated topic in the field, but you can choose to do so. Just make sure to exclude the test set from any outlier removal: outliers will still be present when the model is put into production, and we need to assess how the model will perform after training.
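A sketch of both ideas: log-transforming a skewed feature and applying an IQR-based outlier filter to the training data only. The thresholds and toy data are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# A right-skewed feature, like product weight often is
train = pd.DataFrame({"product_weight_g": rng.lognormal(6, 1, 500)})

# Log-transform so patterns are easier to pick up
train["log_weight"] = np.log1p(train["product_weight_g"])

# IQR-based outlier mask, computed on the TRAINING data only --
# the test set must stay untouched to mimic production conditions
q1, q3 = train["log_weight"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = train["log_weight"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
train_clean = train[mask]
print(len(train), len(train_clean))
```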

Now, as we move forward with Machine Learning, it’s important to separate our categorical and numerical features, since they require different strategies. Additionally, let’s split our data into training and testing sets to ensure that our model is reliable and can perform well on unseen data. Don’t worry, we’ll use Cross-Validation as well, and I’d also advise validating on out-of-time data.
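One way to do the separation and the split, assuming price is the target; the toy columns are illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "product_category_name": ["beleza_saude", "esporte_lazer"] * 10,
    "product_weight_g": range(20),
    "product_description_lenght": range(100, 120),
    "price": [float(p) for p in range(30, 50)],
})

# Separate feature types -- they get different preprocessing later
categorical = df.select_dtypes(include="object").columns.tolist()
numerical = [c for c in df.select_dtypes(include="number").columns
             if c != "price"]

X = df[categorical + numerical]
y = df["price"]

# Hold out a test set; cross-validation happens inside the training split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```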

Now it’s time to create our pipeline. As we’ll be using a tree-based model, One-Hot Encoding might not be the best option to treat categorical variables. Instead, we’ll use Target Encoder. For the numerical features, scaling might not make a difference for this type of model, but we’ll do it anyway so that we can train different models, even those that are affected by scaling. We’ll use XGBoost Regressor as our algorithm of choice, as I find its sequential learning approach to be particularly effective for predictions, which is our primary goal here.

The Cross-Validation:
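The cross-validation step might look like this, on synthetic features standing in for ours (and with GradientBoostingRegressor again standing in for XGBoost):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (300, 3))
y = 2 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 1, 300)

# 5-fold cross-validation; sklearn returns negated MSE, so flip the sign
reg = GradientBoostingRegressor(random_state=0)
cv_mse = -cross_val_score(reg, X, y, scoring="neg_mean_squared_error", cv=5)
print("CV MSE per fold:", cv_mse.round(3))
```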

And the validation metrics:
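And a sketch of the held-out metric, with the same stand-ins:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (300, 3))
y = 2 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 1, 300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit on the training split and score on held-out data
reg = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
test_mse = mean_squared_error(y_test, reg.predict(X_test))
print("Test MSE:", round(test_mse, 3))
```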

As you can see, our model isn’t overfitting. The cross-validation scores range from 11145.933 to 13996.018, which is fine and close to our final MSE of 14252.017. If you run this model on your own machine, you won’t see exactly the same results, but they should be pretty close.

I also like to check the scatterplot of Actual vs Predicted:
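A matplotlib sketch of that plot, using invented actual/predicted values and the Agg backend so it runs headless:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
actual = rng.uniform(10, 200, 100)
predicted = actual + rng.normal(0, 15, 100)  # stand-in model predictions

fig, ax = plt.subplots()
ax.scatter(actual, predicted, alpha=0.5)
lims = [actual.min(), actual.max()]
ax.plot(lims, lims, "r--", label="perfect prediction")
ax.set_xlabel("Actual price")
ax.set_ylabel("Predicted price")
ax.legend()
fig.savefig("actual_vs_predicted.png")
```

Points hugging the dashed diagonal mean good predictions; systematic drift away from it reveals where the model over- or under-prices.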


This model’s performance is moderate and there is still room for improvement. We can try feature engineering, removing outliers, and tuning hyperparameters, among other things. However, we should keep in mind that the predictive power of our current features may be limited and additional information may be necessary. If you’re implementing this in a business setting, it’s important to test out these strategies while also collaborating with the Data Engineering team to explore the possibility of gathering more data.


Let’s talk about something else that’s important to keep in mind. Even though this model will help you find the optimal price for your product, it’s not a guarantee that everything will be sold or that you should use this price. Confused? I’ll show you what I mean…

Every business has a certain minimum unsold product rate it can tolerate, and for your business, that could be around 10%. However, aiming for the minimum may not necessarily be the most profitable strategy. If you sell 80 out of 100 products at 10 dollars each, you make 800 dollars, which beats the 765 dollars from selling 90 out of 100 at 8.5 dollars. This is where the term price elasticity of demand comes into play!

The term is defined as the ratio of the % change in demand of something to the % change in price. In simpler terms, it measures the extent to which your product’s sales will increase or decrease when the price is lowered or increased, respectively. By using this concept, you can determine the optimal price point that maximizes your profit. For instance, if the price your ML model suggests sells 90% of your stock, and you know by how much each dollar decrease in price increases your sales, you can plot the revenue curve and identify the price point that maximizes your profit.
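A toy illustration of that revenue curve: assume, purely for illustration, a linear demand curve q(p) = 200 - 10p; then the revenue-maximizing price falls out of a simple grid search:

```python
import numpy as np

# Illustrative linear demand: at price p, expected units sold q(p) = 200 - 10p
prices = np.arange(0.0, 20.0, 0.1)
quantity = 200 - 10 * prices
revenue = prices * quantity

best = revenue.argmax()
print(f"Optimal price: {prices[best]:.2f}, revenue: {revenue[best]:.2f}")
# Analytically, d(revenue)/dp = 200 - 20p = 0  ->  p = 10
```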


There are several ways to determine the optimal price for your product or service. One of the simplest is to use Linear Regression to understand how demand changes with the price. You can even include additional variables to create a more accurate pricing curve. A helpful article on this strategy can be found here.
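A minimal version of that strategy: on the log-log scale, the OLS slope is the elasticity. The constant-elasticity demand below is invented so the recovered value is known in advance:

```python
import numpy as np

# Synthetic demand with a known constant elasticity of -1.5:
# q = A * p^(-1.5), so log(q) = log(A) - 1.5 * log(p)
prices = np.array([5.0, 7.5, 10.0, 12.5, 15.0, 20.0])
quantity = 5000 * prices ** -1.5

# OLS on the log-log scale recovers the elasticity as the slope
slope, intercept = np.polyfit(np.log(prices), np.log(quantity), deg=1)
print(f"Estimated elasticity: {slope:.2f}")  # ~ -1.50
```

With real sales data the fit will be noisy, and adding controls (seasonality, promotions, category) usually gives a more credible pricing curve.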

Linear regression won’t always be enough, of course. In my experience, there are many situations where panel models are worth trying. For instance, say you work for a company where the price is tiered by consumption: a phone company may set one price for customers who make 0–99 calls and another for those who make 100–199 calls. Or maybe it’s something related to repairs or civil construction where the price is per meter: from 0 to 99 meters it’s 5 dollars per meter, and from 100 to 200 it’s 7.5 dollars. In cases like these, you can use a Regression Discontinuity Design around the boundaries. The intuition is that someone who makes 1 call per month is different from someone who makes 105, but someone who calls 98 times is very similar to someone who calls 102 times; the only difference between them is the price!
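A sketch of that intuition on synthetic data: units just below and just above the threshold are comparable, so the gap between two local linear fits at the cutoff estimates the effect of the price change (a jump of -8 is built into the toy data so we know the answer):

```python
import numpy as np

rng = np.random.default_rng(7)
threshold = 100  # the price tier changes at 100 calls (or meters)

# Synthetic units near the cutoff: the outcome trends smoothly in x,
# plus a jump of -8 caused purely by the price change at the threshold
x = rng.uniform(80, 120, 400)
y = 50 + 0.3 * x - 8 * (x >= threshold) + rng.normal(0, 1, 400)

# Local linear fits on each side of the cutoff, within a bandwidth
bw = 20
left = (x >= threshold - bw) & (x < threshold)
right = (x >= threshold) & (x <= threshold + bw)
b_left = np.polyfit(x[left], y[left], 1)
b_right = np.polyfit(x[right], y[right], 1)

# The effect is the gap between the two fitted lines at the cutoff
effect = np.polyval(b_right, threshold) - np.polyval(b_left, threshold)
print(f"Estimated effect at the threshold: {effect:.2f}")  # ~ -8
```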

You could also explore A/B testing to identify the elasticity if you can’t find an opportunity to run a panel model or a linear regression. Be aware that estimating elasticity can be much harder in real life than it looks in the books. I always recommend behavioral economics for people who are starting to work with pricing; there are wonderful insights in that area!


I hope this article was helpful and informative for you. I’ve been working with data for nearly 8 years and I love talking about it; I’m always trying to learn and to teach what I’ve learnt over these years. If you have any questions or comments, or if you want to discuss this topic further, please don’t hesitate to reach out to me. I am always happy to connect with other professionals in the field and exchange knowledge and ideas.

Thank you for reading!




Mathematician with a master’s degree in Economics. Working as a Data Scientist for the last 10 years.