Predicting customer churn in retail



Churn prediction has become common practice across industries. Keeping existing customers is generally several times more cost-effective than onboarding new ones. Existing customers also tend to bring in more revenue at higher margins, and up-selling and cross-selling to them is easier and more likely to succeed.

For these reasons organisations want to have a good churn prediction system.

From a modelling point of view there are two usual approaches to churn:

1. Survival approach, where we model time to churn or time to next purchase.

2. Classification approach, where we model the probability that a customer will churn in a certain time period in the future.

In this article we will go over the second approach and describe the way to build a binary classifier.

Data availability

Churn is predicted at the customer level, so every transaction has to be attributable to an individual customer, which poses problems for certain retailers. E-commerce businesses solve this easily by requiring registration, which assigns a unique customer ID to each customer. Loyalty schemes and, more recently, deliveries help alleviate the problem as well.

Defining churn

Arguably the most important part of predictive modelling is setting the target variable, i.e. defining what customer churn is. Intuitively, it is the moment in time when the customer decides to sever the relationship with your business. However, for different types of customers this can mean different things.

Most retail organisations segment their customer base into high, medium and low profitability, and this segmentation is a good basis for defining churn.

We can define the churn point as the point in time after which the customer's 3-month spend drops by a certain percentage. This is where the customer segmentation comes in: the percentage differs by segment, because a smaller percentage drop for high-value customers is more alarming than a larger drop for lower-value customers. With the target set this way, the model really predicts whether 3-month spend will drop by the segment-specific percentage.
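
As a rough illustration of such a segment-dependent target, the sketch below labels a customer as churned when spend in the following 3 months falls by at least a segment-specific fraction relative to the preceding 3 months. The threshold values and the names `DROP_THRESHOLDS` and `is_churned` are assumptions made for this example only; in practice the percentages would be agreed with the business.

```python
# Hypothetical drop thresholds per segment (illustrative values only):
# a smaller drop already counts as churn for high-value customers.
DROP_THRESHOLDS = {"high": 0.30, "medium": 0.50, "low": 0.70}

def is_churned(prev_spend, next_spend, segment, thresholds=DROP_THRESHOLDS):
    """Return True if 3-month spend fell by at least the segment's threshold.

    prev_spend: spend in the 3 months before the reference date
    next_spend: spend in the 3 months after the reference date
    """
    if prev_spend <= 0:
        # No baseline spend to drop from; such customers need separate handling.
        return False
    drop = (prev_spend - next_spend) / prev_spend
    return drop >= thresholds[segment]

# The same 35% drop is churn for a high-value customer but not for a low-value one.
high_label = is_churned(1000.0, 650.0, "high")
low_label = is_churned(100.0, 65.0, "low")
```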

Figure 1. Churn definition


Defining the model

Once we have a target definition we need to derive features. Retailers track data from all parts of their business and should use it: basket transactions, order fulfilment, complaints, web analytics and clickthrough, promotions, sign-ups, demographics.

Generally, the data can be divided into static and transactional data. Static data are easy to incorporate into the model, while transactional data require transformation in order to fit into an ML model structure.

Hence the idea is to use static data together with features derived from transactional data to predict the drop in spend in the subsequent 3 months.

Transactions can go back a long way in history, and customer behaviour also changes over time. Thus, it is advisable to limit the historical window for feature extraction to 6-12 months. The idea is that if somebody was a loyal customer and then stopped being one, we don't want to blend the two behaviours into a single set of features. The period over which observations are sampled is usually also 6-12 months.

The usual feature transformations are recency, frequency and monetary value (RFM), i.e. the number of days since the last shop, the average number of days between two consecutive shops, and the monetary value of shops.
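
The RFM transformations can be sketched in a few lines of plain Python. This is a minimal example under assumptions: each customer's transactions are available as a list of (date, amount) pairs already restricted to the feature window, and the function name `rfm_features` and the feature names are made up for illustration.

```python
from datetime import date

def rfm_features(purchases, as_of):
    """Compute simple RFM features for one customer.

    purchases: non-empty list of (purchase_date, amount) tuples that fall
               inside the feature window
    as_of:     the date on which the prediction is made
    """
    dates = sorted(d for d, _ in purchases)
    # Days between each pair of consecutive shops.
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    total = sum(amount for _, amount in purchases)
    return {
        "recency_days": (as_of - dates[-1]).days,                 # recency
        "avg_gap_days": sum(gaps) / len(gaps) if gaps else None,  # frequency
        "n_purchases": len(purchases),                            # frequency
        "total_spend": total,                                     # monetary
        "avg_spend": total / len(purchases),                      # monetary
    }

example = rfm_features(
    [(date(2023, 1, 1), 20.0), (date(2023, 1, 11), 30.0), (date(2023, 1, 31), 50.0)],
    as_of=date(2023, 2, 10),
)
```

A customer with a single purchase has no gap between shops, so `avg_gap_days` is left as `None` here; how to impute it is a modelling choice.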

Another approach is to compute features over successive time periods before the prediction date, e.g. the total number of purchases, the value of purchases, or either of these per category, in the last month, in the month before that, and so on. This kind of feature extraction also captures recency, frequency and monetary value, as well as more complex relationships. Even though the transformations are computationally slightly more expensive, the results are usually better.
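
A minimal sketch of this windowed extraction, assuming fixed 30-day windows counted back from the prediction date as a stand-in for calendar months. The function name `window_features`, its parameters and the feature names are hypothetical; per-category features would follow the same pattern with one counter per category.

```python
from datetime import date

def window_features(purchases, as_of, n_windows=3, window_days=30):
    """Count purchases and sum spend in consecutive windows before `as_of`.

    Window 1 is the most recent `window_days` days, window 2 the period
    before that, and so on.
    """
    counts = [0] * n_windows
    spend = [0.0] * n_windows
    for purchase_date, amount in purchases:
        age = (as_of - purchase_date).days
        if age < 0:
            continue  # after the prediction date: using it would leak the target
        idx = age // window_days
        if idx < n_windows:  # older purchases fall outside the feature window
            counts[idx] += 1
            spend[idx] += amount
    features = {}
    for i in range(n_windows):
        features[f"count_w{i + 1}"] = counts[i]
        features[f"spend_w{i + 1}"] = spend[i]
    return features

feats = window_features(
    [
        (date(2023, 3, 20), 10.0),  # 12 days before as_of -> window 1
        (date(2023, 3, 5), 20.0),   # 27 days before as_of -> window 1
        (date(2023, 2, 10), 40.0),  # 50 days before as_of -> window 2
        (date(2022, 12, 1), 99.0),  # 121 days before as_of -> ignored
    ],
    as_of=date(2023, 4, 1),
)
```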

Measuring performance and setting threshold

Given that we are dealing with a binary classification problem, the output of the model is the probability that a customer belongs to the target class, i.e. that the customer will churn.

To assess the performance of such a model we usually look at the receiver operating characteristic (ROC) curve. If a churn model is used to target customers, we are usually sending offers to likely churners as part of a retention programme. ROC curves fit nicely in this setting, as they show the trade-off between the true positive rate (churners we correctly target) and the false positive rate (non-churners we target unnecessarily) as the probability threshold varies.
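
To make the trade-off concrete, the sketch below computes the points of a ROC curve and the area under it from scratch. It assumes labels coded as 1 for churners and 0 for non-churners, with both classes present; the data and the function names `roc_points` and `auc` are invented for illustration. In practice a library routine such as scikit-learn's `roc_curve` would normally be used instead.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs.

    One point per distinct score used as a threshold, from the strictest
    threshold to the loosest, starting at (0, 0).
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Toy data: 1 = churned, 0 = stayed; scores are predicted churn probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
area = auc(curve)
```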

Lastly, to decide whom to target, a simple methodology is to calculate the revenue or profit at each threshold and pick the maximum. By assigning a value to each true positive and a cost to each false positive, we can find the threshold that gives the optimal return on a campaign.
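
A minimal sketch of this calculation on a scored validation set, assuming a fixed net value per correctly targeted churner and a fixed cost per offer sent to a non-churner. The monetary figures, the toy data and the function name `best_threshold` are assumptions for illustration only.

```python
def best_threshold(y_true, scores, tp_value, fp_cost):
    """Return (threshold, profit) maximising campaign profit.

    tp_value: net value of targeting a customer who would have churned
    fp_cost:  cost of sending an offer to a customer who would have stayed
    Returns (None, 0.0) if no threshold beats targeting nobody.
    """
    best = (None, 0.0)
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        profit = tp * tp_value - fp * fp_cost
        if profit > best[1]:
            best = (threshold, profit)
    return best

# Toy validation data: 1 = churned, 0 = stayed.
labels = [1, 1, 0, 1, 0, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
threshold, profit = best_threshold(labels, probs, tp_value=40.0, fp_cost=10.0)
```

With these made-up numbers the most profitable choice is to target everyone scored at 0.6 or above: lowering the threshold further adds only non-churners, each of whom costs an offer without any retained revenue.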