Deep learning for customer churn prediction in e-commerce decision support

Churn prediction is a Big Data domain, one of the most demanding use cases of recent time. It is also one of the most critical indicators of a healthy and growing business, irrespective of the size or channel of sales. This paper aims to develop a deep learning model for customers’ churn prediction in e-commerce, which is the main contribution of the article. The experiment was performed over real e-commerce data where 75% of buyers are one-off customers. The prediction based on this business specificity (many one-off customers and very few regular ones) is extremely challenging and, in a natural way, must be inaccurate to a certain extent. Looking from another perspective, correct prediction and subsequent actions resulting in a higher customer retention are very attractive for overall business performance. In such a case, predictions with 74% accuracy, 78% precision, and 68% recall are very promising. Also, the paper fills a research gap and contributes to the existing literature in the area of developing a customer churn prediction method for the retail sector by using deep learning tools based on customer churn and the full history of each customer’s transactions.


Introduction
E-commerce has provided many new opportunities to consumers, and it is still opening new ones. The rapid expansion of information technologies and the Internet has resulted in this sector's rapid growth. According to [1], the value of the e-commerce market in Poland exceeds 51 billion PLN, and 28 million Poles use online shopping in various forms. The global turnover of the e-commerce industry currently amounts to 3 trillion USD per year, and the fastest-growing market is the Asian market, whose growth dynamics reach almost 30%. For comparison, the value of the European e-commerce channel was estimated at over 602 billion EUR, of which nearly 60% was generated by three markets: British, German, and French.
The success of companies hugely depends on how well they can analyze the data on their clients' behavior. Customer churn may be considered a lost opportunity for profit. The costs of gaining new customers are usually five to even six times higher than the costs of retaining an existing customer [2]. As a result, efforts made by marketing specialists to sustain market share have switched from acquiring new customers to retaining existing ones, that is, reducing customer churn. For this reason, customer churn, also known as customer turnover, customer attrition, or customer defection, is a major concern for a number of industries. This is particularly important in the e-commerce context, where consumers are able to compare products or services and change the vendor with minimal effort.
Churn prediction is a Big Data domain, one of the most demanding use cases of recent time. It is also one of the most critical indicators of a healthy growing business, irrespective of the size or channel of sales. Analysis of customer attrition allows specialists to estimate the number of customers who will give up on the company's product or service subscription in a given time frame.
According to Ph. Kotler [3], companies annually lose 10 to 30 per cent of their customers, while acquiring new customers is about ten times costlier than maintaining existing ones. These figures indicate how valuable existing customers are for a business. According to Amy Gallo [4], depending on the industry, acquiring a new customer is anywhere from 5 to 25 times more expensive than retaining an existing one. So, it is essential to keep customers happy. For example, the telecommunications industry experiences an average annual churn rate of 30-35 per cent; additionally, it costs 5-10 times more to gain a new customer than to retain an existing one [5].
It is crucial for a contemporary business to start analyzing why customers abandon relationships with a company by cancelling services or ceasing to buy products. This type of analysis allows e-commerce specialists to modify their current activities and adjust offers so that customers' needs are better covered, resulting in a lower churn rate. This paper aims to develop a deep learning model for customer churn prediction in e-commerce. The study pertains to the prediction of customer churn in B2C e-commerce. It also fills a research gap and contributes to the existing literature in the area of developing a customer churn prediction method for the retail sector by using deep learning tools based on customer churn and the full history of each customer's transactions, which is the major contribution of the article.

Related works
In the field of e-commerce, customer churn can be placed among the most critical problems that need to be addressed and thoroughly examined.
Compared to traditional shopping in retail stores, e-commerce has a significant advantage: instant and accurate record tracking and in-depth data collection (shopping activities, order information, delivery information, billing address, etc.). This data collection allows multidimensional analysis of both customers and their buying habits, and helps businesses to treat customers as individuals and in a personalized manner. With the support of the gathered data, it is possible to create customer-centric business intelligence based on the following business concerns [6]:
- Which subpages did the customer visit? How long did they stay there? What was the sequence in which they browsed a given web page?
- Who are the most/least valuable customers? What are their distinctive characteristics?
- Who are the most/least loyal customers, and how are they characterized?
- What are customers' purchase behavior patterns?
- Which types of customers are more likely to respond to a particular promotion?
- And so on.
Many academic researchers and practitioners have been actively trying to predict customer churn by applying statistics, data mining, or machine learning strategies to the gathered data.
Customer churn is a buzzword that has been used for a long time in the field of e-commerce. Consumers who might be lost should be identified early and accurately through data mining and data analysis, and addressed in time with effective marketing measures [7].
Churn models are made to detect, as soon as possible, signals of potential churn and help to identify customers willing to abandon a given company voluntarily.
Customer churn prediction is a very demanding and challenging process aimed at identifying consumers willing to abandon a company or a service. Decision-makers and machine learning specialists focus on designing models which can help to identify early churn signals and recognize consumers on the verge of a decision between leaving or continuing. Therefore, to retain customers, academics, as well as practitioners, find it crucial to build a churn prediction model that is as accurate as possible in order to minimize the risk of customer churn [8]. Also, researchers have confirmed that customer churn prediction models can improve a company's revenue and its reputation in the market [9], [10]. Reducing the rate of churn and retaining current customers are the most cost-effective marketing approaches that will maximize shareholder value [11], [12]. Today, companies have enough information, of every kind, about the behaviour of their customers. This has created an opportunity for the machine learning (ML) community to develop predictive modelling techniques to handle customer churn prediction [13].
During the last decade, customer churn prediction has received growing attention as companies try to survive in an increasingly competitive and global marketplace [14]. Companies should strive for models that can accurately identify potential churners, and this becomes even more important in the digital economy context. Over the same period, this issue has been researched by many practitioners and academics. In contemporary literature, we can observe two main trends concerning customer churn. According to [15], the first branch includes traditional classification methods such as decision trees (DT) and logistic regression (LR) [23]. The second line of thought is based on artificial intelligence methods such as neural networks [17], [23], [24], [25], evolutionary learning [17], genetic algorithms [17], [18], random forests [26], [27], improved balanced random forests [28], K-nearest neighbour [29], fuzzy logic systems [30], and support vector machines [28]. Decision trees and logistic regression are dedicated to the analysis of continuous data; they cannot, however, guarantee the accuracy of the constructed models for large-scale, nonlinear, and high-dimensional data [31].
All of the presented models of customer churn prediction are very helpful in creating measures which can help a company to prevent customer attrition. It is worth mentioning that customer churn predictive models are usually evaluated solely on their predictive performance, that is, on their ability to correctly and separately identify churners and non-churners [21].
For the customer churn prediction problem, most of the related academic works focus on the so-called post-paid industries. In these industries, churn means that the contract with the customer ends or is terminated (e.g. banks, Internet service providers, insurance companies, and telephone service companies) [13], [32], [33]. The subject of this paper, as mentioned in the introduction, is the use of deep learning algorithms for churn prediction in the retail industry. A characteristic feature of this sector is the uncertainty surrounding the return of a customer to the same seller. As customers are not bound by any contract, they can easily abandon the existing relationship. That is why the purpose of this article is to create a model that calculates the probability that a customer will return to the same vendor and in how many days they will return.
This becomes very significant in the e-commerce context, where competitors are only a few 'mouse clicks' away, and consumers can compare and contrast competing products and services with minimal expenditure of personal time or effort and move from one company to another [34]. In our deliberations, we will narrow down the research area even further, focusing on only one part of e-commerce, namely the retail sector. We aim to develop a useful churn prediction model for the B2C context that is intended to outperform the commonly used methods for two reasons: first, it is capable of capturing the specific characteristics of B2C e-commerce relations, and second, it can predict when the customer will return to the same vendor.
In most domains, churning usually refers to losing a client. For example, [35] predicts the churn probability for prepaid clients of a cellular telecommunication company. In financial services (banking and insurance), churn is usually seen as closing accounts [32]. The authors of [36] predict the probability that an insured person will switch to another auto insurance company. As far as retail is concerned, most studies also focus on the customer's propensity to leave, in order to identify the exact moment when customers will discontinue their relationship with companies. In the retail sector literature, churn has also been considered as the partial or progressive defection of customers. The authors of [37] used several classification techniques and proposed predictive models for partial customer turnover in retail. Most customers exhibit partial defection, which may subsequently lead to a complete switch. Also, Buckinx and Van den Poel [26] used the concept of partial churn to identify customers that the company should focus on if concerned with customer retention. Nevertheless, they still talked about attrition. Churn models based on risk models have also been developed [38].
Our study pertains to the prediction of customer churn in B2C e-commerce. In contrast to existing research, we developed a deep learning model based on the full history of each customer's transactions, which can be useful in existing customer segmentation mechanisms.

Dataset and data processing
The original dataset consists of 626,275 rows and 131 columns. Each row concerns a single purchase and an aggregated history of all previous purchases of the customer who made it. The target variable, churn, indicates whether another purchase will be made by the same customer in the future. Preprocessing, conducted using the Pandas 0.25.1 library under Python 3.7.4, included the removal of duplicates, redundant columns, and outliers. Principal component analysis was used as a dimensionality reduction technique to represent highly correlated variables (absolute Spearman correlation > 0.8). After those steps, class imbalance was very high, as 79% of the rows had churn=1. Thus, random undersampling was used to achieve class balance. The final data had 152,456 rows and 113 columns (112 predictors, including base_price, discount, n_products, previous_Winter_hats, previous_Football_accessories, previous_Dresses, previous_DKNY, previous_Polo_shirts, previous_Training_shoes, previous_Stripes, previous_Lifestyle_shoes, previous_Swimsuits, previous_Care_products, previous_Wallets, previous_Gloves_and_scarves, previous_Running_shoes, previous_Backpacks, previous_Gucci, previous_Hilfiger, previous_Winter_coats, previous_Bags, previous_Shirts, previous_Casual_shoes, previous_Skirts, previous_Tops, previous_Trainers, previous_Balls, previous_Vests, previous_Basketball_shoes, previous_Jeans, previous_Sandals, previous_Underwear, previous_Tennis_shoes, previous_Outdoor_shoes, previous_Autumn_jackets, previous_Accessories, previous_Tracksuits, previous_Slippers, previous_Hiking_boots, previous_Trousers, previous_Glasses, previous_Training_accessories, previous_Sweaters, previous_Sweatshirts, previous_Shoes, previous_Shorts, previous_Clothes, previous_Hats, previous_Football_boots, previous_Armani, previous_Football_clothing, previous_Fleece, previous_Versace, previous_Socks).
In order to use a recurrent network topology, the data needed to be represented as a time series.
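The random undersampling step mentioned above can be illustrated with a minimal sketch. It uses only the Python standard library rather than the Pandas pipeline of the experiment, and the function name, the `row_id` field, and the toy data are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows of the larger classes until every class
    has as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n_keep = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, n_keep))
    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data with the imbalance reported above (79% of rows with churn=1).
labels = [1] * 79 + [0] * 21
rows = [{"row_id": i} for i in range(len(labels))]  # hypothetical rows
balanced_rows, balanced_labels = random_undersample(rows, labels, seed=42)
```

After the call, both classes contain the same number of rows as the original minority class, at the cost of discarding majority-class rows.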
Each row was therefore transformed into a two-step series, with the first step including data on previous purchases and the second step including data on the current purchase. Some variables, such as the date of the purchase, were inadequate for such a transformation and were represented as two steps with identical values. The target variable was stored as a separate, 1-dimensional vector. The new dataset consisted of 152,456 time series with 2 steps and 58 features.
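The two-step representation described above can be sketched as follows. The pairing rule is an assumption: the sketch supposes that a history column named `previous_<x>` corresponds to a current-purchase column named `<x>`, which the text does not state, and that any column without a counterpart is copied into both steps. The column names `Jeans`, `Socks`, and `purchase_month` in the example are hypothetical.

```python
PREFIX = "previous_"  # assumed naming convention for history columns


def to_two_step_series(row):
    """Turn one flat row (dict) into [previous_step, current_step].

    previous_<x> feeds step 0 and its counterpart <x> feeds step 1.
    A column without a counterpart is repeated in both steps.
    """
    bases = sorted({k[len(PREFIX):] if k.startswith(PREFIX) else k for k in row})
    previous_step = [row.get(PREFIX + b, row.get(b)) for b in bases]
    current_step = [row.get(b, row.get(PREFIX + b)) for b in bases]
    return bases, [previous_step, current_step]


# Hypothetical row: two paired features, one static feature, and the target.
raw = {"previous_Jeans": 3, "Jeans": 1,
       "previous_Socks": 0, "Socks": 2,
       "purchase_month": 11, "churn": 1}
features = {k: v for k, v in raw.items() if k != "churn"}
target = raw["churn"]  # kept apart, as a separate 1-dimensional target
names, series = to_two_step_series(features)
```

Stacking such series for all rows yields an array of shape (rows, 2, features), which is the input layout a recurrent layer expects.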

Model tuning
The prediction of customer churn was performed using two base artificial neural network topologies. The first was a multilayer perceptron (MLP) with one or two fully-connected dense layers. The second used a recurrent layer as the first hidden layer (RNN), optionally followed by an additional dense layer. Particular numbers of neurons were preliminarily selected by comparing the accuracy and F1 scores of models of different widths. When multiple models performed similarly, the simpler one was selected. Each network was optimized using a binary cross-entropy loss function. The output layer used a sigmoid activation function. To prevent overfitting, a variant of each model was built with dropout after every hidden layer, and both versions, with and without dropout, were trained and compared. To guard against the "dying ReLU" problem, each model was trained in two versions: one using the standard rectified linear unit (ReLU) activation function and one using the Leaky ReLU activation function. All models were prepared and trained using the Keras 2.3.1 library with the TensorFlow 2.0.0 backend [www.tensorflow.org].
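The activation functions and the loss named above can be written out in a few lines. This is a plain-Python illustration of the formulas, not the Keras implementation used in the experiment; the negative slope `alpha=0.3` is an assumed value, since the slope of the Leaky ReLU is not reported in the text.

```python
import math


def relu(x):
    """Standard rectified linear unit: zero output for negative inputs."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.3):
    """Leaky ReLU: a small negative slope keeps the unit responsive for x < 0.
    alpha=0.3 is an illustrative assumption."""
    return x if x > 0 else alpha * x


def sigmoid(z):
    """Numerically stable logistic function for the output layer."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


# Three example pre-activations of the output neuron and their loss.
probs = [sigmoid(z) for z in (2.0, -1.0, 0.0)]
loss = binary_cross_entropy([1, 0, 1], probs)
```

Because `relu` returns zero for every negative input, a unit whose inputs stay negative stops receiving gradient, which is the "dying ReLU" problem; `leaky_relu` keeps a non-zero slope in that region.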

Learning
A 10-fold split was performed over the dataset. Each model was trained independently 10 times, using consecutive data sections as validation sets and the remaining parts as training sets. The batch size was 10,000 randomly selected rows. The models consisting of only fully-connected layers were trained over 40 epochs. The models containing recurrent layers were trained over 60 epochs. Model accuracy on the training and validation sets was measured after each epoch. After the last epoch, additional metrics were calculated.
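The splitting scheme described above, with consecutive sections of the data serving in turn as the validation set, can be sketched as follows. The function name is hypothetical, and whether the rows were shuffled before splitting is not stated in the text, so the sketch works on row positions as given.

```python
def k_fold_indices(n_rows, k=10):
    """Yield (train_idx, val_idx) pairs; fold i validates on the i-th
    consecutive section and trains on all remaining rows."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_rows))
        yield train_idx, val_idx
        start = stop


# Small example: 25 rows split into 10 folds.
folds = list(k_fold_indices(n_rows=25, k=10))
```

Each row appears in exactly one validation section, so every model of a given architecture is validated on data it was not trained on.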

Experimental results
The trained models were used to predict samples from the validation set. Based on these predictions, a set of metrics was calculated and presented in Table 1. Accuracy, precision, recall, and confusion matrices were calculated for a prediction threshold of 0.5. In each table row, the metrics were averaged over the 10 independently trained models of the same architecture (one per fold of the 10-fold split). The first column describes the number, type, and width of hidden layers. The second column indicates the probability of dropout for each hidden layer. The third column concerns the activation function used. The next columns contain the averaged metric values and their standard errors, in parentheses. The last column presents an averaged confusion matrix of the model, denoted using estimated probability values in percentages.
In order to compare the results of the 16 models, a series of statistical tests was performed, all with significance level α=0.05. The introductory testing for normality, performed using the Shapiro-Wilk test, revealed non-normality in some groups for every metric. A Levene test suggested heterogeneity of variance of the training accuracy between the results of the models. The results of those tests implied further use of non-parametric methods. A Kruskal-Wallis test was used to check the equality of medians between the groups, while Dunn's test with a Holm adjustment was applied for the post-hoc analysis. A Levene test with a Bonferroni correction was used for pairwise comparisons of variance. Simultaneous inference of multiple metrics was performed using Friedman's test, with a Nemenyi test used for the post-hoc analysis.
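The thresholded metrics and their aggregation over folds, as described above, can be sketched in plain Python. The function names and the toy values are assumptions for illustration; in particular, treating a probability exactly equal to the threshold as a positive prediction is a choice made here, not one reported in the text.

```python
import math


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall, and a confusion matrix in percentages."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0  # assumption: ties count as positive
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    n = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "confusion_pct": {"tn": 100 * tn / n, "fp": 100 * fp / n,
                          "fn": 100 * fn / n, "tp": 100 * tp / n},
    }


def mean_and_standard_error(values):
    """Mean of per-fold values and the standard error of that mean."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


# Toy validation fold with eight samples.
metrics = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                                 [0.9, 0.6, 0.4, 0.2, 0.7, 0.1, 0.8, 0.3])
# Toy per-fold accuracies standing in for the 10 folds.
mean_acc, se_acc = mean_and_standard_error([0.72, 0.74, 0.76])
```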
The resulting models were of similar quality, with a global average accuracy of 73.6%. Accuracy was significantly less volatile than precision (Nemenyi p=1.4×10⁻³) and recall (p=2.6×10⁻⁵). AROC had lower variance than recall, with marginal significance (p=0.02).
The Friedman test revealed a marginally significant difference of the combined metrics between the models (p=0.01). The Nemenyi post-hoc test suggested only one relevant difference (p=0.05), as the one-layer recurrent model without dropout and with ReLU activation performed better than the one-layer MLP with 30% dropout and Leaky ReLU, considering all the measured metrics. Although only two models were compellingly different in overall performance, there were 92 significant differences between particular metrics. Dunn's test suggested 30 differences in the training accuracy between models. They overlapped with the only 5 significant differences in test accuracy. There were 27 relevant differences between models in precision and 26 in recall. AROC values consistently differed between 4 models. Table 2 presents the results of Dunn's test. Each pair of values indicates the number of models that performed, respectively, worse and better than the competitive model.
Table 2. Topologies of the models and aggregated results of the Dunn's post-hoc analysis.
Most models had a similar variance of all measured metrics. Precision was significantly less volatile (p=0.03) in the model with one recurrent and one dense layer, with 30% dropout and Leaky ReLU activations, than in the model with one dense layer, without dropout and with the ReLU activation function. The model with two dense layers, 30% dropout, and Leaky ReLU had a lower variance of training accuracy than the same model without dropout (p=0.03), and than the recurrent model without a dense layer and dropout, with Leaky ReLU activation (p=0.04). Also, the model with one dense layer, 30% dropout, and Leaky ReLU activation had a significantly lower variance than the model with two dense layers, no dropout, and Leaky ReLU, with p=0.04.

Conclusions
The experiment was performed over real e-commerce data in an industry where 75% of buyers are one-off customers. This means that three quarters of the customers made a purchase only once and never returned to the store. In contrast, regular customers (with more than 5 purchases) account for only 2% of the whole population. Such conditions in the e-commerce business make the input dataset unbalanced, which was mentioned in the specification of the method. This makes churn prediction much more challenging than in many other lines of business. Prediction based on this business specificity (many one-off customers and very few regular ones) is extremely challenging and, in a natural way, must be inaccurate to a certain level. Looking from another perspective, correct prediction and subsequent actions resulting in higher customer retention are very attractive for the overall business performance. In such a case, prediction with 74% accuracy, 78% precision, and 68% recall is very promising, even though similar results could be considered insufficient in other business cases.
The presented research has a preliminary status. The main disadvantage is using only a filter method of feature selection. The application of wrapper methods is needed for the reduction of the input attribute set. Also, instead of using random sampling to generate the training and test datasets, it might be interesting to develop an approach ensuring that all transactions of a customer exist in only one dataset. Another very important issue is the identification of the point in time at which the customer will return to the same retailer. Such an approach can better address the churn problem in a retail business due to the unclear definition of the churned customer. The research in these areas will be performed in future work.
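The suggestion above of keeping all transactions of one customer in a single dataset can be sketched as a split over customers rather than over rows. This is a minimal illustration of that idea and not part of the reported experiment; the function name, the `val_fraction` parameter, and the customer identifiers are hypothetical.

```python
import random


def split_by_customer(customer_ids, val_fraction=0.1, seed=0):
    """Split row positions so that every customer's rows land entirely
    in either the training set or the validation set."""
    rng = random.Random(seed)
    unique_ids = sorted(set(customer_ids))
    rng.shuffle(unique_ids)
    n_val = max(1, round(len(unique_ids) * val_fraction))
    val_customers = set(unique_ids[:n_val])
    train_idx = [i for i, c in enumerate(customer_ids) if c not in val_customers]
    val_idx = [i for i, c in enumerate(customer_ids) if c in val_customers]
    return train_idx, val_idx


# Ten transactions made by six hypothetical customers.
customer_ids = ["a", "a", "b", "c", "c", "c", "d", "e", "e", "f"]
train_idx, val_idx = split_by_customer(customer_ids, val_fraction=0.34, seed=1)
```

Because the sampling unit is the customer, the validation fraction refers to customers, and the share of rows in the validation set depends on how many transactions the selected customers made.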