Benchmark Analysis of Algorithms for Customer Churn Analysis in the Telecommunication Sector

Ramazan Akkoyun
Dec 19, 2021 · 27 min read

Economic life began with the concept of trade and with the entry of terms such as exchange, shopping and money into daily life, and trade and money have only grown in importance since. Today, trade is carried out between two parties: the party receiving a service and the party providing it. So what is a customer? In the most basic sense, we define a “customer” as a user who buys a service or product from a company or institution and pays for it.

With the origins of modern life such as the invention of electricity and the telephone, the development of industry and the transition to mass production, the economic model that directly or indirectly concerns all of us has emerged thanks to technology that has developed continuously since the second half of the 20th century. Early in the formation of this model, each sector was typically served by a single company that provided the service or product; as raw materials became easier to access and technology spread, it became inevitable for new players to enter each sector, and the competitive environment intensified steadily. Considering the current economy, we can say that the customer is the reason for the existence of any institution that provides services or products. The axis that was previously corporate-centered has shifted to the customer: products and services are developed with a focus on the customer, and the customer’s choices are the main factor. Institutions build their products on a customer-centered structure and try to survive in the competitive environment by analyzing operational and marketing strategies with analytical methods. As competition grows in a rapidly changing market, convincing and retaining existing customers is of great importance, and in terms of competitiveness and sustainability this should be treated as a priority.

In addition, studies show that retaining an existing customer is both more likely to succeed and less costly than acquiring a new one. If retaining an existing customer costs 1x, acquiring a new customer costs about 5x, it can cost up to 16x to make a new customer as profitable as an existing one, and reducing churn by just 5% can increase profitability by 25%-125% [9]. According to a study covering 33 countries and more than 24 thousand participants, 68% of customers who left the institution they received service from stated that they would not use that institution again. Beyond cost, losing an existing customer to a competitor also damages the corporate image, so other customers may be adversely affected and leave as well. Moreover, the brand value of an institution tends to track its number of active customers. From a financial point of view, issues such as cost and profitability ratios, investment capacity and cash operations are directly related to the number of active customers, which is why customer loyalty is so important.

In general, “loss for a business represents both a decrease in the customer base and a decrease in the returns those customers bring to the business”. Although the rate varies by sector, acquiring new customers is generally more costly than retaining existing ones across all sectors. It is therefore vital for businesses to perform the right analyses at the right time in order to compete effectively and efficiently. [24]

Customer churn decision analysis [19]

Customer churn analysis refers to the analysis methods that allow us to calculate the probability that a customer will stop using a product or service and to make decisions based on those probability values. In other words, customer churn analysis is the process of profiling customers who are likely to leave and, by examining customer behavior, identifying those with a high probability of churning. It is not a single-stage process but should be considered as a whole together with the actions that follow it. The first stage is calculating the probability of customer churn; there should also be a second stage that answers questions such as how we can make the customer happy before they leave.

The proposed model and analysis should be capable of finding the reasons behind customer churn and of suggesting measures to retain customers in order to prevent it. It should also use techniques to predict when churn is likely to occur in the future. Focusing measurement on people who already qualify as loyal customers within the organization would be the wrong approach, because the part that interests us is the target audience whose needs we have not yet been able to determine. Customer retention should be analyzed with realistic targets, and every situation that causes loss should be reviewed.

We can characterize the factors that are important in churn analysis as follows:

  • Since our aim is to determine who the churners are, we first ask the “who” question. Then we try to group them in order to categorize their behavior.
  • Because a customer’s abandonment does not affect only that customer, the potential knock-on effects should also be examined. These may include the departing customer influencing other customers, or the question of whether resources are being directed at the right customers.
  • Strict measures should be taken against unpredictable situations, and the suitability of these measures should be checked constantly.

User growth statistics for Amazon Prime [16]

Today, when the number of users and the rate of new customer acquisition have reached a saturation point, acquiring new customers is getting more and more difficult. For this reason, institutions prioritize and actively use customer churn analysis applications and systems in order to retain their existing customers and survive the competition. In Turkey, as new players enter markets such as banking, telecommunications, e-commerce and cargo/logistics, whose user numbers increase day by day, institutions try many ways not to lose their customers to competitors. One of these is the customer churn analysis system. To perform this analysis, obtain accurate results and thus suffer minimal customer loss, institutions make large investments in both infrastructure and human resources.

Market share of mobile GSM operators in Turkey

The sector with the highest customer loss is telecommunications. The history of this industry dates back to the mid-1800s. Although the process is newer in Turkey, customer loss reached its highest level in 2008 with the introduction of mobile number portability (“Number Transfer”) campaigns. It is not possible to prevent this situation when we cannot offset disadvantages such as failing to build loyalty and offering goods and services that do not meet customers’ demands.

Considering the existing economic model, we mentioned that the customer is the reason for the existence of an institution that provides a service or product. For this reason, increasing customer satisfaction, the opposite of customer loss, has special importance. A customer’s satisfaction level or churn likelihood is revealed by collecting, processing and interpreting data through steps such as the following. [23]

  •  Collection of customer data
  •  Examination of customer habits
  •  Classification of customers
  •  Increasing customer communication channels
  •  Increasing sales-oriented marketing strategies

Churn chart of mobile GSM operators in Turkey [13]

Types of Customer Loss

Customer loss means that “customers give up preferring the institution or company from which they buy the service or product due to competition” [24]. Customers may tend to leave the institution they receive service from because of access to cutting-edge technologies, social responsibility projects, fees, switching costs, the effect of good advertising, geographical reasons, or other service promotions and campaigns [25].

Customer loss comes in two types, depending on whether the circumstances that cause it are voluntary or involuntary. Voluntary loss is the situation where a customer leaves the current business and buys or prefers the same good or service from another business. Involuntary loss is abandonment caused by force majeure or other undesirable reasons outside the customer’s own wishes and demands. The important question concerns the reasons why a customer who voluntarily parted ways with us and chose a competitor changed their preference. Involuntary loss is often ignored in statistical and data mining studies, mainly because that kind of loss cannot be prevented.

Customer churn analysis should not be treated only as counting customer losses, but should be analyzed through cause and effect relationships, because the results obtained and the action plans built on them will guide the retention of existing customers, the acquisition of new customers and the recovery of customers who have already left.

Customer churn analysis can basically be examined in 4 branches:

  •  New customer acquisition
  •  Retaining existing customers
  •  Minimizing the loss of customers
  •  Customer reacquisition

Customer Churn Reduction Methods

First of all, it is necessary to measure the situations that may cause the loss of a customer. The cost of retaining a customer is lower than the cost of losing one. The following measures help reduce churn:

  •  Communication and feedback
  •  Empathy
  •  Not discriminating negatively against the customer
  •  Not creating false expectations
  •  Not increasing prices in a way that exploits the customer
  •  Analyzing competitors well
  •  Making the customer feel valued

In the models developed for customer loss, the customer’s past habits and the changes in them are examined, so it is not possible for companies with millions or even billions of customers to carry out this study manually for each customer. For this reason, efforts to reach faster and better solutions using data mining, artificial neural networks, machine learning and deep learning techniques are increasing every day.

Thanks to the spread of mobile devices after 2010 and the widespread availability of the internet, work in the field we call big data gained momentum. As a result, many data mining tools and machine learning models that can analyze data have been developed, various solutions have been built with them, and great progress has been made in data analysis. What these techniques have in common is that they can analyze the data, estimate the probability of customer churn and help determine the reasons behind it.

In this article, telecommunication data is examined and five different algorithms are compared to obtain the best result.

MATERIALS and METHODOLOGY

  • Dataset and Features

For this article, telecommunication sector data was chosen, a sector where competition is extremely high and customer retention is just as important. Our dataset consists of 7043 customer records and contains 21 columns in total: 20 features and 1 label indicating the churn decision. Of the 20 features, 4 are demographic, 9 describe the different services purchased, 6 contain information about the customer’s account, and the remaining one is a customer identifier.

In our labeled data, churn is “No” for 5163 records and “Yes” for the remaining 1869 records, giving a churn rate of 26.58%.
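As a minimal sketch, these counts can be reproduced directly with Pandas; the file name below is illustrative and assumes the standard column names of the IBM Telco Customer Churn CSV:

```python
import pandas as pd

# File name is illustrative; any copy of the IBM Telco Customer Churn CSV will do.
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

# Count the churn labels and compute the churn rate.
counts = df["Churn"].value_counts()
churn_rate = counts["Yes"] / counts.sum()

print(counts)
print(f"Churn rate: {churn_rate:.2%}")
```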

  • Tools and Programming Languages

Python was chosen as the programming language for the study, and the source code was run in the Google Colab cloud environment. In addition, Python packages that make mathematical and statistical operations easier were used: NumPy for linear algebra, Pandas for file and data-frame operations, Seaborn and Matplotlib for data visualization, and Scikit-learn for training the machine learning models.

  • Algorithms

In this article, 5 different algorithm techniques were used to obtain the best result. These are as follows:

  1. Logistic Regression
  2. Random Forest
  3. Support Vector Machine
  4. AdaBoost
  5. XGBoost

A confusion matrix is a table often used to describe the performance of a classification model on a set of test data for which the actual values are known.

Confusion Matrix [8]
  • True Negative: we predict that something does not exist, and it really does not exist. The predicted value is negative and the actual value is negative.
  • False Positive: we predict that something exists when it really does not. The predicted value is positive but the actual value is negative.
  • False Negative: we predict that something does not exist when it actually does. The predicted value is negative but the actual value is positive.
  • True Positive: we predict that something exists, and it actually does. The predicted value is positive and the actual value is positive.

We can calculate the accuracy of our model with:
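Accuracy is the share of correct predictions over all predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

A minimal scikit-learn sketch, assuming y_test and y_pred already hold the true and predicted churn labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# y_test: true labels, y_pred: model predictions (assumed to exist already).
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)                        # computed from the confusion matrix
print(accuracy_score(y_test, y_pred))  # same value via scikit-learn
```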

a) Logistic Regression

Logistic Regression [11]

In machine learning, the algorithms we call prediction (regression) algorithms are used to estimate numerical data, while non-numerical, categorical data is estimated using classification techniques. Logistic regression is a method that allows us to classify. It can work with both categorical and numerical inputs, and it is used when the outcome we are looking for can take only two values, for example Good/Bad, Yes/No, More/Less. In logistic regression, we are only concerned with the probability of the outcome (the dependent variable being true or false).

In logistic regression, the model parameters are estimated with the maximum likelihood estimation (MLE) method.

Sigmoid function [15]

Since the sigmoid function is used in logistic regression, the resulting graph is a curve; the sigmoid function is S-shaped by its mathematical nature. Because its output values lie between 0 and 1, it should only be used in categorical problems with two possible outcomes. Therefore, if we want a probability as the output of our model, the sigmoid function is a natural choice, since by definition a probability can only lie between 0 and 1. [12]
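As a small illustration, the sigmoid function σ(z) = 1 / (1 + e^(−z)) can be plotted directly; the names below are just local variables for this sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """Map any real-valued score to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6, 6, 200)
plt.plot(z, sigmoid(z))           # the S-shaped curve described above
plt.axhline(0.5, linestyle="--")  # 0.5 is the usual decision threshold
plt.xlabel("z (linear combination of features)")
plt.ylabel("estimated probability")
plt.show()
```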

b) Random Forest

The Random Forest algorithm is a flexible, easy-to-use supervised machine learning algorithm that usually produces good results even without hyper-parameter tuning. Put another way, Random Forest is an ensemble learning method that aims to increase classification quality by producing more than one decision tree during the classification process. The individually built decision trees come together to form a decision forest, and each tree is trained on a randomly selected subset of the data set. It is one of the most widely used and popular algorithms today, because it is simple, easy to use and applicable to both classification and regression problems.

To understand how the Random Forest algorithm works, we first need to understand decision trees, which are the basis of this algorithm. In the decision tree logic, the training data set is given as input and the tree derives a set of rules from this data to make predictions. The Random Forest algorithm, in turn, contains many such decision trees.

One of the biggest problems of decision trees is over-learning, or overfitting, the data. In models developed with the Random Forest algorithm, different subsets are randomly selected from both the data set and the feature set, and separate trees are trained on them to solve this problem. Many different decision trees are thus formed, each makes an individual prediction, and the individual estimates are averaged. In the Random Forest model, variance and overfitting decrease because training takes place on different data sets; in this sense Random Forest can be described as a composition of decision trees. [10]

As the name suggests, the forest is formed by combining randomly built decision trees, usually trained with ensemble methods such as bagging. The main purpose of using bagging here is to combine several learning models in order to improve the overall result as much as possible. In the simplest terms, the Random Forest model creates many decision trees and combines their results, which yields more accurate and stable predictions.

In this algorithm, additional randomness is added to the model as the trees grow. When splitting a node, instead of searching for the best split among all features, the best feature is searched for within a randomly selected subset of features. Searching in subsets rather than the whole increases diversity and therefore generally gives better results. The split threshold can also be randomized for each feature instead of searching for the best possible threshold, whereas ordinary decision trees always look for the best possible threshold.

One useful property of the Random Forest algorithm is that it is very easy to measure the overall importance of each feature for the prediction. The importance of a feature reflects how much that feature contributes to explaining the variance of the dependent variable. With this measure we can choose which features to use and which to drop: features of little importance, or features we believe will not help during model training, should be removed so that complexity is reduced and training is more successful. [14]
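A minimal sketch of how feature importances can be read off a trained forest with scikit-learn; X_train and y_train are assumed to be the preprocessed (numerically encoded) telco features and churn labels:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train are assumed to exist (encoded features and churn labels).
rf = RandomForestClassifier(n_estimators=200, random_state=101)
rf.fit(X_train, y_train)

# Rank features by their impurity-based importance.
importances = (
    pd.Series(rf.feature_importances_, index=X_train.columns)
    .sort_values(ascending=False)
)
print(importances.head(10))  # low-importance features are candidates for removal
```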

The Gini index is a metric that measures how often a randomly selected item would be classified incorrectly. In this algorithm the branching is binary, left and right, and the index can be calculated very quickly. A feature with a low Gini index should be preferred. The Gini index treats the categorical target variable as pass or fail and only performs binary splits. Each decision tree that is created is left in its widest form and is not pruned.

The Gini index is “calculated to measure the homogeneity of the classes”. The lower the Gini index, the “more homogeneous” the class. A branch is successful when the Gini index of a child node is less than the Gini index of a parent node.

The Gini index is calculated using the following formula:

Gini = 1 − Σᵢ₌₁ᶜ (pᵢ)²

where pᵢ is the proportion of samples in the node that belong to class i, and c is the number of classes.

After the Gini index is determined, the test data are classified based on it, and the most appropriate classification is chosen from the combined results. The comparison is made over Gini indices: if the index computed for the test sample matches that of the training data, the test sample falls into the corresponding class.
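A small sketch of the Gini impurity computation for a single node, matching the formula above:

```python
import numpy as np

def gini_index(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

print(gini_index(["yes", "yes", "no", "no"]))  # 0.5 -> maximally mixed node (2 classes)
print(gini_index(["no", "no", "no", "no"]))    # 0.0 -> perfectly homogeneous node
```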

The most important advantage of Random Forest is that it is very practical and easy to use. It can achieve more precise results than the AdaBoost and Support Vector Machine algorithms on many datasets, it can be trained in a very short time, and it is relatively robust to noisy data.

Prediction can be slow because we create so many trees: although the trees can be trained quickly, producing predictions once they are trained takes longer than the training itself. If we want to increase accuracy we need to use more trees, which leaves us in a dilemma because it also increases the time required.

The Random Forest algorithm can be used when the data is not linear, and when extrapolation outside the range of the training data is not important.

c) Support Vector Machine

Support Vector Machine is a fast and dependable algorithm that performs very well for classification, and it is especially practical for small datasets.

In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.

Basic SVM

In the example shown above we have two classes, black and white. Our main goal is to decide whether an incoming data point belongs to the black class or the white class. To make this decision, a line separating the two classes is drawn, and the region within ±1 of this line is called the margin. The wider the margin, the better the classes are separated. The drawn line is called a hyperplane.

This line can also be described as the decision boundary; the two sides of the line correspond to the two classes we want to separate in our model.

In a Support Vector Machine, the separating plane drawn so as to leave the maximum distance to the data we want to classify is the best one for us. This plane is referred to as the hyperplane in the terminology. Our aim is for the closest member of each class to be as far as possible from the hyperplane.

Depending on the data, the margin region is not always empty; in some cases, points from our dataset fall inside it. This situation is called a soft margin. When the data can be separated linearly and there are no outliers, a hard margin is more useful; in other cases, a soft margin is preferable.

Hard Margin vs Soft Margin [7]

Low dimensions may not be sufficient to separate complex datasets, but if we simply increase the dimensionality, training takes too long because the amount of computation grows. To address this problem we resort to a method called the kernel trick: by transforming the data with certain kernel functions, we can make it much easier to separate.
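A minimal scikit-learn sketch of an SVM classifier with the RBF kernel (one common way to apply the kernel trick); X_train, y_train, X_test and y_test are assumed to exist, and the features are standardized because SVMs are sensitive to scale:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svm = make_pipeline(
    StandardScaler(),          # SVMs are sensitive to feature scale
    SVC(kernel="rbf", C=1.0),  # RBF kernel = kernel trick; C controls margin softness
)
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))  # accuracy on the held-out test set
```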

d) Adaptive Boosting (AdaBoost)

AdaBoost, short for Adaptive Boosting, was the first practical boosting algorithm and remains one of the most widely used and studied today. Boosting is a general strategy for learning “strong models” by combining multiple simpler ones. A “weak learner” is a model that does at least slightly better than chance. AdaBoost can be applied to any classification algorithm, but most often it is used with decision stumps: trees with a single split node and two leaves.

AdaBoost embodies a specific philosophy in machine learning: as an ensemble learning tool, it starts from the basic idea that many weak learners can be combined into a stronger learner that achieves better results. [20]

A single isolated classifier may not be able to predict correctly on its own, but the model becomes stronger over time thanks to the predictions produced by many classifiers of the same simple structure, each learning from the mistakes of the ones before it.

AdaBoost can be used alone or on top of any other classifier. In this way, the classifier on which it is applied can be a model that can learn from its mistakes and produce a more accurate result.

A weak learner is a sub-model that performs better than random guessing but still has a lower success rate than we would want for the final classification.

A decision stump does this by dividing the samples into two subgroups based on the value of a single feature: it selects a feature, for example X2, and a threshold T, and then splits the samples into two groups on either side of the threshold.

To find the decision stump that best fits the samples, we can try every feature of the input with every possible threshold and see which gives the best accuracy. While there may seem to be an infinite number of options for the threshold, two different thresholds are only meaningfully different if they put some examples on different sides of the split. To test every possibility, we can sort the samples by the feature in question and try a threshold falling between each pair of adjacent samples. [26]

First, initialize the sample weights uniformly (see the formulas after these steps).

Then, for each iteration t:

Step 1: A weak learner is trained on the weighted training data X. The weight wᵢ of each sample indicates how important it is to classify that sample correctly.

Step 2: After training, the weak learner receives a weight based on its accuracy.

Step 3: The weights of misclassified samples are increased.

Step 4: The weights are renormalized so that they sum to 1.

After the iterations are over, predictions are made using a weighted (linear) combination of the weak learners.
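The formulas from the original figures are not reproduced above; in the standard AdaBoost formulation (with labels yᵢ ∈ {−1, +1}), which the steps paraphrase, they are:

wᵢ⁽¹⁾ = 1/N for i = 1, …, N (uniform initialization)

εₜ = Σᵢ wᵢ⁽ᵗ⁾ · [hₜ(xᵢ) ≠ yᵢ] (weighted error of the weak learner hₜ)

αₜ = ½ · ln((1 − εₜ) / εₜ) (weight of the weak learner)

wᵢ⁽ᵗ⁺¹⁾ ∝ wᵢ⁽ᵗ⁾ · exp(−αₜ · yᵢ · hₜ(xᵢ)), renormalized so that Σᵢ wᵢ⁽ᵗ⁺¹⁾ = 1

H(x) = sign(Σₜ αₜ · hₜ(x)) (final prediction)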

e) Extreme Gradient Boosting (XGBoost)

XGBoost (eXtreme Gradient Boosting) is a high-performance version of the Gradient Boosting algorithm, optimized with various modifications. Some of its most important features are its high predictive accuracy, its ability to prevent overfitting and handle null values, and its speed. Thanks to hardware and software optimization techniques, better results can be obtained with fewer resources. It is currently regarded as one of the best algorithms built on decision trees.

Evaluation of Boosting Algorithms

XGBoost combines the concepts of Gradient Descent and Boosting. Boosting is an ensemble learning method in which we take a subset of our data set and build a model on it; each subsequent base model (weak learner) is not independent of the first, but works to improve it. The word gradient describes a slope in mathematics, and it is the preferred term in multidimensional settings where a function is differentiated with respect to several variables.

The first step in XGBoost is the base score prediction. This estimate can be any number, because the correct result is reached by converging towards it in the following steps; by default it is 0.5. How good this estimate is is measured through the model’s prediction errors (residuals), which are found by subtracting the predicted value from the observed value. [5]

The next step is to construct a decision tree that predicts the errors. The aim is to learn from the errors and move closer to the correct prediction. A similarity score is calculated for each branch of the tree; it indicates how well the residuals are grouped within the branch.
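The similarity score formula from the original figure is not reproduced here; in the usual XGBoost formulation for regression it is

Similarity = (Σ residuals)² / (number of residuals + λ)

and for binary classification the denominator count is replaced by Σ p·(1 − p) over the previously predicted probabilities p.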

In this formula, lambda (λ) is the regularization parameter.

After the similarity scores are calculated, the next question is whether a better split can be found. Trees for all possible splits are constructed to answer this question, and similarity scores are calculated for each of them. The gain is then calculated to decide which tree is better: branches are evaluated with the similarity score, while the tree as a whole is evaluated with the gain.

After determining which tree has the highest gain and deciding to use it, the pruning process begins. For pruning, a value called gamma is selected and compared with the gain score: branches with a gain lower than gamma are pruned. Increasing gamma therefore helps keep only valuable branches in the tree and prevents over-learning. Pruning proceeds from the last branch upwards; if it is decided not to prune the lowest branch, there is no need to examine the branches above it.

The default gamma value is 0, which means a branch is pruned if its gain is negative. If all branches are pruned, the prediction falls back to the initial base score of 0.5.

The lambda value regularizes the model. According to the similarity score formula, as lambda increases the similarity score, and hence the gain, decreases; as a result the model prunes more aggressively, and only high-scoring branches survive.

Lambda defaults to 1. As lambda increases, learning becomes harder but overfitting decreases. Because of its position in the formula, lambda affects branches with few values more strongly: the more values a branch contains, the less lambda lowers its similarity score, so the algorithm is discouraged from building branches with only a few samples. This also helps prevent over-learning. As lambda increases, the output values become smaller than they would otherwise be, so the higher the lambda, the more iterations are needed to reach the correct prediction.

Unlike the other models, XGBoost accepts data in a structure called DMatrix rather than as a NumPy array or Pandas DataFrame, so we need to convert the data to DMatrix format before building our model.
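A minimal sketch of the native XGBoost workflow with DMatrix, using the gamma and lambda parameters discussed above; X_train, X_test, y_train and y_test are assumed to be the prepared telco data with 0/1 churn labels, and the parameter values are illustrative:

```python
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {
    "objective": "binary:logistic",  # churn / no-churn classification
    "eta": 0.1,                      # learning rate
    "gamma": 0,                      # minimum gain required to keep a split (pruning)
    "lambda": 1,                     # L2 regularization term described above
    "max_depth": 4,
}

model = xgb.train(params, dtrain, num_boost_round=200)
pred_proba = model.predict(dtest)            # probabilities between 0 and 1
pred_label = (pred_proba > 0.5).astype(int)  # threshold at the default 0.5
```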

  • Programming

The first step of the programming phase was to determine the data set to be used; the telecommunications industry data shared as open source by IBM was preferred. The process then progressed through data selection, exploratory data analysis, selection of algorithms, training of the models, testing and comparison of the results. In the exploratory data analysis phase, the data was analyzed demographically and improved using methods such as dirty-data cleaning, outlier detection and data transformation. Once the data was ready, the feature selection step was run and the feature set on which the models would be trained was decided.

The model training and testing phases were run for 5 different algorithms. In addition, an iterative search was carried out for 3 of the algorithms to determine their hyper-parameter values. Since the hyper-parameter search repeatedly retrains the model with different values, this training and testing process takes a very long time.
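The exact grids searched in the study are not listed, so the following is only an illustrative sketch of such an iterative search with GridSearchCV, here applied to the random forest model:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid only; the values actually searched in the study are not listed here.
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [4, 6, 8, None],
    "min_samples_leaf": [1, 2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=101),
    param_grid,
    scoring="accuracy",
    cv=5,        # each candidate is retrained several times, which is why this step is slow
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```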

For model training, the training set ratio was set to 0.75 and the test set ratio to 0.25, with the random_state value set to 101, as in the sketch below.
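A minimal sketch of the split with these exact settings; X and y are assumed to hold the encoded features and churn labels:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,    # 75% training / 25% testing, as described above
    random_state=101,  # fixed seed so the split is reproducible
)
```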

The algorithm training started with the logistic regression model. The model was trained with default parameter values and achieved an accuracy of 0.8145; the F1-score and confusion matrix were then calculated, as sketched below.
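A sketch of this first training and evaluation step; the exact scores depend on the preprocessing, and max_iter is raised here only so that the solver converges cleanly (the study itself reports default parameters):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

log_reg = LogisticRegression(max_iter=1000)  # otherwise default parameters
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred))  # assumes churn is encoded as 1
print(confusion_matrix(y_test, y_pred))
```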

According to our logistic regression model, the two biggest factors driving the churn decision were total charges and contract type. The most important features working against churn were how many years the service had been used, having internet service and having a long-term contract.

The random forest algorithm was used as the second algorithm. Its accuracy was 0.8179; the F1-score and confusion matrix were then calculated.

According to our random forest model, the three most important factors in the churn decision were contract type, how long the service had been used and total charges.

Our third algorithm was the support vector machine. Its accuracy was 0.8225; the F1-score and confusion matrix were then calculated.

Our fourth algorithm was AdaBoost. Its accuracy was 0.8202; the F1-score and confusion matrix were then calculated.

Our last algorithm was XGBoost, which gave the best result. Its accuracy was 0.8242; the F1-score and confusion matrix were then calculated.

COMPARISON

Decision Tree to XGBoost

If we have little time to develop a model, Random Forest is a good default choice. Generally speaking, Random Forest is a simple and flexible method that produces fast results, despite its limitations. [22]

  • Support Vector Machines (SVM) separate and classify points in the plane with a line or hyperplane.
  • We can control the balance between a hard margin and a soft margin with the parameter C. The margin gets smaller as C gets bigger; if the model overfits, C should be reduced.
  • Transformations that make data which cannot be separated in low dimensions separable by mapping it into a higher dimension are called kernel tricks.
  • The basic SVM algorithm can only classify data that is linearly separable. If the data is not linearly separable, we need the kernel trick; with it, the SVM algorithm can be applied to non-linear data.

AdaBoost is easier to use without hyper-parameter tuning than algorithms such as SVM, and it can also be used with SVM as its base classifier.

Some of the disadvantages of AdaBoost are:

  •  It can easily be affected by noisy data.
  •  Since it tries to fit every sample, outlier values affect the model negatively.

Why Does XGBoost Perform Well?

Gradient Boosting and XGBoost work on the same principle; the differences between them are in the details. XGBoost achieves much better prediction results through several techniques and is optimized to work on large datasets. The main features that differentiate it from plain Gradient Boosting are:

  • Hardware and Software Optimization
  • Pruning
  • Lambda
  • Handle Null Values

Lambda and pruning processes were mentioned above.

Handling null values: one of the biggest problems with data sets is missing values. Because the first prediction in XGBoost is 0.5 by default, residual values are also generated for rows with missing data. In the decision tree built for the next prediction, the residuals of the rows with missing data are tried in each possible branch and the gain is calculated for every case; the null values are then assigned to the branch that gives the higher score.

Weighted quantile sketch: XGBoost normally builds decision trees over all possible split points to maximize the gain for each variable; such algorithms are called greedy algorithms, and this process can take a very long time on large datasets. Instead of examining every value in the data, XGBoost divides the data into pieces (quantiles) and works on these pieces. As the number of pieces increases, the algorithm looks at smaller intervals and makes better splits, at the cost of longer training time; the number of pieces is 33 by default. The remaining problem with this approach is performance: to identify the pieces, each column must be sorted, the piece boundaries determined and trees built, which is slow. An algorithm called a “sketch” is used to overcome this; its purpose is to approximate the piece boundaries. XGBoost uses a weighted sketch algorithm, where Weight = Previous Prediction × (1 − Previous Prediction). The higher the weight, the more unstable the estimate. The pieces are determined according to these weights and divided into approximately equal total weights, so branches with unstable predictions are split into smaller intervals, which helps produce more accurate estimates.

System optimization: our computers have different types of memory, such as hard disk, RAM and cache. Cache is the fastest but also the smallest. If a program is to run fast, the cache should be used as much as possible; XGBoost computes the similarity score and the tree output values in cache, which is why its calculations are so quick.

RESULT AND CONCLUSION

Churn detection and analysis methods are very important for companies, so studies in this field, and attempts to achieve higher accuracy, increase every day. In our study we performed churn prediction with 5 different algorithms. In this study, in which we mostly used tree-based techniques, we obtained the best result with the XGBoost algorithm; the reasons why XGBoost gives the best results were discussed in the previous sections.

XGBoost is one of the best algorithms available in terms of both predictive power and execution speed. However, powerful as it is, the secret lies in optimizing the hyper-parameters: good values can be found by starting with a low learning rate and gamma and gradually updating these parameters. And although XGBoost is a very good algorithm, there is no rule that it will always give the best result; whenever possible, other algorithms should be tried as well. [6]

REFERENCES

1. Carolina Andrea Martinez Troncoso, “Predicting Customer Churn using Voice of the Customer. A Text Mining Approach”, 2018

2. Laureando Valentino Avon, “Machine learning techniques for customer churn prediction in banking environments”, 2016

3. Anders Schatzmann, Christoph Heitz and Thomas Münch, “Churn prediction based on text mining and CRM data analysis”, 2014

4. Gideon Dror, Dan Pelleg, Oleg Rokhlenko and Idan Szpektor, “Churn Prediction in New Users of Yahoo! Answers”, 2012

5. Ezgi Demir, Sait Erdal Dincer, “Determining The Faulty And Refund Products in Manufacturing System: Application on a Textile Firm”, 2020

6. https://www.datasciencearth.com/boosting-algoritmalari/

7. medium.com/deep-learning-turkiye

8. plug-n-score.com

9. https://hbr.org/1990/09/zero-defections-quality-comes-to-services

10. https://www.tutorialspoint.com/machine_learning_with_python/classification_algorithms_random_forest.htm

11. https://www.analyticsvidhya.com/blog/2021/04/beginners-guide-to-logistic-regression-using-python/

12. https://www.geeksforgeeks.org/advantages-and-disadvantages-of-logistic-regression/

13. Source: BTK (ICTA) 2017 4th Quarter Report

14. https://mljar.com/blog/feature-importance-in-random-forest/

15. https://towardsdatascience.com/is-random-forest-better-than-logistic-regression-a-comparison-7a0f068963e4

16. Google Images for Amazon Prime

17. https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

18. Kaitlin Kirasich, Trace Smith, Bivin Sadler, “Random Forest vs Logistic Regression: Binary Classification for Heterogeneous Datasets”, 2018

19. Source : PwC Survey 2017/18

20. https://www.techopedia.com/definition/33213/adaboost

21. Irfan Ullah, Basit Raza, Ahmad Kamran Malik, Muhammad Imran, Saif Ul Islam, Sung Won Kim. “A Churn Prediction Model Using Random Forest: Analysis of Machine Learning Techniques for Churn Prediction and Factor Identification in Telecom Sector”, 2019

22. devhunteryz.wordpress.com

23. Bagheri and Tarakh, 2015; Farquad,Ravi and Raju, 2014; Helfert and Heinrich, 2003

24. Nettleton, 2014

25. Farquad et al, 2014; Hejazinia and Kazemi, 2014

26. https://blog.paperspace.com/adaboost-optimizer/
