Applying SMOTE-NC on CART Algorithm to Handle Imbalanced Data in Customer Churn Prediction: A Case Study of Telecommunications Industry

  • Ilma Amira Rahmayanti Statistics Study Program, Faculty of Science and Technology, University of Airlangga, Surabaya, Indonesia
  • Sediono Sediono Statistics Study Program, Faculty of Science and Technology, University of Airlangga, Surabaya, Indonesia
  • Toha Saifudin Statistics Study Program, Faculty of Science and Technology, University of Airlangga, Surabaya, Indonesia
  • Elly Ana Statistics Study Program, Faculty of Science and Technology, University of Airlangga, Surabaya, Indonesia
Keywords: SMOTE, CART, decision tree, machine learning, customer churn prediction

Abstract

Telecommunications is now needed in virtually every area of life, which has made competition among telecommunications companies intense. One strategic way for a company to protect its position is to retain existing customers. A retention program must be implemented precisely and efficiently so that the company keeps as many customers as possible, and customer churn prediction plays an essential role in this. However, class imbalance in the data can increase prediction errors. To overcome this issue, this study combined the Synthetic Minority Oversampling Technique – Nominal Continuous (SMOTE-NC) with Classification and Regression Trees (CART). SMOTE-NC was applied to balance the classes in the training data, and CART then built a classification tree from the balanced data. This classification tree served as the basis for predicting customer churn. The data used in this study come from https://community.ibm.com/ and contain variables on customer demographics, customer contracts, usage history, and customer status for one telecommunications company. Based on the analysis of these data, the combination of SMOTE-NC and CART reduced errors in predicting customer churn and increased recall by approximately 19%. Moreover, the accuracy of the combined method remained above 75%. This study therefore proposes an effective way to improve the performance of churn prediction, especially in the telecommunications industry.
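
To make the oversampling step concrete, the following is a minimal standard-library sketch of how SMOTE-NC can generate synthetic minority rows from mixed continuous and nominal features. It is an illustration under stated assumptions, not the implementation used in the study: the function name `smote_nc`, the toy rows, and the feature layout (tenure, monthly charge, contract type) are hypothetical, and base rows are drawn at random rather than visited in turn. In practice a library routine such as the `SMOTENC` class in imbalanced-learn would normally be applied to the training data only, followed by a CART-style tree learner.

```python
import random
import statistics
from collections import Counter


def smote_nc(minority, cont_idx, nom_idx, n_new, k=5, seed=0):
    """Return n_new synthetic rows built from the minority-class rows.

    Hypothetical helper for illustration; rows are lists of mixed features.
    """
    rng = random.Random(seed)
    # Penalty for a nominal mismatch: the median of the standard
    # deviations of the continuous features in the minority class.
    med = statistics.median(
        [statistics.pstdev([row[j] for row in minority]) for j in cont_idx]
    )

    def dist2(a, b):
        # Squared Euclidean distance on continuous features, plus med^2
        # for every nominal feature on which the two rows differ.
        d = sum((a[j] - b[j]) ** 2 for j in cont_idx)
        d += sum(med ** 2 for j in nom_idx if a[j] != b[j])
        return d

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)  # simplification: random base row
        neighbours = sorted(
            (r for r in minority if r is not base),
            key=lambda r: dist2(base, r),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()
        new = list(base)
        for j in cont_idx:
            # Continuous features: interpolate between base and neighbour.
            new[j] = base[j] + gap * (mate[j] - base[j])
        for j in nom_idx:
            # Nominal features: most frequent value among the k neighbours.
            new[j] = Counter(r[j] for r in neighbours).most_common(1)[0][0]
        synthetic.append(new)
    return synthetic


# Toy minority-class (churn) rows: [tenure_months, monthly_charge, contract].
minority = [
    [2, 70.5, "month-to-month"],
    [5, 85.0, "month-to-month"],
    [1, 90.2, "month-to-month"],
    [12, 60.0, "one-year"],
    [3, 75.3, "month-to-month"],
]
new_rows = smote_nc(minority, cont_idx=[0, 1], nom_idx=[2], n_new=10, k=3)
```

The synthetic rows would be appended to the training set until the two classes are balanced; the test set is left untouched so that recall and accuracy are measured on the original class distribution.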

Published
2021-12-22