Performance Analysis of XGBoost in Handling Missing Data on the Telco Customer Churn Dataset

Muhammad Riki Atsauri; Aulia Rahman Dalimunthe; Nugroho Syahputra

doi:10.47065/bit.v7i1.2524

Muhammad Riki Atsauri * Politeknik Negeri Medan, Indonesia
Aulia Rahman Dalimunthe Politeknik Negeri Medan, Indonesia
Nugroho Syahputra Politeknik Negeri Medan, Indonesia

Abstract

This study analyzes how well the Extreme Gradient Boosting (XGBoost) algorithm handles missing data when predicting customer churn in telecommunications. The objective is to compare the effect of three missing-data imputation techniques (mean, k-NN, and MICE) on XGBoost performance using the IBM Telco Customer Churn dataset. The methodology comprises data preprocessing, implementation of the imputation techniques, XGBoost model training, and evaluation using accuracy, precision, recall, and F1-score. The results show that MICE imputation yields the best performance, with 81.24% accuracy, 69.80% precision, 58.40% recall, and a 63.60% F1-score, compared with 79.43% accuracy for XGBoost without imputation. These findings indicate that explicit missing-data handling can improve XGBoost's ability to identify customers likely to churn. In practice, the results offer guidance for the telecommunications industry in optimizing customer retention strategies through more accurate churn prediction.
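The four evaluation metrics named in the abstract can all be derived from a confusion matrix. The following is a minimal sketch in plain Python, not the authors' code: the function names and the example counts are hypothetical and chosen only for illustration. The last lines check that the reported precision and recall are consistent with the reported F1-score, using the standard definition of F1 as the harmonic mean of precision and recall.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp/fp/fn/tn are counts for the positive class (here: churners).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Hypothetical counts, NOT taken from the study.
example = classification_metrics(tp=50, fp=10, fn=20, tn=120)
print(example)

# Consistency check on the figures reported in the abstract (MICE + XGBoost).
reported_precision = 0.6980
reported_recall = 0.5840
f1_check = f1_from(reported_precision, reported_recall)
print(f"F1 implied by reported precision and recall: {f1_check:.4f}")
```

The implied value agrees with the reported 63.60% F1-score to within the rounding of the published precision and recall. Note that recall, not accuracy, measures how many actual churners the model catches, which is why the abstract reports all four metrics rather than accuracy alone.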
