Performance Analysis of XGBoost in Handling Missing Data on the Telco Customer Churn Dataset
Abstract
This study analyzes how well the Extreme Gradient Boosting (XGBoost) algorithm handles missing data in telecommunications customer churn prediction. The objective is to compare the effect of three missing data imputation techniques (mean, k-NN, and MICE) on XGBoost performance using the IBM Telco Customer Churn dataset. The methodology comprises data preprocessing, implementation of the imputation techniques, XGBoost model training, and evaluation using accuracy, precision, recall, and F1-score. The results show that MICE imputation yields the best performance, with 81.24% accuracy, 69.80% precision, 58.40% recall, and 63.60% F1-score, compared with 79.43% accuracy for XGBoost without imputation. These findings indicate that explicit missing data handling can enhance XGBoost's ability to identify customers who are likely to churn. In practical terms, the results offer guidance to the telecommunications industry on optimizing customer retention strategies through more accurate churn prediction.
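The four evaluation metrics named in the abstract are all derived from a binary confusion matrix, and the F1-score is the harmonic mean of precision and recall. The following is a minimal sketch of those definitions, not the authors' evaluation code. The function names and the confusion-matrix counts are hypothetical and chosen only for illustration; the only figures taken from the abstract are the reported MICE precision (69.80%) and recall (58.40%).

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    The positive class is assumed to be "customer churns".
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_from_precision_recall(precision, recall),
    }


# Hypothetical counts, for illustration only (not taken from the study).
example = classification_metrics(tp=3, fp=1, fn=2, tn=4)
print(example)

# Consistency check on the MICE figures reported in the abstract.
reported_f1 = f1_from_precision_recall(0.6980, 0.5840)
print(round(reported_f1 * 100, 2))  # 63.59
```

Recomputing F1 from the rounded precision and recall gives about 63.59%, which agrees with the reported 63.60% to within rounding of the inputs.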
Copyright (c) 2026 Muhammad Riki Atsauri, Aulia Rahman Dalimunthe, Nugroho Syahputra

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under Creative Commons Attribution 4.0 International License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (Refer to The Effect of Open Access).




