Comparative Analysis of Text Similarity among Regional Languages in Indonesia Using the Naive Bayes and K-Nearest Neighbor Algorithms
Abstract
Indonesia, an archipelagic country, has 718 regional languages, many of which face declining usage and even extinction. Technological developments make it possible to analyze the patterns and distinctive characteristics of regional languages through n-gram analysis with the Naive Bayes and k-nearest neighbor algorithms. This study therefore analyzes the similarity of three regional languages, Central Javanese, Sundanese, and Pontianak Malay, as part of an effort to support the preservation of regional languages in Indonesia. Similarity between languages was calculated from the misclassification errors in the confusion matrix, and algorithm performance was evaluated with the accuracy and F1-score metrics. The Naive Bayes algorithm with combined unigram and bigram features performed best, with an accuracy and F1-score of 0.921. The highest similarity was found for the Javanese-Malay pair, at only 3.82%, and the lowest for the Malay-Sundanese pair, at 1.66%. These similarity values reflect the dominant characters that appear in each language, such as ‘e’ in Malay and ‘a’ and ‘u’ in Sundanese. The results indicate that Javanese, Sundanese, and Malay share little similarity.
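The abstract does not give the exact formula used to turn confusion-matrix errors into a similarity value, so the following is only a minimal sketch under one plausible assumption: the similarity of two languages is the share of their test samples that the classifier confuses with each other. The function names and the counts in `matrix` are hypothetical and chosen for illustration; they are not the study's data or results.

```python
from itertools import combinations


def pairwise_similarity(matrix, labels):
    """Share of samples from each language pair that were confused with each other.

    Assumed definition (not confirmed by the source):
    (A predicted as B + B predicted as A) / (samples of A + samples of B).
    Rows of `matrix` are true labels, columns are predicted labels.
    """
    sims = {}
    for i, j in combinations(range(len(labels)), 2):
        confused = matrix[i][j] + matrix[j][i]
        total = sum(matrix[i]) + sum(matrix[j])
        sims[(labels[i], labels[j])] = confused / total if total else 0.0
    return sims


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    n = len(matrix)
    scores = []
    for k in range(n):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp
        fp = sum(matrix[r][k] for r in range(n)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Hypothetical counts for illustration only (rows: true, columns: predicted).
labels = ["Javanese", "Sundanese", "Malay"]
matrix = [
    [90, 4, 6],
    [5, 93, 2],
    [4, 1, 95],
]

sims = pairwise_similarity(matrix, labels)
for (a, b), value in sims.items():
    print(f"{a} - {b}: {value:.2%}")
print(f"accuracy: {accuracy(matrix):.3f}, macro F1: {macro_f1(matrix):.3f}")
```

Under this definition, a pair that the classifier rarely mixes up receives a low similarity value, which is consistent with how the abstract reads the reported percentages; a different normalization (for example, dividing by all test samples) would change the numbers but not the ranking logic.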
Copyright (c) 2025 Alfarizi, Herry Sujaini, Niken Candraningrum

This work is licensed under a Creative Commons Attribution 4.0 International License.


