Comparative Analysis of Text Similarity among Regional Languages in Indonesia Using the Naive Bayes and K-Nearest Neighbor Algorithms

  • Alfarizi * Universitas Tanjungpura, Indonesia
  • Herry Sujaini Universitas Tanjungpura, Indonesia
  • Niken Candraningrum Universitas Tanjungpura, Indonesia
Keywords: K-Nearest Neighbor; Naive Bayes; Regional Language

Abstract

Indonesia, an archipelagic country, has 718 regional languages, many of which face declining usage and even extinction. Technological developments make it possible to analyze the patterns and distinctive characteristics of regional languages through n-gram analysis with the Naive Bayes and K-Nearest Neighbor algorithms. This study analyzes the similarity of three regional languages (Central Javanese, Sundanese, and Pontianak Malay) as part of an effort to support the preservation of regional languages in Indonesia. Similarity between languages was calculated from the classification errors in the confusion matrix, and the performance of the algorithms was evaluated using accuracy and F1-score. The Naive Bayes algorithm with combined unigram and bigram features performed best, with an accuracy and F1-score of 0.921. The highest similarity was found for the ‘Javanese - Malay’ pair, although it was only 3.82%, and the lowest for the ‘Malay - Sundanese’ pair, at 1.66%. These similarity values reflect the dominant characters that appear in each language, such as ‘e’ in Malay and ‘a’ and ‘u’ in Sundanese. The results indicate that there is little similarity between Javanese, Sundanese, and Malay.
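The abstract does not spell out how similarity is derived from confusion-matrix errors, so the following is only a minimal sketch under one plausible reading: the similarity of a language pair is the number of texts confused between the two languages (in either direction) divided by the total number of test texts in both languages. The function name `pairwise_confusion_similarity` and all counts are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pairwise_confusion_similarity(labels, matrix):
    """Return a similarity score for every pair of labels.

    matrix[i][j] is the number of samples whose true label is labels[i]
    and whose predicted label is labels[j].

    Assumption (not confirmed by the source): similarity of a pair is the
    share of samples from the two classes that were mistaken for each other.
    """
    similarities = {}
    for i, j in combinations(range(len(labels)), 2):
        confused = matrix[i][j] + matrix[j][i]          # errors in both directions
        total = sum(matrix[i]) + sum(matrix[j])          # all samples of both classes
        similarities[(labels[i], labels[j])] = confused / total if total else 0.0
    return similarities


# Made-up counts for illustration only; these are not the paper's results.
labels = ["Javanese", "Sundanese", "Malay"]
matrix = [
    [90, 4, 6],   # true Javanese
    [3, 95, 2],   # true Sundanese
    [5, 1, 94],   # true Malay
]

similarity = pairwise_confusion_similarity(labels, matrix)
for pair, value in sorted(similarity.items(), key=lambda kv: -kv[1]):
    print(f"{pair[0]} - {pair[1]}: {value:.2%}")
```

Under this reading, a pair of languages that a classifier rarely confuses receives a low score, which matches the way the abstract interprets its small percentages as evidence of little similarity.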

Published
2025-11-30
Section
Articles