Word2Vec-Based Self-Training Naive Bayes for Indonesian News Categorization

Joan Santoso; Agung Dewa Bagus Soetiono; Gunawan; Endang Setyati; Eko Mulyanto Yuniarno; Mochamad Hariadi; Mauridhi Hery Purnomo

Joan Santoso, Institut Teknologi Sepuluh Nopember
Agung Dewa Bagus Soetiono, Sekolah Tinggi Teknik Surabaya
Gunawan, Sekolah Tinggi Teknik Surabaya
Endang Setyati, Sekolah Tinggi Teknik Surabaya
Eko Mulyanto Yuniarno, Institut Teknologi Sepuluh Nopember
Mochamad Hariadi, Institut Teknologi Sepuluh Nopember
Mauridhi Hery Purnomo, Institut Teknologi Sepuluh Nopember

Keywords: News Categorization, Word2Vec, Skip-Gram, Self-Training, Naive Bayes, Semi-supervised Learning, Indonesian Language

Abstract

News, one kind of information needed in daily life, is now widely available on the internet. News websites often categorize their articles by topic to help users find news more easily. Document classification has been widely used to do this automatically. However, the labeled training data currently available is insufficient for building a good model, because annotating enough data requires considerable cost and time. Semi-supervised algorithms address this problem by using both labeled and unlabeled data to build a classification model. This paper proposes a semi-supervised news classification system using the Self-Training Naive Bayes algorithm. The features used for text classification come from the Word2Vec Skip-Gram model, which is widely used in computational linguistics and text mining research as a word representation method. Word2Vec is used as a feature because it can carry the semantic meaning of words into the classification task. The data used in this paper consist of 29,587 news documents from Indonesian online news websites. The Self-Training Naive Bayes algorithm achieved the highest F1-score of 94.17%.
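
The self-training idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a Gaussian Naive Bayes variant (the abstract does not say which variant is used), a hypothetical confidence threshold of 0.9, and toy two-dimensional vectors standing in for document vectors derived from Word2Vec Skip-Gram embeddings. All function names and the class labels "sports" and "politics" are illustrative.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-3):
    """Estimate per-class prior, feature means, and feature variances."""
    by_class = defaultdict(list)
    for vec, label in zip(X, y):
        by_class[label].append(vec)
    model = {}
    for label, rows in by_class.items():
        dims = len(rows[0])
        means = [sum(r[d] for r in rows) / len(rows) for d in range(dims)]
        # var_floor keeps variances strictly positive (assumed smoothing choice)
        variances = [
            sum((r[d] - means[d]) ** 2 for r in rows) / len(rows) + var_floor
            for d in range(dims)
        ]
        model[label] = (len(rows) / len(X), means, variances)
    return model


def predict_proba(model, vec):
    """Return posterior class probabilities for one vector."""
    log_scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, v in zip(vec, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        log_scores[label] = score
    top = max(log_scores.values())  # subtract the max for numerical stability
    exp_scores = {k: math.exp(s - top) for k, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {k: s / total for k, s in exp_scores.items()}


def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_rounds=10):
    """Repeatedly pseudo-label confident unlabeled vectors and retrain."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    model = fit_gaussian_nb(X_lab, y_lab)
    for _ in range(max_rounds):
        remaining, added = [], 0
        for vec in pool:
            probs = predict_proba(model, vec)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:
                X_lab.append(vec)
                y_lab.append(label)
                added += 1
            else:
                remaining.append(vec)
        pool = remaining
        if added == 0:
            break  # nothing confident enough; stop early
        model = fit_gaussian_nb(X_lab, y_lab)
        if not pool:
            break  # every unlabeled vector has been absorbed
    return model


# Toy stand-ins for document vectors (e.g. averaged word embeddings).
X_labeled = [(1.0, 0.1), (0.9, 0.0), (0.1, 1.0), (0.0, 0.9)]
y_labeled = ["sports", "sports", "politics", "politics"]
X_unlabeled = [(0.95, 0.05), (0.05, 0.95), (0.8, 0.2), (0.2, 0.8)]

model = self_train(X_labeled, y_labeled, X_unlabeled)
probs = predict_proba(model, (0.9, 0.1))
print(max(probs, key=probs.get))
```

In this sketch only predictions at or above the threshold are added to the labeled set, which limits the spread of early labeling mistakes at the cost of leaving some unlabeled documents unused.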
