Comparison of Term Weighting and Word Embedding on Local Government Tweet Classification

Pande Made Risky Cahya Dinatha; Nur Aini Rakhmawati

doi:10.22146/jnteti.v9i2.90

Pande Made Risky Cahya Dinatha Sepuluh Nopember Institute of Technology
Nur Aini Rakhmawati Sepuluh Nopember Institute of Technology https://orcid.org/0000-0002-1321-4564

Keywords: Classification, Term Weighting, Word Embedding, Social Media, Short Text, Twitter

Abstract

The emergence of social media has encouraged governments to use it to disseminate information to their citizens. This information must be beneficial to the public in order to maintain government-to-citizen relationships. Classifying social media posts makes it possible to categorize the types of posts. Such a study has been conducted on local government social media accounts, yet the text processing in that research needs further exploration. Term weighting and word embedding are implemented in this research. The purpose is to compare the term weighting methods term frequency-inverse document frequency (TF-IDF) and Okapi BM25 with the word embedding method doc2vec in producing features for short text classification. This study presents the feature selection process, how to assess the classification models, and the best model for the short text classification problem. There are six classes to categorize 1,000 short texts from 91 accounts. Precision, recall, F1 score, macro-averages, micro-averages, and AUC were calculated for each model. The results show that the SVM with a linear kernel and TF-IDF performs best, slightly better than logistic regression, with 0.572 and 0.766 on macro-average recall and micro-average recall, respectively.
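
The two recall figures reported above differ because macro-averaging weights every class equally, while micro-averaging weights every text equally and is therefore dominated by the frequent classes. The following is a minimal sketch of that difference, not the study's evaluation code; the labels `info` and `event` and the toy predictions are hypothetical and do not come from the study's data.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label that occurs in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in support}


def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls (each class counts once)."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def micro_recall(y_true, y_pred):
    """Recall pooled over all texts (each text counts once)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical, imbalanced toy labels: 8 "info" texts and 2 "event" texts.
y_true = ["info"] * 8 + ["event"] * 2
# The classifier gets every "info" text right but misses one "event" text.
y_pred = ["info"] * 8 + ["info", "event"]

print(macro_recall(y_true, y_pred))  # 0.75
print(micro_recall(y_true, y_pred))  # 0.9
```

In this toy case the rare class has a recall of 0.5, which pulls the macro-average down to 0.75, while the micro-average stays at 0.9. A gap in the same direction between the reported values would be consistent with uneven performance across the six classes, although the abstract itself does not state the class distribution.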

Author Biography

Nur Aini Rakhmawati, Sepuluh Nopember Institute of Technology

Lecturer, Department of System Information
