Clustering topic groups of documents using K-Means algorithm: Australian Embassy Jakarta media releases 2006-2016

Wishnu Hardi; Wisnu Ananta Kusuma; Sulistyo Basuki

doi:10.22146/bip.36451

Wishnu Hardi (1*), Wisnu Ananta Kusuma (2), Sulistyo Basuki (3)

(1) National Library of Australia
(2) Institut Pertanian Bogor
(3) Universitas Indonesia
(*) Corresponding Author

Abstract

Introduction. The Australian Embassy in Jakarta stores a wide array of media release documents. Analyzing the key patterns in this document collection is imperative because it yields new insights into, and knowledge of, the significant topic groups of the documents.

Methodology. The K-Means algorithm was used as a non-hierarchical clustering method that partitions data objects into clusters. The method works by minimizing data variation within each cluster and maximizing data variation between clusters.
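The partitioning idea described above can be illustrated with a minimal sketch of the standard K-Means iteration (Lloyd's algorithm). This is not the authors' implementation: the function names `kmeans` and `within_cluster_ss`, the two-dimensional sample points, and the choice of initial centroids are all assumptions made for illustration, and the sketch uses Euclidean distance on toy data rather than the term vectors of the study.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids


def within_cluster_ss(points, labels, centroids):
    """Sum of squared distances from each point to its own centroid."""
    return sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


# Hypothetical toy data: two visibly separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
wcss = within_cluster_ss(points, labels, centroids)
```

The within-cluster sum of squares returned by `within_cluster_ss` is the quantity that each iteration does not increase; a lower value for a fixed number of clusters indicates tighter clusters.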

Data Analysis. Of the documents issued between 2006 and 2016, 839 were examined to determine term frequencies and to generate clusters. An expert was nominated to validate the clustering result.
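As a rough illustration of the term-frequency step, the sketch below counts terms in a few short texts and compares the resulting vectors with cosine similarity, which appears among the article's keywords. The abstract does not state how the authors computed either quantity, so the tokenization by whitespace, the stopword list, the helper names `term_frequencies` and `cosine_similarity`, and the three sample sentences are all assumptions invented for this example.

```python
from collections import Counter
from math import sqrt


def term_frequencies(text, stopwords=frozenset()):
    """Count lowercase whitespace-separated tokens, skipping stopwords."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    return Counter(tokens)


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    shared = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = sqrt(sum(v * v for v in tf_a.values()))
    norm_b = sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample texts, not taken from the embassy's media releases.
docs = [
    "scholarship program for students",
    "students awarded scholarship",
    "trade agreement boosts trade",
]
stop = frozenset({"for"})
tfs = [term_frequencies(d, stop) for d in docs]

sim_related = cosine_similarity(tfs[0], tfs[1])    # share two of three terms
sim_unrelated = cosine_similarity(tfs[0], tfs[2])  # share no terms
```

Documents that share more terms score closer to 1, and documents with no terms in common score 0, which is what makes the measure usable for grouping texts by topic.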

Results and discussions. The results showed that 57 meaningful terms were grouped into 3 clusters. “People to people links”, “economic cooperation”, and “human development” were chosen to represent the topics of the Australian Embassy Jakarta media releases from 2006 to 2016.

Conclusions. Text mining can be used to cluster topic groups of documents. It provides a more systematic clustering process as the text analysis is conducted through a number of stages with specifically set parameters.

Keywords

Text mining; document clustering; K-Means algorithm; Cosine Similarity

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
