Keyphrase Extraction in Cluster Merging Based on Maximum Common Subgraph

Adhi Nurilham; Diana Purwitasari; Chastine Fatichah

Adhi Nurilham, Institut Teknologi Sepuluh Nopember
Diana Purwitasari, Institut Teknologi Sepuluh Nopember
Chastine Fatichah, Institut Teknologi Sepuluh Nopember

Keywords: cluster labeling, cluster merging, Frequent Phrase Mining, Maximum Common Subgraph, TopicRank

Abstract

Document clustering based on topic similarity helps users search a collection of scientific articles. Topic labels are necessary to describe the subject of each document cluster. Clusters with related subjects or contextual similarities can be merged to produce more descriptive labels. Relations between words in one context can be modelled as a graph. Instead of single words, this paper proposes labeling clusters of scientific articles with phrases, using graph-based cluster merging. The proposed method begins by clustering the scientific articles with K-Means++. Candidate phrases are then extracted from the document clusters using Frequent Phrase Mining, which is inspired by the Apriori algorithm. Each resulting cluster is represented by a graph built from the extracted phrases. An indicator value calculated with Maximum Common Subgraph (MCS) shows the structural similarity between graphs. Clusters are merged if their graphs are structurally similar. The topic labels of the clusters are keyphrases extracted with the TopicRank algorithm from the representation graph of the merged clusters. The contribution of this paper is the merging process, which considers the topic distribution within clusters for phrase extraction. The proposed method is evaluated by the topic coherence of the merged cluster labels. The results show that the proposed method can improve topic coherence on the merged clusters, with MCS graph size percentage as the key factor. Further observation shows that the merged cluster labels are consistent with the MCS graph.
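
The merge criterion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: each cluster graph uses phrases as uniquely labeled nodes, an edge joins two phrases that occur in the same document, and the similarity indicator is the number of shared edges divided by the edge count of the larger graph. With unique node labels, the maximum common subgraph reduces to the intersection of the two edge sets. The function names and the threshold of 0.3 are hypothetical.

```python
from itertools import combinations


def phrase_graph(documents):
    """Build an undirected co-occurrence graph as a set of edges.

    Each document is a list of phrases; an edge joins two distinct
    phrases that appear in the same document (an assumed construction).
    """
    edges = set()
    for phrases in documents:
        # Sorting gives every undirected edge one canonical orientation.
        for a, b in combinations(sorted(set(phrases)), 2):
            edges.add((a, b))
    return edges


def mcs_ratio(edges_a, edges_b):
    """Shared edges relative to the larger graph.

    Because node labels are unique, the common subgraph is simply
    the intersection of the two edge sets.
    """
    if not edges_a or not edges_b:
        return 0.0
    common = edges_a & edges_b
    return len(common) / max(len(edges_a), len(edges_b))


def should_merge(edges_a, edges_b, threshold=0.3):
    """Merge two clusters when their graphs overlap enough (threshold is illustrative)."""
    return mcs_ratio(edges_a, edges_b) >= threshold


# Toy clusters: each inner list holds the phrases mined from one document.
graph_a = phrase_graph([["topic model", "cluster labeling", "graph"],
                        ["topic model", "graph"]])
graph_b = phrase_graph([["topic model", "graph", "keyphrase"]])
graph_c = phrase_graph([["image", "segmentation"]])

print(should_merge(graph_a, graph_b))
print(should_merge(graph_a, graph_c))
```

In this toy data, clusters A and B share the edge between "topic model" and "graph", so their ratio reaches the threshold, while cluster C shares no edge with A. The indicator used in the paper may be defined differently.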
