Named Entity Recognition in Statistical Dataset Search Queries

Wildannissa Pinasti; Lya Hulliyyatus Suadaa

doi:10.22146/jnteti.v13i3.11580

Wildannissa Pinasti, Program Studi Komputasi Statistik, Politeknik Statistika STIS, Jakarta Timur, DKI Jakarta 13330, Indonesia
Lya Hulliyyatus Suadaa, Program Studi Komputasi Statistik, Politeknik Statistika STIS, Jakarta Timur, DKI Jakarta 13330, Indonesia


Keywords: Named Entity Recognition, Query, Dataset Search, Conditional Random Fields, Linked Open Data

Abstract

Search engines must understand user queries to provide relevant search results. They can improve their understanding of user intent by employing named entity recognition (NER) to identify the entities in a query; knowing the entity types is an initial step toward better understanding search intent. In this research, a dataset was constructed from the search query history of the Statistics Indonesia (Badan Pusat Statistik, BPS) website, and NER in query modeling was used to extract entities from search queries related to statistical datasets. The research stages were query data collection, query data preprocessing, query data labeling, NER in query modeling, and model evaluation. A conditional random field (CRF) model was used for NER in query modeling in two scenarios: CRF with basic features, and CRF with basic features plus part-of-speech (POS) features. CRF was chosen for its well-known effectiveness in natural language processing (NLP), particularly for sequence labeling tasks such as NER. The basic CRF and the CRF with POS features achieved F1-scores of 0.9139 and 0.9110, respectively. A case study on a Linked Open Data (LOD) statistical dataset indicated that searches with synonym query expansion on entities from NER in query produced better search results than regular searches without query expansion. Because adding POS tagging features did not significantly improve model performance, future research is recommended to explore deep learning approaches.
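
The two CRF scenarios differ only in the per-token feature set passed to the model. The abstract does not list the features used, so the following is a minimal sketch under assumptions: the feature names, the example query, and the POS tags are all hypothetical, chosen only to show how a "basic" feature set can be extended with a POS feature. The resulting list of dictionaries is the kind of input that CRF toolkits such as sklearn-crfsuite typically accept.

```python
from typing import Dict, List, Optional


def token_features(tokens: List[str], i: int,
                   pos_tags: Optional[List[str]] = None) -> Dict[str, object]:
    """Build a feature dict for the token at position i.

    The feature set is illustrative (an assumption), not the one
    used in the paper.
    """
    word = tokens[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.isdigit": word.isdigit(),
        "word.istitle": word.istitle(),
        "suffix3": word[-3:],
        "BOS": i == 0,                 # beginning of query
        "EOS": i == len(tokens) - 1,   # end of query
    }
    # Context features from the neighbouring tokens, when they exist.
    if i > 0:
        feats["-1:word.lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats["+1:word.lower"] = tokens[i + 1].lower()
    # Second scenario: add the POS tag on top of the basic features.
    if pos_tags is not None:
        feats["pos"] = pos_tags[i]
    return feats


def query_features(tokens: List[str],
                   pos_tags: Optional[List[str]] = None) -> List[Dict[str, object]]:
    """Feature sequence for one query (one dict per token)."""
    return [token_features(tokens, i, pos_tags) for i in range(len(tokens))]


# Hypothetical query and POS tags, for illustration only.
query = "jumlah penduduk jakarta 2020".split()
pos = ["NOUN", "NOUN", "PROPN", "NUM"]

basic = query_features(query)           # scenario 1: basic features
with_pos = query_features(query, pos)   # scenario 2: basic + POS
```

Keeping the POS feature optional in a single function makes it straightforward to train and compare both scenarios on the same labeled queries.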

