Optimization of the KNN Algorithm through Outlier Analysis Comparison (Distance, Density, LOF-Based)

  • Fitri Ayuning Tyas, Program Studi Sistem Informasi, STMIK Muhammadiyah Paguyangan Brebes, Brebes, Jawa Tengah 52276, Indonesia
  • Mahda Nurayuni, Program Studi Sistem Informasi, STMIK Muhammadiyah Paguyangan Brebes, Brebes, Jawa Tengah 52276, Indonesia
  • Hidayatur Rakhmawati, Program Studi Sistem Informasi, STMIK Muhammadiyah Paguyangan Brebes, Brebes, Jawa Tengah 52276, Indonesia
Keywords: K-Nearest Neighbors, Outlier, Density, Distance, LOF, Friedman Test, Nemenyi Test

Abstract

The current growth of data affects data analysis in various fields, such as astronomy, business, medicine, education, and finance. Collected and stored data often contain extreme values, that is, observations that differ markedly from most other observations. These extreme values are called outliers. Outliers often hold valuable information, so they must be examined carefully to decide whether to retain or discard them before data mining is applied. Outlier detection can be performed as part of data preprocessing using outlier analysis techniques. Commonly used techniques include distance-based methods, density-based methods, and the local outlier factor (LOF) method. k-nearest neighbors (KNN) is a data mining algorithm that is susceptible to outliers because of its reliance on the value of k. Hence, an appropriate handling mechanism is essential when KNN is applied to datasets that contain outliers. This study used an experimental method to optimize the KNN algorithm by comparing three outlier analysis methods (KNN-distance, KNN-density, and KNN-LOF). The results showed that KNN-density significantly outperformed the others, achieving an average accuracy of 99.34% at k=3 and k=5 for Wisconsin Breast Cancer, 85.25% at k=7 for Glass, and 85.45% at k=5 for Lymphography. Moreover, both the Friedman and Nemenyi tests confirmed a significant difference between KNN-density and KNN-LOF.
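The preprocessing idea in the abstract (score outliers, discard them, then classify with KNN) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the distance-based score is the distance from a point to its k-th nearest neighbor, that a fixed number of top-scoring points is discarded, and that Euclidean distance is used. The function names, parameters, and toy data are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kth_neighbor_distance(points, k):
    """Assumed outlier score: distance from each point to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(euclidean(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def drop_top_outliers(points, labels, k, n_outliers):
    """Remove the n_outliers points with the largest scores (hypothetical helper)."""
    scores = kth_neighbor_distance(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_outliers])
    kept = [i for i in range(len(points)) if i not in flagged]
    return [points[i] for i in kept], [labels[i] for i in kept]


def knn_predict(train_x, train_y, query, k):
    """Plain majority-vote KNN over the k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data: two tight clusters plus one isolated point labeled "a".
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
          (3.2, 3.2)]
labels = ["a", "a", "a", "b", "b", "b", "a"]

clean_x, clean_y = drop_top_outliers(points, labels, k=2, n_outliers=1)

query = (3.4, 3.4)
before = knn_predict(points, labels, query, k=1)
after = knn_predict(clean_x, clean_y, query, k=1)
print(before, after)
```

On this toy data the isolated point receives the largest score and is removed, so the prediction for the query changes once it is gone. A density-based or LOF-based variant would replace only the scoring function; the filtering and classification steps would stay the same.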

Published
2024-05-31
How to Cite
Fitri Ayuning Tyas, Mahda Nurayuni, & Hidayatur Rakhmawati. (2024). Optimization of the KNN Algorithm through Outlier Analysis Comparison (Distance, Density, LOF-Based). Jurnal Nasional Teknik Elektro Dan Teknologi Informasi, 13(2), 108-115. https://doi.org/10.22146/jnteti.v13i2.9579