Enhancing Soil Liquefaction Prediction: Overcoming Data Challenges in SPT-Based Machine Learning with Imputation Technique
Abstract
Earthquakes damage buildings directly, and the loss of soil bearing capacity during liquefaction can make that damage worse. Liquefaction depends on many parameters, which makes it difficult to evaluate. In recent decades, machine learning has been studied as a way to handle this complexity. However, liquefaction datasets are often incomplete, and the resulting missing values complicate model development across datasets. This study therefore assesses the capability of machine learning models to predict liquefaction when a missing-value imputation technique is applied. Seismicity, soil property, and soil condition parameters were used to develop the models. Random Forest (RF), k-Nearest Neighbor (k-NN), and eXtreme Gradient Boosting (XGBoost) models were trained on standard penetration test (SPT) data with feature selection and parameter optimization. Model performance was assessed from the confusion matrix using the performance metrics Overall Accuracy (OA), Precision (Prec), Recall (Rec), and F1-Score (F1), together with the Area Under the Curve (AUC). The preprocessing stage also included data normalization and outlier treatment to improve the reliability of model predictions and to ensure consistent learning behavior across variables with different scales. The results show that RF achieved the highest performance (OA = 90.71%), which is comparable to the findings of previous studies. The AUC results indicate that the models deliver excellent classification performance. These findings suggest that integrating imputation and preprocessing techniques can significantly improve data-driven approaches in geotechnical earthquake engineering. In conclusion, missing-value imputation is quite effective for the predictive models. Finally, this study offers a new perspective on developing machine learning models with more user-friendly software and on applying imputation techniques to handle missing data.
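The workflow summarized in the abstract (normalize, impute missing values, train a classifier, score it from the confusion matrix and AUC) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not name the software or the imputation method used, so scikit-learn, k-NN imputation, min-max scaling, the hyperparameter values, and the synthetic data standing in for SPT case histories are all assumptions made for illustration. The function name `run_sketch` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


def run_sketch(seed=0, missing_rate=0.1):
    """Train an RF classifier on data with missing values and return metrics."""
    rng = np.random.default_rng(seed)

    # Synthetic stand-in for an SPT liquefaction dataset (assumption):
    # 8 numeric features, binary label (1 = liquefied, 0 = not liquefied).
    X, y = make_classification(n_samples=400, n_features=8,
                               n_informative=5, random_state=seed)

    # Blank out a random fraction of entries to mimic incomplete records.
    X = X.copy()
    X[rng.random(X.shape) < missing_rate] = np.nan

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)

    # Scale first (MinMaxScaler ignores NaN when fitting and keeps it when
    # transforming), so the k-NN imputer measures distances on a common
    # scale, then classify. The pipeline is fitted on the training split only.
    model = make_pipeline(
        MinMaxScaler(),
        KNNImputer(n_neighbors=5),
        RandomForestClassifier(n_estimators=200, random_state=seed),
    )
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

    return {
        "confusion": (int(tn), int(fp), int(fn), int(tp)),
        "OA": accuracy_score(y_test, y_pred),
        "Prec": precision_score(y_test, y_pred),
        "Rec": recall_score(y_test, y_pred),
        "F1": f1_score(y_test, y_pred),
        "AUC": roc_auc_score(y_test, y_prob),
    }


if __name__ == "__main__":
    for name, value in run_sketch().items():
        print(name, value)
```

Keeping the scaler and imputer inside the pipeline means their statistics come from the training split alone, which avoids leaking test information into preprocessing. The RF step could be swapped for a k-NN or XGBoost classifier to compare the three model families named above.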
Copyright (c) 2026 The Author(s)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Copyright is granted to authors to protect articles written to describe experiments and their results. JCEF will protect and defend the work and reputation of the author and is also willing to address any allegations of violation, plagiarism, fraud, etc. against articles written and published by JCEF. JCEF is published under the terms of the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). The author holds the copyright and assigns the journal rights to the first publication (online and print) of the work simultaneously.




