Impact of Data Balancing and Feature Engineering on Accident Severity Models

Authors

  • Fayez ALANAZI Jouf University, College of Engineering, Civil Engineering Department
  • Aminu SULEIMAN Bayero University Kano, Faculty of Engineering, Department of Civil Engineering

DOI:

https://doi.org/10.7307/ptt.v37i3.856

Keywords:

accident severity, traffic safety, machine learning, feature engineering, data balancing, ADASYN, AutoML

Abstract

This study investigates how feature engineering techniques (Clustering, Target Encoding and Anomaly Detection), used in conjunction with data balancing methods, affect the performance of machine learning models for predicting road accident severity. Automated Machine Learning (AutoML), Distributed Random Forest (DRF), Boosted Regression Trees (BRT) and Deep Learning models were evaluated on datasets balanced using the SMOTE (Synthetic Minority Over-Sampling Technique) and ADASYN (Adaptive Synthetic Sampling) techniques. The evaluation metrics were Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Log Loss, Area under the Curve (AUC) and Area under the Precision-Recall Curve (AUCPR). The results show that AutoML consistently outperforms the other models, achieving 85% accuracy in predicting fatal accidents and 94% accuracy in predicting injuries. Deep Learning excels in injury accident prediction, with 95% accuracy, but struggles with fatalities, achieving only 60% accuracy. The study underscores the critical role of feature engineering and data balancing in enhancing predictive accuracy for accident severity classification. Specifically, incorporating Clustering, Target Encoding and Anomaly Detection alongside SMOTE and ADASYN balancing significantly improves model performance. Further refinement and validation are crucial for optimising model performance in real-world traffic safety management applications.
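As a rough illustration of the balancing step named in the abstract, the sketch below shows the core idea behind SMOTE: each synthetic minority record is placed at a random point on the line segment between an existing minority record and one of its nearest minority neighbours. It is a simplified, standard-library-only sketch and not the authors' pipeline; the function name `smote_oversample`, the toy `fatal` records and the parameter values are assumptions made for the example. ADASYN follows the same interpolation idea but generates more synthetic points around minority records that are harder to learn, which this sketch does not attempt.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration: it keeps only the core SMOTE idea
    and assumes purely numeric features and at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: two numeric features per fatal-crash record.
fatal = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_oversample(fatal, n_new=6)
balanced_fatal = fatal + new_points
print(len(balanced_fatal))  # prints 10
```

Because every synthetic point is a convex combination of two real minority records, it always lies inside the range already spanned by the minority class. In practice, oversampling of this kind is normally applied to the training split only, so that evaluation data contain no synthetic records.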

Published

05-06-2025

How to Cite

ALANAZI, F., & SULEIMAN, A. (2025). Impact of Data Balancing and Feature Engineering on Accident Severity Models. Promet - Traffic&Transportation, 37(3), 665–690. https://doi.org/10.7307/ptt.v37i3.856

Section

Articles