Impact of Data Balancing and Feature Engineering on Accident Severity Models

Keywords: accident severity, traffic safety, machine learning, feature engineering, data balancing, ADASYN, AutoML

Authors

  • Fayez ALANAZI
    fkalanazi@ju.edu.sa
    Jouf University, College of Engineering, Civil Engineering Department, Saudi Arabia
  • Aminu SULEIMAN
    Bayero University Kano, Faculty of Engineering, Department of Civil Engineering, Nigeria

This study investigates how feature engineering techniques (Clustering, Target Encoding and Anomaly Detection), combined with data balancing methods, affect the performance of machine learning models for predicting road accident severity. Automated Machine Learning (AutoML), Distributed Random Forest (DRF), Boosted Regression Trees (BRT) and Deep Learning models were evaluated on datasets balanced with SMOTE (Synthetic Minority Over-Sampling Technique) and ADASYN (Adaptive Synthetic Sampling). The models were assessed using Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Log Loss, Area Under the Curve (AUC) and Area Under the Precision-Recall Curve (AUCPR). AutoML was the strongest model overall, achieving 85% accuracy in predicting fatal accidents and 94% accuracy in predicting injury accidents. Deep Learning performed well on injury accidents, with 95% accuracy, but struggled with fatal accidents, reaching only 60% accuracy. The results underscore the critical role of feature engineering and data balancing in accident severity classification: incorporating Clustering, Target Encoding and Anomaly Detection alongside SMOTE and ADASYN balancing significantly improved model performance. Further refinement and validation are needed before these models can be applied in real-world traffic safety management.
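To illustrate the balancing idea the abstract refers to, the following is a minimal sketch of SMOTE-style oversampling in plain Python. It is not the authors' pipeline: the function name `smote_oversample`, the toy features (speed and driver age) and all values are assumptions made for illustration. Each synthetic point is placed on the line segment between a minority-class sample and one of its k nearest minority-class neighbours.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length numeric feature tuples (illustrative input).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # The new point lies on the segment between base and neighbour
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: (speed_kmh, driver_age) for the rare "fatal" class.
fatal = [(110.0, 23.0), (120.0, 31.0), (95.0, 19.0), (130.0, 45.0)]
new_points = smote_oversample(fatal, n_new=6, k=2)
balanced_fatal = fatal + new_points
```

ADASYN follows the same interpolation step but generates more synthetic points around minority samples that have many majority-class neighbours, so harder regions of the feature space receive more synthetic data. In practice, synthetic samples would normally be generated from the training split only, so that evaluation data remains untouched.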