Related papers: A Generative Deep Learning Approach for Crash Severity Modeling with Imbalanced Data

URL: http://arxiv.org/abs/2404.02187v1
Date: Tue, 2 Apr 2024 16:07:27 GMT
Title: A Generative Deep Learning Approach for Crash Severity Modeling with Imbalanced Data
Authors: Junlan Chen, Ziyuan Pu, Nan Zheng, Xiao Wen, Hongliang Ding, Xiucheng Guo
Abstract summary: This study proposes a crash data generation method based on Conditional Tabular GAN. A crash severity model is employed to estimate the performance of classification and interpretation. The results indicate that using synthetic data generated by CTGAN-RU for crash severity modeling outperforms original data or synthetic data generated by other resampling methods.
Score: 6.169163527464771
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Crash data is often highly imbalanced: the majority of crashes are non-fatal, and only a small number are fatal. This imbalance poses a challenge for crash severity modeling, since models struggle to fit and interpret fatal crash outcomes with very limited samples. Such imbalance is usually addressed with data resampling methods, such as under-sampling and over-sampling techniques. However, most traditional and deep learning-based data resampling methods, such as the synthetic minority oversampling technique (SMOTE) and generative adversarial networks (GAN), are designed for continuous variables. Though some resampling methods have been improved to handle both continuous and discrete variables, they may have difficulty dealing with the collapse issue associated with sparse discrete risk factors. Moreover, there is a lack of comprehensive studies that compare the performance of various resampling methods in crash severity modeling. To address these issues, the current study proposes a crash data generation method based on the Conditional Tabular GAN. After data balancing, a crash severity model is employed to estimate the performance of classification and interpretation. A comparative study is conducted to assess the classification accuracy and distribution consistency of the proposed generation method using a 4-year imbalanced crash dataset collected in Washington State, U.S. Additionally, Monte Carlo simulation is employed to estimate the performance of parameter and probability estimation in both two- and three-class imbalance scenarios. The results indicate that using synthetic data generated by CTGAN-RU for crash severity modeling outperforms using original data or synthetic data generated by other resampling methods.
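
To make the resampling vocabulary in the abstract concrete, here is a minimal sketch of the two classical ideas it mentions: random under-sampling of the majority class and SMOTE-style over-sampling of the minority class. It is not the paper's CTGAN-based method. The toy data, the function names, and the class sizes are all assumptions chosen for illustration, and the over-sampler is a simplified interpolation scheme rather than a full SMOTE implementation.

```python
import numpy as np


def random_undersample(X, y, majority_label, n_keep, rng):
    """Keep n_keep randomly chosen majority rows plus every other row."""
    maj_idx = np.flatnonzero(y == majority_label)
    other_idx = np.flatnonzero(y != majority_label)
    kept = rng.choice(maj_idx, size=n_keep, replace=False)
    idx = np.sort(np.concatenate([kept, other_idx]))
    return X[idx], y[idx]


def smote_like_oversample(X_min, n_new, k, rng):
    """Interpolate between minority points and their k nearest minority
    neighbours. Only meaningful for continuous features."""
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    nb = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the connecting segment
    return X_min[base] + gap * (X_min[nb] - X_min[base])


# Hypothetical toy data: 95 non-fatal (0) and 5 fatal (1) crashes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)

# Shrink the majority class to 20 rows, then grow the minority class to 20.
X_ru, y_ru = random_undersample(X, y, majority_label=0, n_keep=20, rng=rng)
X_min = X_ru[y_ru == 1]
X_new = smote_like_oversample(X_min, n_new=15, k=3, rng=rng)

X_bal = np.vstack([X_ru, X_new])
y_bal = np.concatenate([y_ru, np.ones(len(X_new), dtype=int)])
```

Because each synthetic row lies on a line segment between two real rows, a 0/1-coded discrete risk factor can receive a fractional value such as 0.37. That is one way to see why the abstract describes such methods as designed for continuous variables.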

Related papers

Modeling of AUV Dynamics with Limited Resources: Efficient Online Learning Using Uncertainty [9.176056742068814]
This work investigates the use of uncertainty in the selection of data points to rehearse in online learning when storage capacity is constrained. We present three novel approaches: the Threshold method, which excludes samples with uncertainty below a specified threshold, the Greedy method, designed to maximize uncertainty among the stored points, and Threshold-Greedy, which combines the previous two approaches.
arXiv Detail & Related papers (2025-04-06T18:48:55Z)
Spatiotemporal Prediction of Secondary Crashes by Rebalancing Dynamic and Static Data with Generative Adversarial Networks [6.571659350175123]
Secondary crashes significantly exacerbate traffic congestion and increase the severity of incidents. Existing methods fail to fully address the complexity of traffic crash data, particularly the coexistence of dynamic and static features. This study proposes a hybrid model named VarFusiGAN-Transformer, aimed at improving the fidelity of secondary crash data generation.
arXiv Detail & Related papers (2025-01-17T08:56:49Z)
Enhancing Crash Frequency Modeling Based on Augmented Multi-Type Data by Hybrid VAE-Diffusion-Based Generative Neural Networks [13.402051372401822]
A key challenge in crash frequency modelling is the prevalence of excessive zero observations. We propose a hybrid VAE-Diffusion neural network, designed to reduce zero observations. We assess the synthetic data quality generated by this model through metrics like similarity, accuracy, diversity, and structural consistency.
arXiv Detail & Related papers (2025-01-17T07:53:27Z)
Crash Severity Risk Modeling Strategies under Data Imbalance [7.9613232032536745]
This study investigates crash severity risk modeling strategies for work zones involving large vehicles when there is a crash data imbalance between low-severity (LS) and high-severity (HS) crashes. We utilized crash data involving large vehicles in South Carolina work zones for the period between 2014 and 2018, which included 4 times more LS crashes than HS crashes. The findings of this study highlight a disparity between LS and HS predictions, with less accurate prediction of HS crashes than of LS crashes due to class imbalance and feature overlaps between LS and HS crashes.
arXiv Detail & Related papers (2024-12-03T02:28:35Z)
Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop. We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models. We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
Sample, estimate, aggregate: A recipe for causal discovery foundation models [28.116832159265964]
We train a supervised model that learns to predict a larger causal graph from the outputs of classical causal discovery algorithms run over subsets of variables. Our approach is enabled by the observation that typical errors in the outputs of classical methods remain comparable across datasets. Experiments on real and synthetic data demonstrate that this model maintains high accuracy in the face of misspecification or distribution shift.
arXiv Detail & Related papers (2024-02-02T21:57:58Z)
SCME: A Self-Contrastive Method for Data-free and Query-Limited Model Extraction Attack [18.998300969035885]
Model extraction attacks fool the target model by generating adversarial examples on a substitute model. We propose a novel data-free model extraction method named SCME, which considers both the inter- and intra-class diversity in synthesizing fake data.
arXiv Detail & Related papers (2023-10-15T10:41:45Z)
Improving the Robustness of Summarization Models by Detecting and Removing Input Noise [50.27105057899601]
We present a large empirical study quantifying the sometimes severe loss in performance from different types of input noise for a range of datasets and model sizes. We propose a light-weight method for detecting and removing such noise in the input during model inference without requiring any training, auxiliary models, or even prior knowledge of the type of noise.
arXiv Detail & Related papers (2022-12-20T00:33:11Z)
Effective Class-Imbalance learning based on SMOTE and Convolutional Neural Networks [0.1074267520911262]
Imbalanced Data (ID) is a problem that deters Machine Learning (ML) models from achieving satisfactory results. In this paper, we investigate the effectiveness of methods based on Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs). In order to achieve reliable results, we conducted our experiments 100 times with randomly shuffled data distributions.
arXiv Detail & Related papers (2022-09-01T07:42:16Z)
Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective. Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z)
Towards Synthetic Multivariate Time Series Generation for Flare Forecasting [5.098461305284216]
One of the limiting factors in training data-driven, rare-event prediction algorithms is the scarcity of the events of interest. In this study, we explore the usefulness of the conditional generative adversarial network (CGAN) as a means to perform data-informed oversampling.
arXiv Detail & Related papers (2021-05-16T22:23:23Z)
Oversampling Adversarial Network for Class-Imbalanced Fault Diagnosis [12.526197448825968]
The class-imbalance problem requires a robust learning system that can predict and classify the data in a timely manner. We propose a new adversarial network for simultaneous classification and fault detection.
arXiv Detail & Related papers (2020-08-07T10:12:07Z)
Improving Maximum Likelihood Training for Text Generation with Density Ratio Estimation [51.091890311312085]
We propose a new training scheme for auto-regressive sequence generative models, which is effective and stable when operating in the large sample space encountered in text generation. Our method stably outperforms Maximum Likelihood Estimation and other state-of-the-art sequence generative models in terms of both quality and diversity.
arXiv Detail & Related papers (2020-07-12T15:31:24Z)
Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model. The objective is to endow the trained model with robustness against adversarially manipulated input data. Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
Instability, Computational Efficiency and Statistical Accuracy [101.32305022521024]
We develop a framework that yields statistical accuracy based on the interplay between the deterministic convergence rate of the algorithm at the population level and its degree of (in)stability when applied to an empirical object based on $n$ samples. We provide applications of our general results to several concrete classes of models, including Gaussian mixture estimation, non-linear regression models, and informative non-response models.
arXiv Detail & Related papers (2020-05-22T22:30:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.