Predicting Seriousness of Injury in a Traffic Accident: A New Imbalanced
Dataset and Benchmark
- URL: http://arxiv.org/abs/2205.10441v1
- Date: Fri, 20 May 2022 21:15:26 GMT
- Title: Predicting Seriousness of Injury in a Traffic Accident: A New Imbalanced
Dataset and Benchmark
- Authors: Paschalis Lagias, George D. Magoulas, Ylli Prifti and Alessandro
Provetti
- Abstract summary: The paper introduces a new dataset to assess the performance of machine learning algorithms in the prediction of the seriousness of injury in a traffic accident.
The dataset is created by aggregating publicly available datasets from the UK Department for Transport.
- Score: 62.997667081978825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The paper introduces a new dataset to assess the performance of machine
learning algorithms in the prediction of the seriousness of injury in a traffic
accident. The dataset is created by aggregating publicly available datasets
from the UK Department for Transport, which are drastically imbalanced with
missing attributes sometimes approaching 50% of the overall data
dimensionality. The paper presents the data analysis pipeline starting from the
publicly available data of road traffic accidents and ending with predictors of
possible injuries and their degree of severity. It addresses the huge
incompleteness of public data with a MissForest model. The paper also
introduces two baseline approaches to create injury predictors: a supervised
artificial neural network and a reinforcement learning model. The dataset can
potentially stimulate diverse aspects of machine learning research on
imbalanced datasets and the two approaches can be used as baseline references
when researchers test more advanced learning algorithms in this area.
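The abstract names MissForest as the imputation method but gives no implementation detail. MissForest repeatedly predicts each incomplete column from the other columns with a random forest until the imputations stabilise. The sketch below is not the authors' pipeline. It approximates the idea for numeric columns only, assuming scikit-learn's `IterativeImputer` wrapped around a `RandomForestRegressor`. The function name `impute_missforest_style`, the synthetic data, and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor


def impute_missforest_style(X, n_trees=10, max_iter=3, seed=0):
    """Fill NaNs column by column using random-forest predictions.

    Approximation of the MissForest idea, numeric features only;
    categorical attributes would need a classifier-based variant.
    """
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=n_trees, random_state=seed),
        max_iter=max_iter,
        random_state=seed,
    )
    return imputer.fit_transform(X)


# Synthetic stand-in data (NOT the accident dataset): 200 rows, 4 features,
# with the last feature correlated to the first so imputation has signal.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 4))
X_full[:, 3] = X_full[:, 0] + 0.1 * rng.normal(size=200)

# Knock out roughly 20% of the entries at random.
mask = rng.random(X_full.shape) < 0.2
X_missing = X_full.copy()
X_missing[mask] = np.nan

X_imputed = impute_missforest_style(X_missing)
```

Observed entries pass through unchanged and only the masked entries are filled in. With missingness approaching 50% of attributes, as the abstract reports, the quality of such imputations should be checked against held-out known values before any predictor is trained on them.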
Related papers
- Towards Assessing Data Bias in Clinical Trials [0.0]
Health care datasets can still be affected by data bias.
Data bias provides a distorted view of reality, leading to wrong analysis results and, consequently, decisions.
This paper proposes a method to address bias in datasets that: (i) defines the types of data bias that may be present in the dataset, (ii) characterizes and quantifies data bias with adequate metrics, and (iii) provides guidelines to identify, measure, and mitigate data bias for different data sources.
arXiv Detail & Related papers (2022-12-19T17:10:06Z)
- D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling
Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies a human-in-the-loop AI approach for auditing and mitigating social biases.
A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network.
For each interaction, say weakening/deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset.
arXiv Detail & Related papers (2022-08-10T03:41:48Z)
- Deeply-Learned Generalized Linear Models with Missing Data [6.302686933168439]
We provide a formal treatment of missing data in the context of deeply learned generalized linear models.
We propose a new architecture, dlglm, that is able to flexibly account for both ignorable and non-ignorable patterns of missingness.
We conclude with a case study of a Bank Marketing dataset from the UCI Machine Learning Repository.
arXiv Detail & Related papers (2022-07-18T20:00:13Z)
- Resilient Neural Forecasting Systems [10.709321760368137]
Industrial machine learning systems face data challenges that are often under-explored in the academic literature.
In this paper, we discuss data challenges and solutions in the context of a Neural Forecasting application on labor planning.
We address changes in data distribution with a periodic retraining scheme and discuss the critical importance of model stability in this setting.
arXiv Detail & Related papers (2022-03-16T09:37:49Z)
- TRAPDOOR: Repurposing backdoors to detect dataset bias in machine
learning-based genomic analysis [15.483078145498085]
Under-representation of groups in datasets can lead to inaccurate predictions for certain groups, which can exacerbate systemic discrimination issues.
We propose TRAPDOOR, a methodology for identification of biased datasets by repurposing a technique that has been mostly proposed for nefarious purposes: Neural network backdoors.
Using a real-world cancer dataset, we analyze both the bias that already existed towards white individuals and biases that we introduced into datasets artificially.
arXiv Detail & Related papers (2021-08-14T17:02:02Z)
- A model for traffic incident prediction using emergency braking data [77.34726150561087]
We address the fundamental problem of data scarcity in road traffic accident prediction by training our model on emergency braking events instead of accidents.
We present a prototype implementing a traffic incident prediction model for Germany based on emergency braking data from Mercedes-Benz vehicles.
arXiv Detail & Related papers (2021-02-12T18:17:12Z)
- Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
- Dropout: Explicit Forms and Capacity Control [57.36692251815882]
We investigate capacity control provided by dropout in various machine learning problems.
In deep learning, we show that the data-dependent regularizer due to dropout directly controls the Rademacher complexity of the underlying class of deep neural networks.
We evaluate our theoretical findings on real-world datasets, including MovieLens, MNIST, and Fashion-MNIST.
arXiv Detail & Related papers (2020-03-06T19:10:15Z)
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a
Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and lack of relevant data, for the future learning tasks of a trained network.
We use the available data, which may be an imbalanced subset of the original training dataset or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.