Data Smells in Public Datasets
- URL: http://arxiv.org/abs/2203.08007v1
- Date: Tue, 15 Mar 2022 15:44:20 GMT
- Title: Data Smells in Public Datasets
- Authors: Arumoy Shome and Luis Cruz and Arie van Deursen
- Abstract summary: We introduce a novel catalogue of data smells that can be used to indicate early signs of problems in machine learning systems.
To understand the prevalence of data quality issues in datasets, we analyse 25 public datasets and identify 14 data smells.
- Score: 7.1460275491017144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The adoption of Artificial Intelligence (AI) in high-stakes domains such as
healthcare, wildlife preservation, autonomous driving and criminal justice
system calls for a data-centric approach to AI. Data scientists spend the
majority of their time studying and wrangling the data, yet tools to aid them
with data analysis are lacking. This study identifies the recurrent data
quality issues in public datasets. Analogous to code smells, we introduce a
novel catalogue of data smells that can be used to indicate early signs of
problems or technical debt in machine learning systems. To understand the
prevalence of data quality issues in datasets, we analyse 25 public datasets
and identify 14 data smells.
Related papers
- Network Intrusion Datasets: A Survey, Limitations, and Recommendations [0.0]
Data-driven cyberthreat detection has become a crucial defense technique in modern cybersecurity.
Despite its importance, data scarcity has long been recognized as a major obstacle in NIDS research.
arXiv Detail & Related papers (2025-02-10T17:14:37Z)
- Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets.
We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers.
Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Predicting Seriousness of Injury in a Traffic Accident: A New Imbalanced Dataset and Benchmark [62.997667081978825]
The paper introduces a new dataset to assess the performance of machine learning algorithms in the prediction of the seriousness of injury in a traffic accident.
The dataset is created by aggregating publicly available datasets from the UK Department for Transport.
arXiv Detail & Related papers (2022-05-20T21:15:26Z)
- Enabling Synthetic Data adoption in regulated domains [1.9512796489908306]
The switch from a Model-Centric to a Data-Centric mindset is putting emphasis on data and its quality rather than algorithms.
In particular, the sensitive nature of the information in highly regulated scenarios needs to be accounted for.
A clever way to bypass such a conundrum relies on Synthetic Data: data obtained from a generative process, learning the real data properties.
arXiv Detail & Related papers (2022-04-13T10:53:54Z)
- Data Smells: Categories, Causes and Consequences, and Detection of Suspicious Data in AI-based Systems [3.793596705511303]
This article conceptualizes data smells and elaborates on their causes, consequences, detection, and use in the context of AI-based systems.
In addition, a catalogue of 36 data smells divided into three categories (i.e., Believability Smells, Understandability Smells, Consistency Smells) is presented.
arXiv Detail & Related papers (2022-03-19T19:21:52Z)
- Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
- Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
- Occams Razor for Big Data? On Detecting Quality in Large Unstructured Datasets [0.0]
The new trend towards analytic complexity represents a severe challenge for the principle of parsimony, or Occam's Razor, in science.
Computational building block approaches for data clustering can help to deal with large unstructured datasets in minimized computation time.
The review concludes on how cultural differences between East and West are likely to affect the course of big data analytics.
arXiv Detail & Related papers (2020-11-12T16:06:01Z)
- DeGAN : Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, which may be an imbalanced subset of the original training dataset or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.