Data Smells: Categories, Causes and Consequences, and Detection of
Suspicious Data in AI-based Systems
- URL: http://arxiv.org/abs/2203.10384v1
- Date: Sat, 19 Mar 2022 19:21:52 GMT
- Authors: Harald Foidl, Michael Felderer, Rudolf Ramler
- Abstract summary: The article conceptualizes data smells and elaborates on their causes, consequences, detection, and use in the context of AI-based systems.
In addition, a catalogue of 36 data smells divided into three categories (i.e., Believability Smells, Understandability Smells, Consistency Smells) is presented.
- Score: 3.793596705511303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High data quality is fundamental for today's AI-based systems. However,
although data quality has been an object of research for decades, there is a
clear lack of research on potential data quality issues (e.g., ambiguous,
extraneous values). These kinds of issues are latent in nature and thus often
not obvious. Nevertheless, they can be associated with an increased risk of
future problems in AI-based systems (e.g., technical debt, data-induced
faults). As a counterpart to code smells in software engineering, we refer to
such issues as Data Smells. This article conceptualizes data smells and
elaborates on their causes, consequences, detection, and use in the context of
AI-based systems. In addition, a catalogue of 36 data smells divided into three
categories (i.e., Believability Smells, Understandability Smells, Consistency
Smells) is presented. Moreover, the article outlines tool support for detecting
data smells and presents the result of an initial smell detection on more than
240 real-world datasets.
Related papers
- Towards Understanding the Impact of Data Bugs on Deep Learning Models in Software Engineering [13.17302533571231]
Deep learning (DL) systems are prone to bugs from many sources, including training data.
Existing literature suggests that bugs in training data are highly prevalent.
We investigate three types of data prevalent in software engineering tasks: code-based, text-based, and metric-based.
arXiv Detail & Related papers (2024-11-19T00:28:20Z)
- Data Issues in Industrial AI System: A Meta-Review and Research Strategy [10.540603300770885]
Artificial intelligence (AI) is assuming an increasingly pivotal role within industrial systems.
Despite the recent trend within various industries to adopt AI, the actual adoption of AI is not as developed as perceived.
How to address these data issues stands as a significant concern confronting both industry and academia.
arXiv Detail & Related papers (2024-06-22T08:36:59Z)
- AI-Driven Frameworks for Enhancing Data Quality in Big Data Ecosystems: Error_Detection, Correction, and Metadata Integration [0.0]
This thesis proposes a novel set of interconnected frameworks aimed at enhancing big data quality comprehensively.
Firstly, we introduce new quality metrics and a weighted scoring system for precise data quality assessment.
Thirdly, we present a generic framework for detecting various quality anomalies using AI models.
arXiv Detail & Related papers (2024-05-06T21:36:45Z)
- On some elusive aspects of databases hindering AI based discovery: A case study on superconducting materials [0.0]
We discuss three aspects, namely intrinsically biased sample selection, possible hidden variables, disparate data age.
To our knowledge, we suggest and test a first strategy capable of detecting and quantifying the presence of the intrinsic data bias.
arXiv Detail & Related papers (2023-11-16T13:38:00Z)
- A Discrepancy Aware Framework for Robust Anomaly Detection [51.710249807397695]
We present a Discrepancy Aware Framework (DAF), which demonstrates robust performance consistently with simple and cheap strategies.
Our method leverages an appearance-agnostic cue to guide the decoder in identifying defects, thereby alleviating its reliance on synthetic appearance.
Under the simple synthesis strategies, it outperforms existing methods by a large margin. Furthermore, it also achieves the state-of-the-art localization performance.
arXiv Detail & Related papers (2023-10-11T15:21:40Z)
- Advanced Data Augmentation Approaches: A Comprehensive Survey and Future directions [57.30984060215482]
We provide a background of data augmentation, a novel and comprehensive taxonomy of reviewed data augmentation techniques, and the strengths and weaknesses (wherever possible) of each technique.
We also provide comprehensive results of the data augmentation effect on three popular computer vision tasks, such as image classification, object detection and semantic segmentation.
arXiv Detail & Related papers (2023-01-07T11:37:32Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- Data Smells in Public Datasets [7.1460275491017144]
We introduce a novel catalogue of data smells that can be used to indicate early signs of problems in machine learning systems.
To understand the prevalence of data quality issues in datasets, we analyse 25 public datasets and identify 14 data smells.
arXiv Detail & Related papers (2022-03-15T15:44:20Z)
- Federated Causal Discovery [74.37739054932733]
This paper develops a gradient-based learning framework named DAG-Shared Federated Causal Discovery (DS-FCD)
It can learn the causal graph without directly touching local data and naturally handle the data heterogeneity.
Extensive experiments on both synthetic and real-world datasets verify the efficacy of the proposed method.
arXiv Detail & Related papers (2021-12-07T08:04:12Z)
- DAE : Discriminatory Auto-Encoder for multivariate time-series anomaly detection in air transportation [68.8204255655161]
We propose a novel anomaly detection model called Discriminatory Auto-Encoder (DAE)
It uses the baseline of a regular LSTM-based auto-encoder but with several decoders, each getting data of a specific flight phase.
Results show that the DAE achieves better results in both accuracy and speed of detection.
arXiv Detail & Related papers (2021-09-08T14:07:55Z)
- Data Mining with Big Data in Intrusion Detection Systems: A Systematic Literature Review [68.15472610671748]
Cloud computing has become a powerful and indispensable technology for complex, high performance and scalable computation.
The rapid rate and volume of data creation has begun to pose significant challenges for data management and security.
The design and deployment of intrusion detection systems (IDS) in the big data setting has, therefore, become a topic of importance.
arXiv Detail & Related papers (2020-05-23T20:57:12Z)