Quality In / Quality Out: Assessing Data Quality in an Anomaly Detection Benchmark
- URL: http://arxiv.org/abs/2305.19770v1
- Date: Wed, 31 May 2023 12:03:12 GMT
- Title: Quality In / Quality Out: Assessing Data Quality in an Anomaly Detection Benchmark
- Authors: José Camacho, Katarzyna Wasielewska, Marta Fuentes-García, Rafael Rodríguez-Gómez
- Abstract summary: We show that relatively minor modifications to the same benchmark dataset (UGR'16, a flow-based real-traffic dataset for anomaly detection) affect model performance significantly more than the choice of Machine Learning technique.
Our findings illustrate the need to devote more attention to (automatic) data quality assessment and optimization techniques in the context of autonomous networks.
- Score: 0.13764085113103217
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous or self-driving networks are expected to provide a solution to the
myriad of extremely demanding new applications in the Future Internet. The key
to handling this complexity is to perform tasks like network optimization and
failure recovery with minimal human supervision. For this purpose, the community
relies on the development of new Machine Learning (ML) models and techniques.
However, ML can only be as good as the data it is fitted with. Datasets provided
to the community as research benchmarks, which have a considerable impact on
research findings and directions, are often assumed to be of good quality by
default. In this paper, we show that relatively minor modifications to the same
benchmark dataset (UGR'16, a flow-based real-traffic dataset for anomaly
detection) affect model performance significantly more than the specific ML
technique considered. To understand this finding, we contribute a methodology to
investigate the root causes of these differences and to assess the quality of
the data labelling. Our findings illustrate the need to devote more attention to
(automatic) data quality assessment and optimization techniques in the context
of autonomous networks.
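To make the central claim concrete, here is a minimal sketch (not the authors' code) of the experimental pattern the abstract describes: fix a pool of ML techniques, perturb the dataset slightly, and compare how much scores spread across dataset variants versus across models. Synthetic data and label flipping stand in for UGR'16 and its benchmark modifications.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=4000, weights=[0.95], random_state=0)

def corrupt_labels(labels, rate):
    # Stand-in for a "minor modification" of the benchmark: flip a fraction of labels.
    labels = labels.copy()
    flip = rng.random(len(labels)) < rate
    labels[flip] = 1 - labels[flip]
    return labels

models = {"rf": RandomForestClassifier(random_state=0),
          "lr": LogisticRegression(max_iter=1000)}
variants = {"v0": 0.00, "v1": 0.05, "v2": 0.10}    # label-noise rate per dataset variant

scores = {}
for vname, rate in variants.items():
    y_v = corrupt_labels(y, rate)
    Xtr, Xte, ytr, yte = train_test_split(X, y_v, test_size=0.3, random_state=0)
    for mname, model in models.items():
        model.fit(Xtr, ytr)
        scores[vname, mname] = roc_auc_score(yte, model.predict_proba(Xte)[:, 1])

# Performance spread across dataset variants (per model) vs. across models (per variant).
across_variants = {m: [scores[v, m] for v in variants] for m in models}
across_models = {v: [scores[v, m] for m in models] for v in variants}
print({m: round(max(s) - min(s), 3) for m, s in across_variants.items()})
print({v: round(max(s) - min(s), 3) for v, s in across_models.items()})
```

In the paper's setting, the variant-to-variant spread dominated the model-to-model spread.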
Related papers
- A Theoretical Framework for AI-driven data quality monitoring in high-volume data environments [1.2753215270475886]
This paper presents a theoretical framework for an AI-driven data quality monitoring system designed to address the challenges of maintaining data quality in high-volume environments.
We examine the limitations of traditional methods in managing the scale, velocity, and variety of big data and propose a conceptual approach leveraging advanced machine learning techniques.
Key components include an intelligent data ingestion layer, adaptive preprocessing mechanisms, context-aware feature extraction, and AI-based quality assessment modules.
arXiv Detail & Related papers (2024-10-11T07:06:36Z)
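A minimal sketch of how the four components named above could be wired together; every function and rule here is an illustrative stand-in, not part of the framework itself.

```python
import statistics
from typing import Iterable, Iterator

def ingest(stream: Iterable[dict]) -> Iterator[dict]:
    # Intelligent ingestion layer: admit only records carrying the key field.
    return (r for r in stream if "id" in r)

def preprocess(records: Iterable[dict], fill: float = 0.0) -> Iterator[dict]:
    # Adaptive preprocessing: impute missing values (here, a constant fill).
    for r in records:
        yield {k: (fill if v is None else v) for k, v in r.items()}

def features(record: dict) -> list[float]:
    # Context-aware feature extraction stand-in: keep numeric fields only.
    return [v for v in record.values() if isinstance(v, (int, float))]

def quality_scores(batch: list[list[float]]) -> list[float]:
    # AI-based quality assessment stand-in: z-score distance from the batch mean.
    mu = [statistics.mean(col) for col in zip(*batch)]
    sd = [statistics.pstdev(col) or 1.0 for col in zip(*batch)]
    return [max(abs(v - m) / s for v, m, s in zip(row, mu, sd)) for row in batch]

raw = [{"id": 1, "bytes": 100}, {"bytes": 50}, {"id": 2, "bytes": None}, {"id": 3, "bytes": 9000}]
batch = [features(r) for r in preprocess(ingest(raw))]
print(quality_scores(batch))   # a large score flags a likely quality anomaly
```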
- AI-Driven Frameworks for Enhancing Data Quality in Big Data Ecosystems: Error Detection, Correction, and Metadata Integration [0.0]
This thesis proposes a novel set of interconnected frameworks aimed at enhancing big data quality comprehensively.
First, we introduce new quality metrics and a weighted scoring system for precise data quality assessment.
We also present a generic framework for detecting various quality anomalies using AI models.
arXiv Detail & Related papers (2024-05-06T21:36:45Z)
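As an illustration of a weighted scoring system over quality metrics, here is a small sketch; the dimensions, rules, and weights are assumptions for the example, not the ones proposed in the thesis.

```python
import pandas as pd

def quality_report(df: pd.DataFrame, weights=None) -> float:
    weights = weights or {"completeness": 0.5, "uniqueness": 0.3, "validity": 0.2}
    metrics = {
        "completeness": 1.0 - df.isna().mean().mean(),           # share of non-null cells
        "uniqueness": df.drop_duplicates().shape[0] / len(df),   # share of distinct rows
        "validity": (df.select_dtypes("number") >= 0).mean().mean(),  # example rule: non-negative
    }
    return sum(weights[k] * metrics[k] for k in weights)

df = pd.DataFrame({"bytes": [100, 200, None, 300], "packets": [1, 2, 2, -1]})
print(round(quality_report(df), 3))
```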
- A Novel Metric for Measuring Data Quality in Classification Applications (extended version) [0.0]
We introduce and explain a novel metric to measure data quality.
This metric is based on the correlated evolution between classification performance and the progressive deterioration of the data.
We provide an interpretation of each criterion and examples of assessment levels.
arXiv Detail & Related papers (2023-12-13T11:20:09Z)
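The gist of such a metric can be illustrated with a short sketch: deteriorate the data in controlled steps, track classifier performance at each step, and summarize how tightly the two evolve together. The corruption scheme and summary statistic below are assumptions, not the paper's exact definitions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, random_state=0)

levels = np.linspace(0.0, 0.4, 5)          # fraction of labels flipped at each step
accs = []
for lvl in levels:
    y_noisy = y.copy()
    flip = rng.random(len(y)) < lvl
    y_noisy[flip] = 1 - y_noisy[flip]
    accs.append(cross_val_score(LogisticRegression(max_iter=1000), X, y_noisy, cv=3).mean())

# High-quality data should degrade smoothly, tracking the deterioration level.
corr = np.corrcoef(levels, accs)[0, 1]
print(f"accuracy trajectory: {np.round(accs, 3)}, correlation with deterioration: {corr:.2f}")
```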
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
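A hedged sketch of the retrieval-augmentation pattern: fuse a pre-trained model's zero-shot score with evidence from search results. The `web_search` stub and the linear fusion rule are assumptions for illustration, not the paper's pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def web_search(query: str, k: int = 3) -> list[str]:
    # Placeholder for a real search-engine call.
    return ["the saluki is a breed of hunting dog",
            "salukis are sighthounds originating in the fertile crescent",
            "greyhounds and salukis share a slender build"][:k]

def retrieval_augmented_score(query: str, class_name: str, zero_shot_score: float,
                              alpha: float = 0.5) -> float:
    # Blend the model's own score with the similarity of retrieved evidence.
    docs = web_search(f"{query} {class_name}")
    vec = TfidfVectorizer().fit(docs + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(docs))
    return alpha * zero_shot_score + (1 - alpha) * float(sims.max())

print(retrieval_augmented_score("photo of a saluki", "saluki", zero_shot_score=0.42))
```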
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves the performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
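One way to read "flexible linear programming solver" is as an assignment problem; the sketch below (an assumption, not QualEval's actual formulation) assigns evaluation examples to skill categories so total affinity is maximized under a per-skill capacity, with random numbers standing in for LLM-scored affinities.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_examples, n_skills, cap = 6, 3, 3
affinity = rng.random((n_examples, n_skills))    # stand-in for LLM-scored relevance

c = -affinity.ravel()                            # maximize total affinity
A_eq = np.zeros((n_examples, n_examples * n_skills))
for i in range(n_examples):                      # each example picks one skill
    A_eq[i, i * n_skills:(i + 1) * n_skills] = 1
A_ub = np.zeros((n_skills, n_examples * n_skills))
for j in range(n_skills):                        # each skill holds at most `cap` examples
    A_ub[j, j::n_skills] = 1

res = linprog(c, A_ub=A_ub, b_ub=np.full(n_skills, cap),
              A_eq=A_eq, b_eq=np.ones(n_examples), bounds=(0, 1))
print(res.x.reshape(n_examples, n_skills).round(2))
```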
- RLBoost: Boosting Supervised Models using Deep Reinforcement Learning [0.0]
We present RLBoost, an algorithm that uses deep reinforcement learning strategies to evaluate a particular dataset and obtain a model capable of estimating the quality of any new data.
The results show that this model obtains better and more stable estimates than other state-of-the-art algorithms such as LOO, Data Shapley, or DVRL.
arXiv Detail & Related papers (2023-05-23T14:38:33Z)
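The data-valuation idea can be illustrated with a much simpler stand-in than the paper's deep RL agent: sample training subsets, use validation accuracy as the reward, and credit each example with the average reward of the subsets that included it. This is a sketch of the concept, not RLBoost itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, random_state=0)
Xtr, Xval, ytr, yval = train_test_split(X, y, test_size=0.3, random_state=0)

value = np.zeros(len(Xtr))
counts = np.zeros(len(Xtr))
for _ in range(50):                              # sample subsets, observe the reward
    idx = rng.choice(len(Xtr), size=len(Xtr) // 2, replace=False)
    model = LogisticRegression(max_iter=1000).fit(Xtr[idx], ytr[idx])
    reward = model.score(Xval, yval)
    value[idx] += reward
    counts[idx] += 1

quality = value / np.maximum(counts, 1)          # average reward when included
print("lowest-quality training points:", np.argsort(quality)[:5])
```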
- Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
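A small sketch of how such an externality shows up in practice, using synthetic data in which an extra source ("B") is systematically mislabeled for a minority sub-group; all names and numbers here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_source(n, flip_minority=False):
    X = rng.normal(size=(n, 5))
    g = (rng.random(n) < 0.2).astype(int)          # 1 marks the minority sub-group
    y = (X[:, 0] + 0.5 * g > 0).astype(int)
    if flip_minority:                              # source B mislabels the minority
        y[g == 1] = 1 - y[g == 1]
    return X, y, g

Xa, ya, _ = make_source(2000)
Xb, yb, _ = make_source(1000, flip_minority=True)
Xt, yt, gt = make_source(1000)                     # held-out evaluation set

for name, (X, y) in {"A only": (Xa, ya),
                     "A + B": (np.vstack([Xa, Xb]), np.hstack([ya, yb]))}.items():
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(f"{name}: minority-group accuracy = {clf.score(Xt[gt == 1], yt[gt == 1]):.2f}")
```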
- Augmented Bilinear Network for Incremental Multi-Stock Time-Series Classification [83.23129279407271]
We propose a method to efficiently retain the knowledge available in a neural network pre-trained on a set of securities.
In our method, the prior knowledge encoded in a pre-trained neural network is maintained by keeping existing connections fixed.
This knowledge is adjusted for the new securities by a set of augmented connections, which are optimized using the new data.
arXiv Detail & Related papers (2022-07-23T18:54:10Z)
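Made runnable in PyTorch, the freeze-and-augment scheme looks roughly like the sketch below; the generic linear backbone and additive adapter are stand-ins, not the paper's bilinear architecture.

```python
import torch
import torch.nn as nn

class Augmented(nn.Module):
    def __init__(self, pretrained: nn.Module, dim: int, n_classes: int):
        super().__init__()
        self.backbone = pretrained
        for p in self.backbone.parameters():       # prior knowledge stays fixed
            p.requires_grad = False
        self.adapter = nn.Linear(dim, n_classes)   # augmented connections

    def forward(self, x):
        return self.backbone(x) + self.adapter(x)  # additive adjustment for new data

backbone = nn.Linear(16, 3)                        # stand-in for the pre-trained network
model = Augmented(backbone, dim=16, n_classes=3)
opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-3)
x, y = torch.randn(8, 16), torch.randint(0, 3, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()                                         # only the adapter is updated
print(loss.item())
```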
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis [17.811597734603144]
We propose an approach to automatically generating counterfactual data for data augmentation and explanation.
A comprehensive evaluation on several datasets, against a variety of state-of-the-art benchmarks, demonstrates that our approach achieves significant improvements in model performance.
arXiv Detail & Related papers (2021-06-29T10:27:01Z)
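A deliberately simple sketch of counterfactual generation for sentiment data: swap polarity-bearing words for antonyms and flip the label. The tiny lexicon is an assumption; the paper's generator is more sophisticated.

```python
ANTONYMS = {"great": "terrible", "terrible": "great",
            "love": "hate", "hate": "love",
            "best": "worst", "worst": "best"}

def counterfactual(text: str, label: int) -> tuple[str, int] | None:
    # Substitute polarity words; if nothing changed, no counterfactual exists.
    words = text.split()
    flipped = [ANTONYMS.get(w.lower(), w) for w in words]
    if flipped == words:
        return None
    return " ".join(flipped), 1 - label

print(counterfactual("I love this film , the best I have seen", 1))
```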
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
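A sketch of the estimation idea, with MC-dropout standing in for the BNN (an assumption): fit an uncertainty-aware surrogate on the small labeled test set, then score the model-under-test's predictions over a larger unlabeled pool.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
surrogate = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Dropout(0.2), nn.Linear(32, 2))

X_lab, y_lab = torch.randn(40, 8), torch.randint(0, 2, (40,))   # small labeled test set
X_pool = torch.randn(500, 8)                                    # unlabeled test pool

opt = torch.optim.Adam(surrogate.parameters(), lr=1e-2)
for _ in range(200):                         # fit the surrogate on the labeled data
    opt.zero_grad()
    nn.functional.cross_entropy(surrogate(X_lab), y_lab).backward()
    opt.step()

def model_under_test(x):                     # stand-in for the model being tested
    return (x[:, 0] > 0).long()

surrogate.train()                            # keep dropout active for MC sampling
with torch.no_grad():
    probs = torch.stack([surrogate(X_pool).softmax(-1) for _ in range(30)]).mean(0)
est_acc = probs[torch.arange(len(X_pool)), model_under_test(X_pool)].mean()
print(f"estimated accuracy of the model-under-test: {est_acc:.2f}")
```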
This list is automatically generated from the titles and abstracts of the papers on this site.