Automated, Unsupervised, and Auto-parameterized Inference of Data Patterns and Anomaly Detection
- URL: http://arxiv.org/abs/2412.05240v1
- Date: Fri, 06 Dec 2024 18:18:26 GMT
- Title: Automated, Unsupervised, and Auto-parameterized Inference of Data Patterns and Anomaly Detection
- Authors: Qiaolin Qin, Heng Li, Ettore Merlo, Maxime Lamothe
- Abstract summary: RIOLU is fully automated, automatically parameterized, and does not need labeled samples.
RIOLU can generate precise patterns from datasets in various domains, with a high F1 score of 97.2%.
A variant of RIOLU, with user guidance, can further boost its precision, with up to 37.4% improvement in terms of F1.
- Score: 6.454528834218153
- Abstract: With the advent of data-centric and machine learning (ML) systems, data quality is playing an increasingly critical role in ensuring the overall quality of software systems. Data preparation, an essential step towards high data quality, is known to be a highly effort-intensive process. Although prior studies have dealt with one of the most impactful issues, data pattern violations, these studies usually require data-specific configurations (i.e., parameterized) or use carefully curated data as learning examples (i.e., supervised), relying on domain knowledge and a deep understanding of the data, or demanding significant manual effort. In this paper, we introduce RIOLU: Regex Inferencer auto-parameterized Learning with Uncleaned data. RIOLU is fully automated, automatically parameterized, and does not need labeled samples. RIOLU can generate precise patterns from datasets in various domains, with a high F1 score of 97.2%, exceeding the state-of-the-art baseline. In addition, according to our experiment on five datasets with anomalies, RIOLU can automatically estimate a data column's error rate, derive normal patterns, and predict anomalies from unlabeled data with higher performance (up to 800.4% improvement in terms of F1) than the state-of-the-art baseline, even outperforming ChatGPT in terms of both accuracy (12.3% higher F1) and efficiency (10% less inference time). A variant of RIOLU, with user guidance, can further boost its precision, with up to 37.4% improvement in terms of F1. Our evaluation in an industrial setting further demonstrates the practical benefits of RIOLU.
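The core idea of regex-based anomaly detection can be illustrated with a minimal sketch: infer a character-class pattern from the majority structure of a column, then flag values that do not match. Note that this is a hypothetical simplification for illustration only; RIOLU's actual auto-parameterized, coverage-aware inference is considerably more involved, and all function names here are invented.

```python
import re
from collections import Counter

def infer_pattern(values):
    """Infer a simple character-class regex from a column of strings
    by taking the majority token structure as the "normal" pattern.
    (Illustrative sketch only, not RIOLU's actual algorithm.)"""
    def tokenize(v):
        out = []
        for ch in v:
            if ch.isdigit():
                cls = r"\d"
            elif ch.isalpha():
                cls = "[A-Za-z]"
            else:
                cls = re.escape(ch)
            if out and out[-1][0] == cls:
                out[-1][1] += 1          # extend the current run
            else:
                out.append([cls, 1])     # start a new run
        return tuple((c, n) for c, n in out)

    # The most common token structure across the column is treated as normal.
    majority, _ = Counter(tokenize(v) for v in values).most_common(1)[0]
    return "^" + "".join(c if n == 1 else f"{c}{{{n}}}" for c, n in majority) + "$"

def detect_anomalies(values):
    """Return the values that violate the inferred column pattern."""
    pattern = re.compile(infer_pattern(values))
    return [v for v in values if not pattern.fullmatch(v)]
```

For example, given `["2021-01-01", "2021-12-31", "2022-07-04", "N/A"]`, the majority structure yields a date-like pattern and `"N/A"` is flagged as the anomaly.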
Related papers
- Evaluating Language Models as Synthetic Data Generators [74.80905172696366]
AgoraBench is a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities.
Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities.
arXiv Detail & Related papers (2024-12-04T19:20:32Z)
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
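The adaptive label-smoothing idea described above can be sketched in a few lines: the smoothing strength for each sample grows with its uncertainty. This is a hypothetical stand-in for UAL's implementation; the function name, the linear schedule, and `eps_max` are all assumptions for illustration.

```python
def smoothed_targets(num_classes, gold, uncertainty, eps_max=0.3):
    """Build a smoothed one-hot target vector whose smoothing strength
    scales with the sample's uncertainty in [0, 1].
    (Illustrative sketch; the linear schedule is an assumption.)"""
    eps = eps_max * uncertainty  # more uncertain sample -> stronger smoothing
    target = [eps / (num_classes - 1)] * num_classes  # spread eps over wrong classes
    target[gold] = 1.0 - eps                          # keep the rest on the gold label
    return target
```

A confident sample (`uncertainty=0`) thus keeps a hard one-hot target, while an uncertain one distributes more probability mass across the other classes.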
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- RLBoost: Boosting Supervised Models using Deep Reinforcement Learning [0.0]
We present RLBoost, an algorithm that uses deep reinforcement learning strategies to evaluate a particular dataset and obtain a model capable of estimating the quality of any new data.
The results of the article show that this model obtains better and more stable results than other state-of-the-art algorithms such as LOO, DataShapley or DVRL.
arXiv Detail & Related papers (2023-05-23T14:38:33Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- A Dataset Fusion Algorithm for Generalised Anomaly Detection in Homogeneous Periodic Time Series Datasets [0.0]
"Dataset Fusion" is an algorithm for fusing periodic signals from multiple homogeneous datasets into a single dataset.
The proposed approach significantly outperforms conventional training approaches with an Average F1 score of 0.879.
Results show that using only 6.25% of the training data, translating to a 93.7% reduction in computational power, results in a mere 4.04% decrease in performance.
arXiv Detail & Related papers (2023-05-14T16:24:09Z)
- Systematic Evaluation of Deep Learning Models for Log-based Failure Prediction [3.3810628880631226]
This paper systematically investigates the combination of log data embedding strategies and Deep Learning (DL) types for failure prediction.
To that end, we propose a modular architecture to accommodate various configurations of embedding strategies and DL-based encoders.
Using the F1 score metric, our results show that the best overall performing configuration is a CNN-based encoder with Logkey2vec.
arXiv Detail & Related papers (2023-03-13T16:04:14Z)
- RLAD: Time Series Anomaly Detection through Reinforcement Learning and Active Learning [17.089402177923297]
We introduce a new semi-supervised, time series anomaly detection algorithm.
It uses deep reinforcement learning and active learning to efficiently learn and adapt to anomalies in real-world time series data.
It requires no manual tuning of parameters and outperforms all state-of-art methods we compare with.
arXiv Detail & Related papers (2021-03-31T15:21:15Z)
- AutoDO: Robust AutoAugment for Biased Data with Label Noise via Scalable Probabilistic Implicit Differentiation [3.118384520557952]
AutoAugment has sparked an interest in automated augmentation methods for deep learning models.
We show that those methods are not robust when applied to biased and noisy data.
We reformulate AutoAugment as a generalized automated dataset optimization (AutoDO) task.
Our experiments show up to 9.3% improvement for biased datasets with label noise compared to prior methods.
arXiv Detail & Related papers (2021-03-10T04:05:33Z)
- Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting.
We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.