MechDetect: Detecting Data-Dependent Errors
- URL: http://arxiv.org/abs/2512.04138v1
- Date: Wed, 03 Dec 2025 15:34:16 GMT
- Title: MechDetect: Detecting Data-Dependent Errors
- Authors: Philipp Jung, Nicholas Chandler, Sebastian Jäger, Felix Biessmann,
- Abstract summary: Knowing how errors were generated can be key to tracing and fixing them.<n>We propose MechDetect, a simple algorithm to investigate error generation mechanisms.<n>We demonstrate the effectiveness of MechDetect in experiments on established benchmark datasets.
- Score: 1.2722697496405464
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Data quality monitoring is a core challenge in modern information processing systems. While many approaches to detect data errors or shifts have been proposed, few studies investigate the mechanisms governing error generation. We argue that knowing how errors were generated can be key to tracing and fixing them. In this study, we build on existing work in the statistics literature on missing values and propose MechDetect, a simple algorithm to investigate error generation mechanisms. Given a tabular data set and a corresponding error mask, the algorithm estimates whether or not the errors depend on the data using machine learning models. Our work extends established approaches to detect mechanisms underlying missing values and can be readily applied to other error types, provided that an error mask is available. We demonstrate the effectiveness of MechDetect in experiments on established benchmark datasets.
Related papers
- Towards Practical Benchmarking of Data Cleaning Techniques: On Generating Authentic Errors via Large Language Models [15.985949745494747]
TableEG is a framework that leverages large language models to generate authentic errors.<n>We trained on 12 real-world datasets spanning 10 diverse domains.<n>TableEG not only bridges the gap between synthetic and real-world errors but also establishes a robust benchmark for subsequent error detection and correction tasks.
arXiv Detail & Related papers (2025-07-15T02:58:25Z) - A Coin Has Two Sides: A Novel Detector-Corrector Framework for Chinese Spelling Correction [79.52464132360618]
Chinese Spelling Correction (CSC) stands as a foundational Natural Language Processing (NLP) task.
We introduce a novel approach based on error detector-corrector framework.
Our detector is designed to yield two error detection results, each characterized by high precision and recall.
arXiv Detail & Related papers (2024-09-06T09:26:45Z) - TeVAE: A Variational Autoencoder Approach for Discrete Online Anomaly Detection in Variable-state Multivariate Time-series Data [0.017476232824732776]
We propose a temporal variational autoencoder (TeVAE) that can detect anomalies with minimal false positives when trained on unlabelled data.
When properly configured, TeVAE flags anomalies only 6% of the time wrongly and detects 65% of anomalies present.
arXiv Detail & Related papers (2024-07-09T13:32:33Z) - WePaMaDM-Outlier Detection: Weighted Outlier Detection using Pattern
Approaches for Mass Data Mining [0.6754597324022876]
Outlier detection can reveal vital information about system faults, fraudulent activities, and patterns in the data.
This article proposed the WePaMaDM-Outlier Detection with distinct mass data mining domain.
It also investigates the significance of data modeling in outlier detection techniques in surveillance, fault detection, and trend analysis.
arXiv Detail & Related papers (2023-06-09T07:00:00Z) - A Robust and Explainable Data-Driven Anomaly Detection Approach For
Power Electronics [56.86150790999639]
We present two anomaly detection and classification approaches, namely the Matrix Profile algorithm and anomaly transformer.
The Matrix Profile algorithm is shown to be well suited as a generalizable approach for detecting real-time anomalies in streaming time-series data.
A series of custom filters is created and added to the detector to tune its sensitivity, recall, and detection accuracy.
arXiv Detail & Related papers (2022-09-23T06:09:35Z) - An Outlier Exposure Approach to Improve Visual Anomaly Detection
Performance for Mobile Robots [76.36017224414523]
We consider the problem of building visual anomaly detection systems for mobile robots.
Standard anomaly detection models are trained using large datasets composed only of non-anomalous data.
We tackle the problem of exploiting these data to improve the performance of a Real-NVP anomaly detection model.
arXiv Detail & Related papers (2022-09-20T15:18:13Z) - Self-Supervised Training with Autoencoders for Visual Anomaly Detection [61.62861063776813]
We focus on a specific use case in anomaly detection where the distribution of normal samples is supported by a lower-dimensional manifold.
We adapt a self-supervised learning regime that exploits discriminative information during training but focuses on the submanifold of normal examples.
We achieve a new state-of-the-art result on the MVTec AD dataset -- a challenging benchmark for visual anomaly detection in the manufacturing domain.
arXiv Detail & Related papers (2022-06-23T14:16:30Z) - Understanding Factual Errors in Summarization: Errors, Summarizers,
Datasets, Error Detectors [105.12462629663757]
In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model.
We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models.
arXiv Detail & Related papers (2022-05-25T15:26:48Z) - Learn then Test: Calibrating Predictive Algorithms to Achieve Risk
Control [67.52000805944924]
Learn then Test (LTT) is a framework for calibrating machine learning models.
Our main insight is to reframe the risk-control problem as multiple hypothesis testing.
We use our framework to provide new calibration methods for several core machine learning tasks with detailed worked examples in computer vision.
arXiv Detail & Related papers (2021-10-03T17:42:03Z) - DAE : Discriminatory Auto-Encoder for multivariate time-series anomaly
detection in air transportation [68.8204255655161]
We propose a novel anomaly detection model called Discriminatory Auto-Encoder (DAE)
It uses the baseline of a regular LSTM-based auto-encoder but with several decoders, each getting data of a specific flight phase.
Results show that the DAE achieves better results in both accuracy and speed of detection.
arXiv Detail & Related papers (2021-09-08T14:07:55Z) - Detecting Faults during Automatic Screwdriving: A Dataset and Use Case
of Anomaly Detection for Automatic Screwdriving [80.6725125503521]
Data-driven approaches, using Machine Learning (ML) for detecting faults have recently gained increasing interest.
We present a use case of using ML models for detecting faults during automated screwdriving operations.
arXiv Detail & Related papers (2021-07-05T11:46:00Z) - Uncertainty for Identifying Open-Set Errors in Visual Object Detection [31.533136658421892]
GMM-Det is a real-time method for extracting uncertainty from object detectors to identify and reject open-set errors.
We show that GMM-Det consistently outperforms existing uncertainty techniques for identifying and rejecting open-set detections.
arXiv Detail & Related papers (2021-04-03T07:12:31Z) - Out-Of-Bag Anomaly Detection [0.9449650062296822]
Data anomalies are ubiquitous in real world datasets, and can have an adverse impact on machine learning (ML) systems.
We propose a novel model-based anomaly detection method, that we call Out-of-Bag anomaly detection.
We show our method can improve the accuracy and reliability of an ML system as data pre-processing step via a case study on home valuation.
arXiv Detail & Related papers (2020-09-20T06:01:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.