Towards a Rigorous Evaluation of Time-series Anomaly Detection
- URL: http://arxiv.org/abs/2109.05257v1
- Date: Sat, 11 Sep 2021 11:14:04 GMT
- Title: Towards a Rigorous Evaluation of Time-series Anomaly Detection
- Authors: Siwon Kim, Kukjin Choi, Hyun-Soo Choi, Byunghan Lee, and Sungroh Yoon
- Abstract summary: In recent years, proposed studies on time-series anomaly detection (TAD) report high F1 scores on benchmark TAD datasets.
Most studies apply a peculiar evaluation protocol called point adjustment (PA) before scoring.
In this paper, we reveal that the PA protocol can severely overestimate detection performance.
- Score: 15.577148857778484
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In recent years, proposed studies on time-series anomaly detection (TAD)
report high F1 scores on benchmark TAD datasets, giving the impression of clear
improvements. However, most studies apply a peculiar evaluation protocol called
point adjustment (PA) before scoring. In this paper, we theoretically and
experimentally reveal that the PA protocol can severely overestimate detection
performance; even a random anomaly score can appear competitive with
state-of-the-art TAD methods. Therefore, the comparison
of TAD methods with F1 scores after the PA protocol can lead to misguided
rankings. Furthermore, we question the potential of existing TAD methods by
showing that an untrained model obtains comparable detection performance to the
existing methods even without PA. Based on our findings, we propose a new
baseline and an evaluation protocol. We expect our study to support rigorous
evaluation of TAD and to guide further improvements in future research.
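As context for the abstract's critique, the PA protocol is commonly described as follows: if any single point inside a ground-truth anomalous segment is flagged, the entire segment is counted as detected. A minimal sketch (function and variable names are illustrative, not taken from the paper):

```python
def point_adjust(preds, labels):
    """Point adjustment (PA): if any point inside a contiguous ground-truth
    anomalous segment is predicted anomalous, mark the whole segment detected.

    preds, labels: equal-length lists of 0/1 values.
    Returns the adjusted binary predictions.
    """
    adjusted = list(preds)
    i, n = 0, len(labels)
    while i < n:
        if labels[i] == 1:
            # Find the end of this contiguous anomalous segment.
            j = i
            while j < n and labels[j] == 1:
                j += 1
            if any(adjusted[i:j]):              # a single hit...
                adjusted[i:j] = [1] * (j - i)   # ...credits the whole segment
            i = j
        else:
            i += 1
    return adjusted
```

Because one detected point converts an entire anomalous segment into true positives, recall (and hence F1) computed after PA can be inflated even by random scores, which is the overestimation effect the paper analyzes.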
Related papers
- Practical Improvements of A/B Testing with Off-Policy Estimation [51.25970890274447]
We introduce a family of unbiased off-policy estimators that achieves lower variance than the standard approach.
Our theoretical analysis and experimental results validate the effectiveness and practicality of the proposed method.
arXiv Detail & Related papers (2025-06-12T13:11:01Z) - Rethinking Metrics and Benchmarks of Video Anomaly Detection [12.500876355560184]
Video Anomaly Detection (VAD) aims to detect anomalies that deviate from expectation.
In this paper, we rethink VAD evaluation protocols through comprehensive experimental analyses.
We propose three novel evaluation methods to address these limitations.
arXiv Detail & Related papers (2025-05-25T08:09:42Z) - Combining Query Performance Predictors: A Reproducibility Study [6.681467202699048]
As early as 2009, Hauff et al. [28] explored whether different QPP methods may be combined to improve prediction quality.
This study revisits Hauff et al.'s work to assess the extent of their findings in the light of new prediction methods, evaluation metrics, and datasets.
arXiv Detail & Related papers (2025-03-31T16:01:58Z) - Position: Quo Vadis, Unsupervised Time Series Anomaly Detection? [11.269007806012931]
The current state of machine learning scholarship in Time-series Anomaly Detection (TAD) is plagued by the persistent use of flawed evaluation metrics.
Our paper presents a critical analysis of the status quo in TAD, revealing the misleading track of current research.
arXiv Detail & Related papers (2024-05-04T14:43:31Z) - Model-free Test Time Adaptation for Out-Of-Distribution Detection [62.49795078366206]
We propose a Non-Parametric Test-Time Adaptation framework for Out-Of-Distribution Detection (abbr).
abbr utilizes online test samples for model adaptation during testing, enhancing adaptability to changing data distributions.
We demonstrate the effectiveness of abbr through comprehensive experiments on multiple OOD detection benchmarks.
arXiv Detail & Related papers (2023-11-28T02:00:47Z) - Beyond AUROC & co. for evaluating out-of-distribution detection performance [50.88341818412508]
Given their relevance for safe(r) AI, it is important to examine whether the basis for comparing OOD detection methods is consistent with practical needs.
We propose a new metric - Area Under the Threshold Curve (AUTC), which explicitly penalizes poor separation between ID and OOD samples.
arXiv Detail & Related papers (2023-06-26T12:51:32Z) - On Pitfalls of Test-Time Adaptation [82.8392232222119]
Test-Time Adaptation (TTA) has emerged as a promising approach for tackling the robustness challenge under distribution shifts.
We present TTAB, a test-time adaptation benchmark that encompasses ten state-of-the-art algorithms, a diverse array of distribution shifts, and two evaluation protocols.
arXiv Detail & Related papers (2023-06-06T09:35:29Z) - Evaluation Strategy of Time-series Anomaly Detection with Decay Function [1.713291434132985]
We propose a novel evaluation protocol called the Point-Adjusted protocol with decay function (PAdf) to evaluate the time-series anomaly detection algorithm.
This paper theoretically and experimentally shows that the PAdf protocol solves the over- and under-estimation problems of existing protocols.
arXiv Detail & Related papers (2023-05-15T23:55:49Z) - A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts [143.14128737978342]
Test-time adaptation, an emerging paradigm, has the potential to adapt a pre-trained model to unlabeled data during testing, before making predictions.
Recent progress in this paradigm highlights the significant benefits of utilizing unlabeled data for training self-adapted models prior to inference.
arXiv Detail & Related papers (2023-03-27T16:32:21Z) - A Semi-Bayesian Nonparametric Estimator of the Maximum Mean Discrepancy Measure: Applications in Goodness-of-Fit Testing and Generative Adversarial Networks [3.623570119514559]
We propose a semi-Bayesian nonparametric (semi-BNP) procedure for the goodness-of-fit (GOF) test.
Our method introduces a novel Bayesian estimator for the maximum mean discrepancy (MMD) measure.
We demonstrate that our proposed test outperforms frequentist MMD-based methods by achieving lower false rejection and false acceptance rates of the null hypothesis.
arXiv Detail & Related papers (2023-03-05T10:36:21Z) - Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation [54.72195809248172]
We present a new estimator based on a novel concept: retrospective reshuffling of participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z) - Boosting Out-of-Distribution Detection with Multiple Pre-trained Models [41.66566916581451]
Post hoc detection utilizing pre-trained models has shown promising performance and can be scaled to large-scale problems.
We propose a detection enhancement method by ensembling multiple detection decisions derived from a zoo of pre-trained models.
Our method substantially improves the relative performance by 65.40% and 26.96% on the CIFAR10 and ImageNet benchmarks.
arXiv Detail & Related papers (2022-12-24T12:11:38Z) - Manual Evaluation Matters: Reviewing Test Protocols of Distantly Supervised Relation Extraction [61.48964753725744]
We build manually-annotated test sets for two DS-RE datasets, NYT10 and Wiki20, and thoroughly evaluate several competitive models.
Results show that the manual evaluation can indicate very different conclusions from automatic ones.
arXiv Detail & Related papers (2021-05-20T06:55:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.