CCE: Confidence-Consistency Evaluation for Time Series Anomaly Detection
- URL: http://arxiv.org/abs/2509.01098v1
- Date: Mon, 01 Sep 2025 03:38:38 GMT
- Title: CCE: Confidence-Consistency Evaluation for Time Series Anomaly Detection
- Authors: Zhijie Zhong, Zhiwen Yu, Yiu-ming Cheung, Kaixiang Yang
- Abstract summary: We introduce Confidence-Consistency Evaluation (CCE), a novel evaluation metric. CCE simultaneously measures prediction confidence and uncertainty consistency. We also establish RankEval, a benchmark for comparing the ranking capabilities of various metrics.
- Score: 56.302586730134806
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Time Series Anomaly Detection metrics serve as crucial tools for model evaluation. However, existing metrics suffer from several limitations: insufficient discriminative power, strong hyperparameter dependency, sensitivity to perturbations, and high computational overhead. This paper introduces Confidence-Consistency Evaluation (CCE), a novel evaluation metric that simultaneously measures prediction confidence and uncertainty consistency. By employing Bayesian estimation to quantify the uncertainty of anomaly scores, we construct both global and event-level confidence and consistency scores for model predictions, resulting in a concise CCE metric. Theoretically and experimentally, we demonstrate that CCE possesses strict boundedness, Lipschitz robustness against score perturbations, and linear time complexity $\mathcal{O}(n)$. Furthermore, we establish RankEval, a benchmark for comparing the ranking capabilities of various metrics. RankEval represents the first standardized and reproducible evaluation pipeline that enables objective comparison of evaluation metrics. Both CCE and RankEval implementations are fully open-source.
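The following is a minimal, illustrative sketch of how a confidence-consistency style metric along these lines could be computed, not the authors' reference implementation (which is open-source). It assumes anomaly scores in [0, 1] and binary labels, uses a Beta posterior over group mean scores as the "Bayesian estimation" step, and combines a separation-based confidence term with an uncertainty-based consistency term; the grouping and the way the terms are combined are assumptions made for the sketch, while the boundedness in [0, 1] and the $\mathcal{O}(n)$ cost mirror the properties claimed in the abstract.

```python
# Minimal sketch (NOT the paper's reference implementation) of a
# confidence-consistency style metric for time series anomaly detection.
# Assumptions: anomaly scores lie in [0, 1], labels are binary, and a Beta
# posterior over each group's mean score stands in for the paper's Bayesian
# uncertainty estimate; the combination of the two terms is illustrative.
import numpy as np


def beta_posterior_stats(scores, prior_a=1.0, prior_b=1.0):
    """Treat scores in [0, 1] as soft Bernoulli evidence; return the posterior
    mean and standard deviation of a Beta(a, b) posterior."""
    a = prior_a + np.sum(scores)
    b = prior_b + np.sum(1.0 - scores)
    mean = a / (a + b)
    std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1.0)))
    return mean, std


def cce_sketch(scores, labels):
    """Toy confidence-consistency score in [0, 1]: high when anomalous points
    receive confidently high scores with low posterior uncertainty. Linear in n."""
    scores = np.clip(np.asarray(scores, dtype=float), 0.0, 1.0)
    labels = np.asarray(labels, dtype=int)

    anom_mean, anom_std = beta_posterior_stats(scores[labels == 1])
    norm_mean, norm_std = beta_posterior_stats(scores[labels == 0])

    confidence = 0.5 * (anom_mean + (1.0 - norm_mean))   # separation of score levels
    consistency = 1.0 - 0.5 * (anom_std + norm_std)      # low posterior uncertainty
    return 0.5 * (confidence + consistency)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = (rng.random(1000) < 0.05).astype(int)
    good = np.clip(0.8 * labels + 0.1 * rng.random(1000), 0, 1)  # strong detector
    bad = rng.random(1000)                                       # random detector
    print("good detector:  ", round(cce_sketch(good, labels), 3))
    print("random detector:", round(cce_sketch(bad, labels), 3))
```

Because both terms are averages of bounded quantities, the sketch stays in [0, 1] and runs in a single pass over the scores; the paper's actual CCE additionally builds event-level scores and proves Lipschitz robustness, which this toy version does not attempt.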
Related papers
- Signature-Kernel Based Evaluation Metrics for Robust Probabilistic and Tail-Event Forecasting [12.03452228384043]
Probabilistic forecasting is increasingly critical across high-stakes domains, from finance and epidemiology to climate science. Current evaluation frameworks lack a consensus metric and suffer from two critical flaws. We propose two kernel-based metrics: the signature maximum mean discrepancy (Sig-MMD) and our novel censored Sig-MMD.
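For intuition only, the sketch below computes a plain maximum mean discrepancy (MMD) between two sets of sampled forecast paths using an RBF kernel on the flattened paths; the RBF kernel and its bandwidth are stand-in assumptions, since Sig-MMD's defining ingredient is the signature kernel, which is not reproduced here.

```python
# Illustrative only: squared MMD between observed and forecast sample paths
# with an RBF kernel as a stand-in for the signature kernel used by Sig-MMD.
import numpy as np


def rbf_kernel(X, Y, bandwidth=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))


def mmd2(X, Y, bandwidth=1.0):
    """Biased estimate of squared MMD between samples X and Y (rows = paths)."""
    kxx = rbf_kernel(X, X, bandwidth).mean()
    kyy = rbf_kernel(Y, Y, bandwidth).mean()
    kxy = rbf_kernel(X, Y, bandwidth).mean()
    return kxx + kyy - 2.0 * kxy


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    observed = rng.normal(0.0, 1.0, size=(64, 20))        # 64 observed paths
    forecast_good = rng.normal(0.0, 1.0, size=(64, 20))   # well-matched forecast
    forecast_bad = rng.normal(1.0, 2.0, size=(64, 20))    # mis-specified forecast
    print("good:", mmd2(observed, forecast_good))
    print("bad: ", mmd2(observed, forecast_bad))
```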
arXiv Detail & Related papers (2026-02-10T19:00:00Z) - Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations [49.84786015324238]
Confidence estimation (CE) indicates how reliable the answers of large language models (LLMs) are, and can impact user trust and decision-making. We present a comprehensive evaluation framework for CE that measures confidence quality along three new aspects: robustness of confidence against prompt perturbations, stability across semantically equivalent answers, and sensitivity to semantically different answers.
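As a toy illustration of the stability aspect only, the sketch below scores how much a model's reported confidence varies across semantically equivalent rephrasings of the same question; the 1 minus normalized-standard-deviation aggregation is an assumption made for the sketch, not the paper's definition.

```python
# Toy "stability" score: 1 means identical confidence across all paraphrases
# of each question, lower values mean the reported confidence drifts.
import numpy as np


def confidence_stability(confidences_per_question):
    """confidences_per_question: list of arrays, one array of confidences in
    [0, 1] per question, across its paraphrases. Returns a value in [0, 1]."""
    per_question = [1.0 - 2.0 * np.std(np.asarray(c)) for c in confidences_per_question]
    return float(np.clip(np.mean(per_question), 0.0, 1.0))


if __name__ == "__main__":
    stable = [[0.81, 0.80, 0.79], [0.42, 0.41, 0.43]]
    unstable = [[0.95, 0.30, 0.60], [0.10, 0.90, 0.55]]
    print("stable model:  ", confidence_stability(stable))
    print("unstable model:", confidence_stability(unstable))
```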
arXiv Detail & Related papers (2026-01-12T23:16:50Z) - Open-World Deepfake Attribution via Confidence-Aware Asymmetric Learning [78.92934995292113]
We propose a Confidence-Aware Asymmetric Learning (CAL) framework, which balances confidence across known and novel forgery types. CAL consistently outperforms previous methods, achieving new state-of-the-art performance on both known and novel forgery attribution.
arXiv Detail & Related papers (2025-12-14T12:31:28Z) - Mind the Generation Process: Fine-Grained Confidence Estimation During LLM Generation [63.49409574310576]
Large language models (LLMs) exhibit overconfidence, assigning high confidence scores to incorrect predictions. We introduce FineCE, a novel confidence estimation method that delivers accurate, fine-grained confidence scores during text generation. Our code and all baselines used in the paper are available on GitHub.
arXiv Detail & Related papers (2025-08-16T13:29:35Z) - Trust, or Don't Predict: Introducing the CWSA Family for Confidence-Aware Model Evaluation [0.0]
We introduce two new metrics: Confidence-Weighted Selective Accuracy (CWSA) and its normalized variant, CWSA+. CWSA offers a principled and interpretable way to evaluate predictive models under confidence thresholds. We show that CWSA and CWSA+ both effectively detect nuanced failure modes and outperform classical metrics in trust-sensitive tests.
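A minimal sketch of what a confidence-weighted selective accuracy could look like is shown below: keep only predictions whose confidence clears a threshold and weight each retained prediction's correctness by its confidence. This is an illustrative reading of the abstract, not the authors' CWSA/CWSA+ formulas.

```python
# Sketch of a confidence-weighted selective accuracy under a threshold.
import numpy as np


def cwsa_sketch(y_true, y_pred, confidence, threshold=0.5):
    """Confidence-weighted accuracy over predictions kept at the threshold.
    Returns (score, coverage); score is None if nothing is kept."""
    y_true, y_pred, confidence = map(np.asarray, (y_true, y_pred, confidence))
    kept = confidence >= threshold
    coverage = float(kept.mean())
    if not kept.any():
        return None, 0.0
    correct = (y_true[kept] == y_pred[kept]).astype(float)
    score = float(np.sum(confidence[kept] * correct) / np.sum(confidence[kept]))
    return score, coverage


if __name__ == "__main__":
    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 1, 1]
    conf = [0.9, 0.8, 0.6, 0.95, 0.3, 0.7]
    for t in (0.0, 0.5, 0.9):
        print(t, cwsa_sketch(y_true, y_pred, conf, threshold=t))
```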
arXiv Detail & Related papers (2025-05-24T10:07:48Z) - MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels [16.300463494913593]
Large Language Models (LLMs) require robust confidence estimation. MCQA-Eval is an evaluation framework for assessing confidence measures in Natural Language Generation.
arXiv Detail & Related papers (2025-02-20T05:09:29Z) - The Certainty Ratio $C_ρ$: a novel metric for assessing the reliability of classifier predictions [0.0]
This paper introduces the Certainty Ratio ($C_\rho$), a novel metric designed to quantify the contribution of confident (certain) versus uncertain predictions to any classification performance measure. Experimental results across 21 datasets and multiple classifiers, including Decision Trees, Naive Bayes, 3-Nearest Neighbors, and Random Forests, demonstrate that $C_\rho$ reveals critical insights that conventional metrics often overlook.
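The sketch below illustrates the general idea of splitting a performance measure into contributions from "certain" and "uncertain" predictions; the threshold-based notion of certainty and the ratio used here are assumptions for the sketch, while the paper defines $C_\rho$ precisely.

```python
# Illustrative split of correctness into certain vs. uncertain contributions.
import numpy as np


def certainty_ratio_sketch(y_true, y_pred, prob_max, tau=0.75):
    """Fraction of all correct predictions that come from 'certain' ones,
    where a prediction is certain if its max class probability >= tau."""
    y_true, y_pred, prob_max = map(np.asarray, (y_true, y_pred, prob_max))
    correct = (y_true == y_pred)
    certain = prob_max >= tau
    total_correct = correct.sum()
    if total_correct == 0:
        return 0.0
    return float((correct & certain).sum() / total_correct)


if __name__ == "__main__":
    y_true = [0, 1, 1, 0, 1, 0]
    y_pred = [0, 1, 0, 0, 1, 1]
    p_max = [0.95, 0.60, 0.55, 0.90, 0.85, 0.70]
    print(certainty_ratio_sketch(y_true, y_pred, p_max, tau=0.8))  # 3 of 4 correct are certain
```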
arXiv Detail & Related papers (2024-11-04T10:50:03Z) - Accurate and Reliable Confidence Estimation Based on Non-Autoregressive
End-to-End Speech Recognition System [42.569506907182706]
Previous end-to-end (E2E) confidence estimation models (CEMs) predict score sequences of the same length as the input transcriptions, leading to unreliable estimation when deletion and insertion errors occur.
We propose the CIF-Aligned confidence estimation model (CA-CEM) to achieve accurate and reliable confidence estimation based on the novel non-autoregressive E2E ASR model Paraformer.
arXiv Detail & Related papers (2023-05-18T03:34:50Z) - Evaluating Probabilistic Classifiers: The Triptych [62.997667081978825]
We propose and study a triptych of diagnostic graphics that focus on distinct and complementary aspects of forecast performance.
The reliability diagram addresses calibration, the receiver operating characteristic (ROC) curve diagnoses discrimination ability, and the Murphy diagram visualizes overall predictive performance and value.
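For concreteness, the numpy-only sketch below computes the data behind the first two panels of such a triptych: binned calibration statistics for a reliability diagram and the ROC AUC as a summary of discrimination. The Murphy diagram is omitted, and the binning choices are assumptions made for the sketch.

```python
# Data behind a reliability diagram (calibration) and ROC AUC (discrimination).
import numpy as np


def reliability_bins(y_true, p, n_bins=10):
    """Per-bin (mean predicted probability, observed frequency, count)."""
    y_true, p = np.asarray(y_true), np.asarray(p)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((p[mask].mean(), y_true[mask].mean(), int(mask.sum())))
    return rows


def roc_auc(y_true, p):
    """AUC via the rank (Mann-Whitney U) formulation, assuming no ties."""
    y_true, p = np.asarray(y_true), np.asarray(p)
    pos, neg = p[y_true == 1], p[y_true == 0]
    ranks = np.argsort(np.argsort(np.concatenate([pos, neg]))) + 1
    u = ranks[: len(pos)].sum() - len(pos) * (len(pos) + 1) / 2
    return u / (len(pos) * len(neg))


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    p = rng.random(5000)
    y = (rng.random(5000) < p).astype(int)   # a perfectly calibrated forecaster
    print("AUC:", round(roc_auc(y, p), 3))
    for row in reliability_bins(y, p, n_bins=5):
        print("bin:", [round(v, 3) for v in row[:2]], "n =", row[2])
```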
arXiv Detail & Related papers (2023-01-25T19:35:23Z) - REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation
Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z) - A Statistical Analysis of Summarization Evaluation Metrics using
Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
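The sketch below shows the kind of resampling analysis this describes: a percentile bootstrap confidence interval for the correlation between an automatic metric and human judgments, where a wide interval signals uncertainty about the metric's reliability. Resampling only over examples is a simplification made for the sketch; the paper's analysis is more elaborate.

```python
# Percentile bootstrap CI for the metric-human correlation (illustrative).
import numpy as np


def bootstrap_corr_ci(metric_scores, human_scores, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    m, h = np.asarray(metric_scores), np.asarray(human_scores)
    corrs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(m), size=len(m))   # resample examples with replacement
        corrs.append(np.corrcoef(m[idx], h[idx])[0, 1])
    lo, hi = np.quantile(corrs, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


if __name__ == "__main__":
    rng = np.random.default_rng(3)
    human = rng.normal(size=200)
    metric = 0.4 * human + rng.normal(scale=1.0, size=200)   # weakly correlated metric
    print("95% CI for Pearson r:", bootstrap_corr_ci(metric, human))
```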
arXiv Detail & Related papers (2021-03-31T18:28:14Z) - GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.