Navigating the Metric Maze: A Taxonomy of Evaluation Metrics for Anomaly
Detection in Time Series
- URL: http://arxiv.org/abs/2303.01272v1
- Date: Thu, 2 Mar 2023 13:58:06 GMT
- Title: Navigating the Metric Maze: A Taxonomy of Evaluation Metrics for Anomaly
Detection in Time Series
- Authors: Sondre Sørbø and Massimiliano Ruocco
- Abstract summary: This paper provides a comprehensive overview of the metrics used for the evaluation of time series anomaly detection methods.
Twenty metrics are analyzed and discussed in detail, highlighting the unique suitability of each for specific tasks.
- Score: 0.456877715768796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The field of time series anomaly detection is constantly advancing, with
several methods available, making it a challenge to determine the most
appropriate method for a specific domain. The evaluation of these methods is
facilitated by the use of metrics, which vary widely in their properties.
Despite the existence of new evaluation metrics, there is limited agreement on
which metrics are best suited for specific scenarios and domains, and the most
commonly used metrics have faced criticism in the literature. This paper
provides a comprehensive overview of the metrics used for the evaluation of
time series anomaly detection methods, and also defines a taxonomy of these
based on how they are calculated. By defining a set of properties for
evaluation metrics and a set of specific case studies and experiments, twenty
metrics are analyzed and discussed in detail, highlighting the unique
suitability of each for specific tasks. Through extensive experimentation and
analysis, this paper argues that the choice of evaluation metric must be made
with care, taking into account the specific requirements of the task at hand.
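To make the distinction between calculation styles concrete, the sketch below (not taken from the paper; the function names and toy data are illustrative assumptions) contrasts a point-wise F1 score with a simple event-wise recall, two styles of calculation that would fall into different branches of the kind of taxonomy the paper defines. The same predictions can look very different under the two metrics, which is the core reason metric choice matters.
```python
# Minimal sketch contrasting a point-wise metric with an event-wise one on
# binary anomaly labels. Not the paper's implementation; purely illustrative.
import numpy as np

def point_wise_f1(labels: np.ndarray, preds: np.ndarray) -> float:
    """F1 computed over individual time points."""
    tp = np.sum((labels == 1) & (preds == 1))
    fp = np.sum((labels == 0) & (preds == 1))
    fn = np.sum((labels == 1) & (preds == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def event_wise_recall(labels: np.ndarray, preds: np.ndarray) -> float:
    """Fraction of anomalous segments containing at least one detected point."""
    # Find contiguous runs of 1s in the ground-truth labels.
    padded = np.concatenate([[0], labels, [0]])
    starts = np.where(np.diff(padded) == 1)[0]
    ends = np.where(np.diff(padded) == -1)[0]
    if len(starts) == 0:
        return 0.0
    hits = sum(preds[s:e].any() for s, e in zip(starts, ends))
    return hits / len(starts)

# Toy series: one long anomaly partially detected, one short anomaly missed.
labels = np.array([0, 0, 1, 1, 1, 1, 0, 0, 1, 0])
preds  = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
print(point_wise_f1(labels, preds))      # ~0.33: only 1 of 5 anomalous points flagged
print(event_wise_recall(labels, preds))  # 0.5: 1 of 2 anomalous events detected
```
Under the point-wise view the detector looks weak, while under the event-wise view it catches half of the anomalies; which verdict is appropriate depends on the requirements of the task, which is the argument the paper develops in detail.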
Related papers
- FSDEM: Feature Selection Dynamic Evaluation Metric [1.54369283425087]
The proposed metric is a dynamic metric with two properties that can be used to evaluate both the performance and the stability of a feature selection algorithm.
We conduct several empirical experiments to illustrate the use of the proposed metric in the successful evaluation of feature selection algorithms.
arXiv Detail & Related papers (2024-08-26T12:49:41Z) - Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z) - On Pixel-level Performance Assessment in Anomaly Detection [87.7131059062292]
Anomaly detection methods have demonstrated remarkable success across various applications.
However, assessing their performance, particularly at the pixel-level, presents a complex challenge.
In this paper, we dissect the intricacies of this challenge, underscored by visual evidence and statistical analysis.
arXiv Detail & Related papers (2023-10-25T08:02:27Z) - Unsupervised Anomaly Detection in Time-series: An Extensive Evaluation and Analysis of State-of-the-art Methods [10.618572317896515]
Unsupervised anomaly detection in time-series has been extensively investigated in the literature.
This paper proposes an in-depth evaluation study of recent unsupervised anomaly detection techniques in time-series.
arXiv Detail & Related papers (2022-12-06T15:05:54Z) - A Comparative Study on Unsupervised Anomaly Detection for Time Series:
Experiments and Analysis [28.79393419730138]
Time series anomaly detection is often essential to enable reliability and safety.
Many recent studies target anomaly detection for time series data.
We introduce taxonomies for data, methods, and evaluation strategies.
We systematically evaluate and compare state-of-the-art traditional as well as deep learning techniques.
arXiv Detail & Related papers (2022-09-10T10:44:25Z) - Estimation of Fair Ranking Metrics with Incomplete Judgments [70.37717864975387]
We propose a sampling strategy and estimation technique for four fair ranking metrics.
We formulate a robust and unbiased estimator which can operate even with very limited number of labeled items.
arXiv Detail & Related papers (2021-08-11T10:57:00Z) - A Statistical Analysis of Summarization Evaluation Metrics using
Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z) - Characterizing and comparing external measures for the assessment of
cluster analysis and community detection [1.5543116359698947]
Many external evaluation measures have been proposed in the literature to compare two partitions of the same set.
This makes the task of selecting the most appropriate measure for a given situation a challenge for the end user.
We propose a new empirical evaluation framework to solve this issue, and help the end user selecting an appropriate measure for their application.
arXiv Detail & Related papers (2021-02-01T09:10:25Z) - Quantitative Evaluations on Saliency Methods: An Experimental Study [6.290238942982972]
We briefly summarize the status quo of the metrics, including faithfulness, localization, false-positives, sensitivity check, and stability.
We conclude that among all the methods we compare, no single explanation method dominates others in all metrics.
arXiv Detail & Related papers (2020-12-31T14:13:30Z) - GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z) - Towards Question-Answering as an Automatic Metric for Evaluating the
Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.