Toward Interpretable Evaluation Measures for Time Series Segmentation
- URL: http://arxiv.org/abs/2510.23261v1
- Date: Mon, 27 Oct 2025 12:23:37 GMT
- Title: Toward Interpretable Evaluation Measures for Time Series Segmentation
- Authors: Félix Chavelli, Paul Boniol, Michaël Thomazo
- Abstract summary: We introduce WARI (Weighted Adjusted Rand Index), which accounts for the position of segmentation errors, and SMS (State Matching Score), a fine-grained measure that identifies and scores four fundamental types of segmentation errors while allowing error-specific weighting. We empirically validate WARI and SMS on synthetic and real-world benchmarks, showing that they not only provide a more accurate assessment of segmentation quality but also uncover insights, such as error provenance and type, that are inaccessible with traditional measures.
- Score: 3.726498599140168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Time series segmentation is a fundamental task in analyzing temporal data across various domains, from human activity recognition to energy monitoring. While numerous state-of-the-art methods have been developed to tackle this problem, the evaluation of their performance remains critically limited. Existing measures predominantly focus on change point accuracy or rely on point-based measures such as the Adjusted Rand Index (ARI), which fail to capture the quality of the detected segments, ignore the nature of errors, and offer limited interpretability. In this paper, we address these shortcomings by introducing two novel evaluation measures: WARI (Weighted Adjusted Rand Index), which accounts for the position of segmentation errors, and SMS (State Matching Score), a fine-grained measure that identifies and scores four fundamental types of segmentation errors while allowing error-specific weighting. We empirically validate WARI and SMS on synthetic and real-world benchmarks, showing that they not only provide a more accurate assessment of segmentation quality but also uncover insights, such as error provenance and type, that are inaccessible with traditional measures.
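To make the critique of point-based measures concrete, the following is a minimal Python sketch (not the paper's WARI or SMS implementation) using scikit-learn's `adjusted_rand_score`. The state sequences and variable names are illustrative assumptions: they show that ARI can assign the same score to a shifted change point and to a spurious state inside a segment, exactly the positional information that WARI and SMS are designed to expose.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical ground-truth state sequence: three segments over 12 time steps.
truth = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

# Two hypothetical predictions, each with a single mislabeled point,
# but the errors differ in kind and position.
pred_shifted_boundary = np.array([0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2])  # change point shifted by one step
pred_spurious_state   = np.array([0, 0, 2, 0, 1, 1, 1, 1, 2, 2, 2, 2])  # wrong state inside a segment

# Point-based ARI scores both predictions identically (~0.74 for this toy example),
# even though the underlying errors have very different interpretations.
print(adjusted_rand_score(truth, pred_shifted_boundary))
print(adjusted_rand_score(truth, pred_spurious_state))
```

A position-aware measure would weigh these two error patterns differently, which is the behavior the paper's proposed measures aim to provide.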
Related papers
- Rethinking Evaluation of Infrared Small Target Detection [105.59753496831739]
This paper introduces a hybrid-level metric incorporating pixel- and target-level performance, proposes a systematic error analysis method, and emphasizes the importance of cross-dataset evaluation. An open-source toolkit has been released to facilitate standardized benchmarking.
arXiv Detail & Related papers (2025-09-21T02:45:07Z) - Rethinking Metrics and Benchmarks of Video Anomaly Detection [58.37571339811799]
Video Anomaly Detection (VAD) aims to detect anomalies that deviate from expectation. Existing VAD metrics are influenced by single annotation bias. Existing benchmarks lack the capability to evaluate scene overfitting of fully/weakly-supervised algorithms.
arXiv Detail & Related papers (2025-05-25T08:09:42Z) - OIPR: Evaluation for Time-series Anomaly Detection Inspired by Operator Interest [26.460594836601004]
We propose a novel set of time-series anomaly detection evaluation metrics, called OIPR. OIPR models the process of operators receiving detector alarms and handling faults, utilizing the area under the operator interest curve to evaluate the performance of TAD algorithms. It achieves a balance between point and event perspectives, overcoming their primary limitations and offering applicability to broader situations.
arXiv Detail & Related papers (2025-03-03T07:37:24Z) - VUS: Effective and Efficient Accuracy Measures for Time-Series Anomaly Detection [17.751395424719167]
This paper extensively evaluates quality measures for time-series AD to assess their robustness under noise, misalignments, and different anomaly cardinality ratios. Our results indicate that measures producing quality values independently of a threshold are more suitable for time-series AD.
arXiv Detail & Related papers (2025-02-18T22:19:52Z) - Towards Unbiased Evaluation of Time-series Anomaly Detector [6.521243384420707]
Time series anomaly detection (TSAD) is an evolving area of research motivated by its critical applications.
In this work, we propose an alternative adjustment protocol called "Balanced point adjustment" (BA).
arXiv Detail & Related papers (2024-09-19T19:02:45Z) - OoDIS: Anomaly Instance Segmentation and Detection Benchmark [57.89836988990543]
This work extends some commonly used anomaly segmentation benchmarks to include the instance segmentation and object detection tasks. Our evaluation of anomaly segmentation and object detection methods shows that both of these challenges remain unsolved problems.
arXiv Detail & Related papers (2024-06-17T17:59:56Z) - Segmentation Re-thinking Uncertainty Estimation Metrics for Semantic Segmentation [12.532289778772185]
Semantic segmentation is a fundamental application within machine learning.
The metric known as PAvPU (Patch Accuracy versus Patch Uncertainty) has been developed as a specialized tool for evaluating entropy-based uncertainty in image segmentation tasks.
Our investigation identifies three core deficiencies within the PAvPU framework and proposes robust solutions.
arXiv Detail & Related papers (2024-03-28T20:34:02Z) - Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 types of translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z) - MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between the sentence pairs on the held-out datasets from 7 NLP tasks align well with the human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z) - SoftED: Metrics for Soft Evaluation of Time Series Event Detection [4.139895427110409]
Time series event detection methods are evaluated mainly by standard classification metrics that focus solely on detection accuracy. Inaccuracy in detecting an event can often result from its preceding or delayed effects reflected in neighboring detections. This paper introduces SoftED metrics, a new set of metrics designed for the soft evaluation of event detection methods.
arXiv Detail & Related papers (2023-04-02T03:27:31Z) - Uncertainty-aware Score Distribution Learning for Action Quality Assessment [91.05846506274881]
We propose an uncertainty-aware score distribution learning (USDL) approach for action quality assessment (AQA).
Specifically, we regard an action as an instance associated with a score distribution, which describes the probability of different evaluated scores.
Under the circumstance where fine-grained score labels are available, we devise a multi-path uncertainty-aware score distributions learning (MUSDL) method to explore the disentangled components of a score.
arXiv Detail & Related papers (2020-06-13T15:41:29Z)