Rethinking Semi-supervised Segmentation Beyond Accuracy: Reliability and Robustness
- URL: http://arxiv.org/abs/2506.05917v1
- Date: Fri, 06 Jun 2025 09:37:45 GMT
- Authors: Steven Landgraf, Markus Hillemann, Markus Ulrich
- Abstract summary: Reliable Segmentation Score (RSS) is a novel metric that combines predictive accuracy, calibration, and uncertainty quality measures via a harmonic mean. We advocate for a shift in evaluation protocols toward more holistic metrics like RSS to better align semi-supervised learning research with real-world deployment needs.
- Score: 10.220692937750295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semantic segmentation is critical for scene understanding but demands costly pixel-wise annotations, attracting increasing attention to semi-supervised approaches to leverage abundant unlabeled data. While semi-supervised segmentation is often promoted as a path toward scalable, real-world deployment, it is astonishing that current evaluation protocols exclusively focus on segmentation accuracy, entirely overlooking reliability and robustness. These qualities, which ensure consistent performance under diverse conditions (robustness) and well-calibrated model confidences as well as meaningful uncertainties (reliability), are essential for safety-critical applications like autonomous driving, where models must handle unpredictable environments and avoid sudden failures at all costs. To address this gap, we introduce the Reliable Segmentation Score (RSS), a novel metric that combines predictive accuracy, calibration, and uncertainty quality measures via a harmonic mean. RSS penalizes deficiencies in any of its components, providing an easy and intuitive way of holistically judging segmentation models. Comprehensive evaluations of UniMatchV2 against its predecessor and a supervised baseline show that semi-supervised methods often trade reliability for accuracy. While out-of-domain evaluations demonstrate UniMatchV2's robustness, they further expose persistent reliability shortcomings. We advocate for a shift in evaluation protocols toward more holistic metrics like RSS to better align semi-supervised learning research with real-world deployment needs.
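The harmonic-mean aggregation described in the abstract can be illustrated with a minimal sketch. The abstract does not list the exact component metrics here, so the three inputs below (an accuracy term, a calibration term written as one minus a calibration error, and an uncertainty-quality term) are assumptions, as is the function name `harmonic_mean_score`. All components are assumed to be scaled to [0, 1] with higher being better.

```python
from typing import Sequence


def harmonic_mean_score(components: Sequence[float]) -> float:
    """Harmonic mean of component scores in [0, 1] (higher is better).

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if not components:
        raise ValueError("at least one component score is required")
    for c in components:
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"component {c} is outside [0, 1]")
    # A zero component drives the harmonic mean to zero by convention.
    if any(c == 0.0 for c in components):
        return 0.0
    return len(components) / sum(1.0 / c for c in components)


# Assumed example values: accuracy, 1 - calibration error, uncertainty quality.
balanced = [0.8, 0.8, 0.8]
unbalanced = [0.9, 0.9, 0.3]  # strong accuracy, weak uncertainty quality

balanced_score = harmonic_mean_score(balanced)
unbalanced_score = harmonic_mean_score(unbalanced)
unbalanced_arithmetic = sum(unbalanced) / len(unbalanced)
```

With these assumed values, the unbalanced model scores about 0.54 under the harmonic mean, compared with an arithmetic mean of 0.70. That gap is the penalty for a deficiency in a single component, which is the behavior the abstract attributes to RSS.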
Related papers
- Aurora: Are Android Malware Classifiers Reliable under Distribution Shift? [51.12297424766236]
AURORA is a framework to evaluate malware classifiers based on their confidence quality and operational resilience.
AURORA is further complemented by a set of metrics designed to go beyond point-in-time performance.
The fragility we observe in state-of-the-art frameworks suggests the need for a return to the whiteboard.
arXiv Detail & Related papers (2025-05-28T20:22:43Z)
- TrustLoRA: Low-Rank Adaptation for Failure Detection under Out-of-distribution Data [62.22804234013273]
We propose a simple failure detection framework to unify and facilitate classification with rejection under both covariate and semantic shifts.
Our key insight is that by separating and consolidating failure-specific reliability knowledge with low-rank adapters, we can enhance the failure detection ability effectively and flexibly.
arXiv Detail & Related papers (2025-04-20T09:20:55Z)
- Lie Detector: Unified Backdoor Detection via Cross-Examination Framework [68.45399098884364]
We propose a unified backdoor detection framework in the semi-honest setting.
Our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines.
Notably, it is the first to effectively detect backdoors in multimodal large language models.
arXiv Detail & Related papers (2025-03-21T06:12:06Z)
- Calibrated and Efficient Sampling-Free Confidence Estimation for LiDAR Scene Semantic Segmentation [1.8861801513235323]
We introduce a sampling-free approach for estimating well-calibrated confidence values for classification tasks.
Our approach maintains well-calibrated confidence values while achieving increased processing speed.
Our method produces underconfident rather than overconfident predictions, an advantage for safety-critical applications.
arXiv Detail & Related papers (2024-11-18T15:13:20Z)
- ReliOcc: Towards Reliable Semantic Occupancy Prediction via Uncertainty Learning [26.369237406972577]
Vision-centric semantic occupancy prediction plays a crucial role in autonomous driving.
There is still little research exploring the reliability of semantic occupancy prediction from cameras.
We propose ReliOcc, a method designed to enhance the reliability of camera-based occupancy networks.
arXiv Detail & Related papers (2024-09-26T16:33:16Z)
- The BRAVO Semantic Segmentation Challenge Results in UNCV2024 [68.20197719071436]
We define two categories of reliability: (1) semantic reliability, which reflects the model's accuracy and calibration when exposed to various perturbations; and (2) OOD reliability, which measures the model's ability to detect object classes that are unknown during training.
The results reveal interesting insights into the importance of large-scale pre-training and minimal architectural design in developing robust and reliable semantic segmentation models.
arXiv Detail & Related papers (2024-09-23T15:17:30Z)
- Detecting Brittle Decisions for Free: Leveraging Margin Consistency in Deep Robust Classifiers [9.147975682184528]
Decision making in deep learning models can be sensitive to imperceptible perturbations.
Evaluating a model's vulnerability at a per-instance level using adversarial attacks is computationally too intensive and unsuitable for real-time deployment scenarios.
This paper introduces the concept of margin consistency for efficient detection of vulnerable samples.
arXiv Detail & Related papers (2024-06-26T16:00:35Z)
- TRUST-LAPSE: An Explainable and Actionable Mistrust Scoring Framework for Model Monitoring [4.262769931159288]
We propose TRUST-LAPSE, a "mistrust" scoring framework for continuous model monitoring.
We assess the trustworthiness of each input sample's model prediction using a sequence of latent-space embeddings.
Our latent-space mistrust scores achieve state-of-the-art results with AUROCs of 84.1 (vision), 73.9 (audio), and 77.1 (clinical EEGs).
arXiv Detail & Related papers (2022-07-22T18:32:38Z)
- Adversarial Robustness under Long-Tailed Distribution [93.50792075460336]
Adversarial robustness has attracted extensive studies recently by revealing the vulnerability and intrinsic characteristics of deep networks.
In this work we investigate the adversarial vulnerability as well as defense under long-tailed distributions.
We propose a clean yet effective framework, RoBal, which consists of two dedicated modules: a scale-invariant classifier and data re-balancing.
arXiv Detail & Related papers (2021-04-06T17:53:08Z)
- Approaching Neural Network Uncertainty Realism [53.308409014122816]
Quantifying or at least upper-bounding uncertainties is vital for safety-critical systems such as autonomous vehicles.
We evaluate uncertainty realism -- a strict quality criterion -- with a Mahalanobis distance-based statistical test.
We adapt it to the automotive domain and show that it significantly improves uncertainty realism compared to a plain encoder-decoder model.
arXiv Detail & Related papers (2021-01-08T11:56:12Z)