Rethinking Semi-supervised Segmentation Beyond Accuracy: Reliability and Robustness
- URL: http://arxiv.org/abs/2506.05917v1
- Date: Fri, 06 Jun 2025 09:37:45 GMT
- Title: Rethinking Semi-supervised Segmentation Beyond Accuracy: Reliability and Robustness
- Authors: Steven Landgraf, Markus Hillemann, Markus Ulrich
- Abstract summary: The Reliable Segmentation Score (RSS) is a novel metric that combines predictive accuracy, calibration, and uncertainty quality measures via a harmonic mean. We advocate for a shift in evaluation protocols toward more holistic metrics like RSS to better align semi-supervised learning research with real-world deployment needs.
- Score: 10.220692937750295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semantic segmentation is critical for scene understanding but demands costly pixel-wise annotations, attracting increasing attention to semi-supervised approaches to leverage abundant unlabeled data. While semi-supervised segmentation is often promoted as a path toward scalable, real-world deployment, it is astonishing that current evaluation protocols exclusively focus on segmentation accuracy, entirely overlooking reliability and robustness. These qualities, which ensure consistent performance under diverse conditions (robustness) and well-calibrated model confidences as well as meaningful uncertainties (reliability), are essential for safety-critical applications like autonomous driving, where models must handle unpredictable environments and avoid sudden failures at all costs. To address this gap, we introduce the Reliable Segmentation Score (RSS), a novel metric that combines predictive accuracy, calibration, and uncertainty quality measures via a harmonic mean. RSS penalizes deficiencies in any of its components, providing an easy and intuitive way of holistically judging segmentation models. Comprehensive evaluations of UniMatchV2 against its predecessor and a supervised baseline show that semi-supervised methods often trade reliability for accuracy. While out-of-domain evaluations demonstrate UniMatchV2's robustness, they further expose persistent reliability shortcomings. We advocate for a shift in evaluation protocols toward more holistic metrics like RSS to better align semi-supervised learning research with real-world deployment needs.
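The abstract states that RSS combines accuracy, calibration, and uncertainty quality through a harmonic mean, so a weakness in any one component pulls the overall score down. The following is a minimal sketch of that aggregation idea only, not the authors' implementation: the function names, the assumption that every component is already scaled to [0, 1] with higher meaning better, and the example values are all hypothetical.

```python
def harmonic_mean(values):
    """Harmonic mean of non-negative numbers; zero if any value is zero."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


def combined_score(accuracy, calibration, uncertainty_quality):
    """Hypothetical RSS-style aggregate.

    Assumption (not from the source): each input is in [0, 1] and
    higher is better. A lower-is-better calibration error would need
    to be converted first, e.g. as 1 - error.
    """
    components = (accuracy, calibration, uncertainty_quality)
    for c in components:
        if not 0.0 <= c <= 1.0:
            raise ValueError("components must lie in [0, 1]")
    return harmonic_mean(components)


# Illustrative values only, not results from the paper.
balanced = combined_score(0.8, 0.8, 0.8)
lopsided = combined_score(0.9, 0.9, 0.3)
arithmetic = (0.9 + 0.9 + 0.3) / 3

print(f"balanced: {balanced:.2f}")
print(f"lopsided: {lopsided:.2f} (arithmetic mean: {arithmetic:.2f})")
```

With the illustrative inputs, the lopsided model scores well below its arithmetic mean, which is the penalizing behavior the abstract describes: high accuracy cannot compensate for a poor uncertainty component.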
Related papers
- SURE: Semi-dense Uncertainty-REfined Feature Matching [28.68008638977835]
SURE is a Semi-dense Uncertainty-REfined matching framework that jointly predicts correspondences and their confidence. Our approach introduces a novel evidential head for trustworthy coordinate regression, along with a lightweight spatial fusion module. Our method consistently outperforms existing state-of-the-art semi-dense matching models in both accuracy and efficiency.
arXiv Detail & Related papers (2026-03-05T06:53:11Z)
- Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness [4.129847064263056]
We systematically evaluate the performance of Large Language Models for rubric-based short-answer grading. We find that alignment is strong for binary tasks but degrades with increased rubric granularity. Experiments reveal that while the model is resilient to prompt injection, it is sensitive to synonym substitutions.
arXiv Detail & Related papers (2025-12-21T05:22:04Z)
- Robust Partial 3D Point Cloud Registration via Confidence Estimation under Global Context [12.216399037814012]
Partial point cloud registration is essential for autonomous perception and 3D scene understanding. We propose Confidence Estimation under Global Context (CEGC), a unified, confidence-driven framework for robust partial 3D registration. CEGC enables accurate alignment in complex scenes by jointly modeling overlap confidence and correspondence reliability within a shared global context.
arXiv Detail & Related papers (2025-09-29T04:36:55Z)
- Revisiting Multivariate Time Series Forecasting with Missing Values [65.30332997607141]
Missing values are common in real-world time series. Current approaches have developed an imputation-then-prediction framework that uses imputation modules to fill in missing values, followed by forecasting on the imputed data. This framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy. We introduce Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle.
arXiv Detail & Related papers (2025-09-27T20:57:48Z)
- Aurora: Are Android Malware Classifiers Reliable under Distribution Shift? [51.12297424766236]
AURORA is a framework to evaluate malware classifiers based on their confidence quality and operational resilience. AURORA is further complemented by a set of metrics designed to go beyond point-in-time performance. The fragility we observe in state-of-the-art frameworks suggests the need for a return to the whiteboard.
arXiv Detail & Related papers (2025-05-28T20:22:43Z)
- TrustLoRA: Low-Rank Adaptation for Failure Detection under Out-of-distribution Data [62.22804234013273]
We propose a simple failure detection framework to unify and facilitate classification with rejection under both covariate and semantic shifts. Our key insight is that by separating and consolidating failure-specific reliability knowledge with low-rank adapters, we can enhance the failure detection ability effectively and flexibly.
arXiv Detail & Related papers (2025-04-20T09:20:55Z)
- Lie Detector: Unified Backdoor Detection via Cross-Examination Framework [68.45399098884364]
We propose a unified backdoor detection framework in the semi-honest setting. Our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines. Notably, it is the first to effectively detect backdoors in multimodal large language models.
arXiv Detail & Related papers (2025-03-21T06:12:06Z)
- Calibrated and Efficient Sampling-Free Confidence Estimation for LiDAR Scene Semantic Segmentation [1.8861801513235323]
We introduce a sampling-free approach for estimating well-calibrated confidence values for classification tasks. Our approach maintains well-calibrated confidence values while achieving increased processing speed. Our method produces underconfident rather than overconfident predictions, an advantage for safety-critical applications.
arXiv Detail & Related papers (2024-11-18T15:13:20Z)
- ReliOcc: Towards Reliable Semantic Occupancy Prediction via Uncertainty Learning [26.369237406972577]
Vision-centric semantic occupancy prediction plays a crucial role in autonomous driving.
There has been little research effort to explore the reliability of predicting semantic occupancy from cameras.
We propose ReliOcc, a method designed to enhance the reliability of camera-based occupancy networks.
arXiv Detail & Related papers (2024-09-26T16:33:16Z)
- The BRAVO Semantic Segmentation Challenge Results in UNCV2024 [68.20197719071436]
We define two categories of reliability: (1) semantic reliability, which reflects the model's accuracy and calibration when exposed to various perturbations; and (2) OOD reliability, which measures the model's ability to detect object classes that are unknown during training.
The results reveal interesting insights into the importance of large-scale pre-training and minimal architectural design in developing robust and reliable semantic segmentation models.
arXiv Detail & Related papers (2024-09-23T15:17:30Z)
- Detecting Brittle Decisions for Free: Leveraging Margin Consistency in Deep Robust Classifiers [9.147975682184528]
Decision making in deep learning models can be sensitive to imperceptible perturbations.
Evaluating a model's vulnerability at a per-instance level using adversarial attacks is computationally too intensive and unsuitable for real-time deployment scenarios.
This paper introduces the concept of margin consistency for efficient detection of vulnerable samples.
arXiv Detail & Related papers (2024-06-26T16:00:35Z)
- TRUST-LAPSE: An Explainable and Actionable Mistrust Scoring Framework for Model Monitoring [4.262769931159288]
We propose TRUST-LAPSE, a "mistrust" scoring framework for continuous model monitoring.
We assess the trustworthiness of each input sample's model prediction using a sequence of latent-space embeddings.
Our latent-space mistrust scores achieve state-of-the-art results with AUROCs of 84.1 (vision), 73.9 (audio), and 77.1 (clinical EEGs).
arXiv Detail & Related papers (2022-07-22T18:32:38Z)
- Adversarial Robustness under Long-Tailed Distribution [93.50792075460336]
Adversarial robustness has attracted extensive studies recently by revealing the vulnerability and intrinsic characteristics of deep networks.
In this work we investigate the adversarial vulnerability as well as defense under long-tailed distributions.
We propose a clean yet effective framework, RoBal, which consists of two dedicated modules, a scale-invariant classifier and data re-balancing.
arXiv Detail & Related papers (2021-04-06T17:53:08Z)
- Approaching Neural Network Uncertainty Realism [53.308409014122816]
Quantifying or at least upper-bounding uncertainties is vital for safety-critical systems such as autonomous vehicles.
We evaluate uncertainty realism -- a strict quality criterion -- with a Mahalanobis distance-based statistical test.
We adopt it to the automotive domain and show that it significantly improves uncertainty realism compared to a plain encoder-decoder model.
arXiv Detail & Related papers (2021-01-08T11:56:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.