Benchmarking Corruption Robustness of LVLMs: A Discriminative Benchmark and Robustness Alignment Metric
- URL: http://arxiv.org/abs/2511.19032v1
- Date: Mon, 24 Nov 2025 12:07:56 GMT
- Title: Benchmarking Corruption Robustness of LVLMs: A Discriminative Benchmark and Robustness Alignment Metric
- Authors: Xiangjie Sui, Songyang Li, Hanwei Zhu, Baoliang Chen, Yuming Fang, Xin Sun,
- Abstract summary: We introduce Bench-C, a benchmark emphasizing discriminative samples for assessing corruption robustness. We propose the Robustness Alignment Score (RAS), a unified metric that measures degradation in logit-level prediction structure.
- Score: 49.393713730706445
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the remarkable reasoning abilities of large vision-language models (LVLMs), their robustness under visual corruptions remains insufficiently studied. Existing evaluation paradigms exhibit two major limitations: 1) the dominance of low-discriminative samples in current datasets masks the real robustness gap between models; and 2) conventional accuracy-based metrics fail to capture the degradation of the underlying prediction structure. To bridge these gaps, we introduce Bench-C, a comprehensive benchmark emphasizing discriminative samples for assessing corruption robustness, with a selection strategy that jointly considers prediction inconsistency under corruption and semantic diversity. Furthermore, we propose the Robustness Alignment Score (RAS), a unified metric that measures degradation in logit-level prediction structure by considering shifts in prediction uncertainty and calibration alignment. Comprehensive experiments and analysis reveal several interesting findings: 1) model behaviors exhibit distinct patterns under corruption, such as erroneous confidence and hesitation; 2) although subtle corruption may lead to a slight accuracy gain, the overall prediction structure still degrades; and 3) decomposing corruption robustness into destructive and corrective components reveals distinct failure and recovery patterns across models.
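The abstract describes RAS only at a high level: a metric that tracks degradation of the logit-level prediction structure via shifts in uncertainty and calibration alignment. The sketch below is an illustrative stand-in, not the paper's actual RAS formula: it combines the absolute change in predictive entropy with the KL divergence between the clean and corrupted predictive distributions, two common proxies for the kinds of structural shifts the abstract mentions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p, eps=1e-12):
    """Shannon entropy of each predictive distribution."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def alignment_score(clean_logits, corrupt_logits, eps=1e-12):
    """Illustrative robustness-alignment-style score (lower = less
    structural degradation). Combines the shift in prediction
    uncertainty (entropy) with the KL divergence between clean and
    corrupted predictive distributions.
    NOTE: a sketch of the general idea, not the paper's RAS definition.
    """
    p, q = softmax(clean_logits), softmax(corrupt_logits)
    uncertainty_shift = np.abs(entropy(q) - entropy(p))
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(uncertainty_shift + kl))

clean = np.array([[2.0, 0.5, -1.0]])
flattened = np.array([[0.2, 0.1, 0.0]])  # a corruption that erodes confidence
assert alignment_score(clean, clean) < 1e-6
assert alignment_score(clean, flattened) > alignment_score(clean, clean)
```

A score like this can register degradation even when the argmax (and thus accuracy) is unchanged, which matches the abstract's observation that accuracy alone can mask structural damage.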
Related papers
- Calibrating Uncertainty for Zero-Shot Adversarial CLIP [33.707647228637114]
We propose a novel adversarial fine-tuning objective for CLIP considering both prediction accuracy and uncertainty alignments. Our objective aligns these distributions holistically under perturbations, moving beyond single-logit anchoring and restoring uncertainty.
arXiv Detail & Related papers (2025-12-15T05:41:08Z) - TrustLoRA: Low-Rank Adaptation for Failure Detection under Out-of-distribution Data [62.22804234013273]
We propose a simple failure detection framework to unify and facilitate classification with rejection under both covariate and semantic shifts. Our key insight is that by separating and consolidating failure-specific reliability knowledge with low-rank adapters, we can enhance the failure detection ability effectively and flexibly.
arXiv Detail & Related papers (2025-04-20T09:20:55Z) - Benchmarking the Spatial Robustness of DNNs via Natural and Adversarial Localized Corruptions [49.546479320670464]
This paper introduces specialized metrics for benchmarking the spatial robustness of segmentation models. We propose region-aware multi-attack adversarial analysis, a method that enables a deeper understanding of model robustness. The results reveal that models respond to these two types of threats differently.
arXiv Detail & Related papers (2025-04-02T11:37:39Z) - Towards Calibrated Robust Fine-Tuning of Vision-Language Models [97.19901765814431]
This work proposes a robust fine-tuning method that improves both OOD accuracy and confidence calibration simultaneously in vision language models.
We show that both OOD classification and OOD calibration errors have a shared upper bound consisting of two terms of ID data.
Based on this insight, we design a novel framework that conducts fine-tuning with a constrained multimodal contrastive loss enforcing a larger smallest singular value.
arXiv Detail & Related papers (2023-11-03T05:41:25Z) - Uncertainty Estimation for Heatmap-based Landmark Localization [4.673063715963989]
We propose Quantile Binning, a data-driven method to categorise predictions by uncertainty with estimated error bounds.
We demonstrate this framework by comparing and contrasting three uncertainty measures.
We conclude by illustrating how filtering out gross mispredictions caught in our Quantile Bins significantly improves the proportion of predictions under an acceptable error threshold.
arXiv Detail & Related papers (2022-03-04T14:40:44Z) - Robustness and Accuracy Could Be Reconcilable by (Proper) Definition [109.62614226793833]
The trade-off between robustness and accuracy has been widely studied in the adversarial literature.
We find that it may stem from the improperly defined robust error, which imposes an inductive bias of local invariance.
By definition, SCORE facilitates the reconciliation between robustness and accuracy, while still handling the worst-case uncertainty.
arXiv Detail & Related papers (2022-02-21T10:36:09Z) - Trust but Verify: Assigning Prediction Credibility by Counterfactual Constrained Learning [123.3472310767721]
Prediction credibility measures are fundamental in statistics and machine learning.
These measures should account for the wide variety of models used in practice.
The framework developed in this work expresses the credibility as a risk-fit trade-off.
arXiv Detail & Related papers (2020-11-24T19:52:38Z) - Sparse representation for damage identification of structural systems [11.397437423613418]
We propose a novel two-stage sensitivity analysis-based framework for both model updating and sparse damage identification.
A sparse representation pipeline built on a quasi-$\ell$ method is then presented for damage localization and quantification.
Results show that the proposed approach is capable of both localizing and quantifying structural damage with high accuracy.
arXiv Detail & Related papers (2020-06-06T18:04:35Z) - Model Uncertainty Quantification for Reliable Deep Vision Structural Health Monitoring [2.5126058470073263]
This paper proposes Bayesian inference for deep vision structural health monitoring models.
Uncertainty can be quantified using Monte Carlo dropout sampling.
Three independent case studies for cracks, local damage identification, and bridge component detection are investigated.
arXiv Detail & Related papers (2020-04-10T17:54:10Z)
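The Monte Carlo dropout sampling mentioned in the entry above has a simple mechanic: keep dropout active at test time, run many stochastic forward passes, and read the spread across passes as an uncertainty estimate. A minimal numpy sketch (toy one-layer network with arbitrary weights, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-layer "network"; the weights are arbitrary, for illustration only.
W = rng.normal(size=(4, 3))

def forward(x, drop_p=0.5):
    """One stochastic forward pass with dropout kept ON at test time."""
    mask = rng.random(x.shape) > drop_p      # Bernoulli keep-mask
    h = (x * mask) / (1.0 - drop_p)          # inverted-dropout scaling
    logits = h @ W
    z = logits - logits.max()                # stable softmax
    p = np.exp(z)
    return p / p.sum()

def mc_dropout_predict(x, n_samples=100):
    """Monte Carlo dropout: average many stochastic passes; the
    per-class standard deviation across samples serves as a rough
    epistemic-uncertainty estimate."""
    samples = np.stack([forward(x) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.std(axis=0)

x = np.array([1.0, -0.5, 2.0, 0.3])
mean_p, std_p = mc_dropout_predict(x)
assert np.isclose(mean_p.sum(), 1.0)  # mean of softmaxes still sums to 1
assert (std_p >= 0).all()
```

In practice the same idea applies to a real network (e.g., a segmentation or detection model) by simply leaving its dropout layers in training mode during inference; the number of samples trades compute for a smoother uncertainty estimate.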
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.