Analysing the Robustness of Vision-Language-Models to Common Corruptions
- URL: http://arxiv.org/abs/2504.13690v2
- Date: Mon, 21 Apr 2025 17:07:18 GMT
- Title: Analysing the Robustness of Vision-Language-Models to Common Corruptions
- Authors: Muhammad Usama, Syeda Aishah Asim, Syed Bilal Ali, Syed Talal Wasim, Umair Bin Mansoor
- Abstract summary: Vision-language models (VLMs) have demonstrated impressive capabilities in understanding and reasoning about visual and textual content. We present the first comprehensive analysis of VLM robustness across 19 corruption types from the ImageNet-C benchmark. We introduce two new benchmarks, TextVQA-C and GQA-C, to evaluate how corruptions affect scene text understanding and object-based reasoning.
- Score: 2.9459935333120972
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-language models (VLMs) have demonstrated impressive capabilities in understanding and reasoning about visual and textual content. However, their robustness to common image corruptions remains under-explored. In this work, we present the first comprehensive analysis of VLM robustness across 19 corruption types from the ImageNet-C benchmark, spanning four categories: noise, blur, weather, and digital distortions. We introduce two new benchmarks, TextVQA-C and GQA-C, to systematically evaluate how corruptions affect scene text understanding and object-based reasoning, respectively. Our analysis reveals that transformer-based VLMs exhibit distinct vulnerability patterns across tasks: text recognition deteriorates most severely under blur and snow corruptions, while object reasoning shows higher sensitivity to corruptions such as frost and impulse noise. We connect these observations to the frequency-domain characteristics of different corruptions, revealing how transformers' inherent bias toward low-frequency processing explains their differential robustness patterns. Our findings provide valuable insights for developing more corruption-robust vision-language models for real-world applications.
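The evaluation protocol builds on ImageNet-C's corruption taxonomy, so corrupted variants of a VQA image set can be produced with standard tooling. As a minimal sketch (not the authors' released code), the third-party `imagecorruptions` package covers the same corruption families; the dataset paths, output layout, and severity choice below are hypothetical.

```python
# Sketch: build a corrupted evaluation set in the style of TextVQA-C / GQA-C.
# Assumes `pip install imagecorruptions`; all paths are hypothetical.
import os
import numpy as np
from PIL import Image
from imagecorruptions import corrupt, get_corruption_names

SRC_DIR = "textvqa/images"   # hypothetical clean-image directory
DST_DIR = "textvqa_c"        # hypothetical output root
SEVERITY = 3                 # ImageNet-C defines severities 1-5

# The 'all' subset yields 19 corruption types (15 standard + 4 extra),
# matching the count evaluated in the paper.
for name in get_corruption_names("all"):
    out_dir = os.path.join(DST_DIR, name, str(SEVERITY))
    os.makedirs(out_dir, exist_ok=True)
    for fname in os.listdir(SRC_DIR):
        img = np.asarray(Image.open(os.path.join(SRC_DIR, fname)).convert("RGB"))
        corrupted = corrupt(img, corruption_name=name, severity=SEVERITY)
        Image.fromarray(corrupted).save(os.path.join(out_dir, fname))
```

A model's robustness can then be summarized by comparing its accuracy on each (corruption, severity) copy of the benchmark against its clean-image accuracy.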
Related papers
- Benchmarking the Spatial Robustness of DNNs via Natural and Adversarial Localized Corruptions [49.546479320670464]
This paper introduces specialized metrics for benchmarking the spatial robustness of segmentation models. We propose region-aware multi-attack adversarial analysis, a method that enables a deeper understanding of model robustness. The results reveal that models respond to these two types of threats differently.
arXiv Detail & Related papers (2025-04-02T11:37:39Z)
- Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models [93.46875303598577]
Vision-language models (VLMs) have advanced rapidly in processing multimodal information, but their ability to reconcile conflicting signals remains underexplored.
This work investigates how VLMs process ASCII art, a unique medium where textual elements collectively form visual patterns, potentially creating semantic-visual conflicts.
arXiv Detail & Related papers (2025-04-02T10:47:07Z)
- Indoor scene recognition from images under visual corruptions [3.4861209026118836]
This paper presents an innovative approach to indoor scene recognition that leverages multimodal data fusion.
We examine two multimodal networks that synergize visual features from CNN models with semantic captions via a Graph Convolutional Network (GCN).
Our study shows that this fusion markedly improves model performance, with notable gains in Top-1 accuracy when evaluated against a corrupted subset of the Places365 dataset.
arXiv Detail & Related papers (2024-08-23T12:35:45Z)
- Towards Evaluating the Robustness of Visual State Space Models [63.14954591606638]
Vision State Space Models (VSSMs) have demonstrated remarkable performance in visual perception tasks.
However, their robustness under natural and adversarial perturbations remains a critical concern.
We present a comprehensive evaluation of VSSMs' robustness under various perturbation scenarios.
arXiv Detail & Related papers (2024-06-13T17:59:44Z)
- RobustCLEVR: A Benchmark and Framework for Evaluating Robustness in Object-centric Learning [9.308581290987783]
We present the RobustCLEVR benchmark dataset and evaluation framework.
Our framework takes a novel approach to evaluating robustness by enabling the specification of causal dependencies.
Overall, we find that object-centric methods are not inherently robust to image corruptions.
arXiv Detail & Related papers (2023-08-28T20:52:18Z)
- Hierarchical Contrastive Learning for Pattern-Generalizable Image Corruption Detection [40.04083743934034]
We develop a hierarchical contrastive learning framework to detect corrupted regions.
A specialized hierarchical interaction mechanism is designed to propagate the knowledge learned by contrastive learning across different scales.
Our model generalizes well across different corruption patterns.
arXiv Detail & Related papers (2023-08-27T10:03:48Z)
- Frequency-Based Vulnerability Analysis of Deep Learning Models against Image Corruptions [48.34142457385199]
We present MUFIA, an algorithm designed to identify the specific types of corruptions that can cause models to fail.
We find that even state-of-the-art models trained to be robust against known common corruptions struggle against the low visibility-based corruptions crafted by MUFIA.
arXiv Detail & Related papers (2023-06-12T15:19:13Z)
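Both the main paper's frequency-domain analysis and MUFIA above hinge on where a corruption's energy sits in the spectrum. A minimal sketch of one way to measure this, using only NumPy: take the radially averaged power spectrum of the corruption residual (corrupted minus clean). The binning scheme is one reasonable choice, not either paper's exact protocol.

```python
import numpy as np

def radial_power_spectrum(img: np.ndarray, n_bins: int = 64) -> np.ndarray:
    """Radially averaged power spectrum of a 2-D (grayscale) array."""
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h / 2, x - w / 2)           # distance from spectrum center
    bins = np.minimum((r / r.max() * n_bins).astype(int), n_bins - 1)
    totals = np.bincount(bins.ravel(), weights=power.ravel(), minlength=n_bins)
    counts = np.bincount(bins.ravel(), minlength=n_bins)
    return totals / np.maximum(counts, 1)

def corruption_spectrum(clean: np.ndarray, corrupted: np.ndarray) -> np.ndarray:
    """Spectrum of the perturbation itself. Energy in late (high-frequency)
    bins indicates corruptions like impulse noise; energy concentrated in
    early bins indicates low-frequency corruptions such as fog."""
    residual = corrupted.astype(np.float64) - clean.astype(np.float64)
    return radial_power_spectrum(residual)
```

Comparing this profile against the low-frequency processing bias of transformer backbones is the kind of analysis the main paper uses to explain its differential robustness patterns.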
- A Survey on the Robustness of Computer Vision Models against Common Corruptions [3.6486148851646063]
Computer vision models are susceptible to changes in input images caused by sensor errors or extreme imaging environments.
These corruptions can significantly hinder the reliability of these models when deployed in real-world scenarios.
We present a comprehensive overview of methods that improve the robustness of computer vision models against common corruptions.
arXiv Detail & Related papers (2023-05-10T10:19:31Z)
- Improving robustness against common corruptions with frequency biased models [112.65717928060195]
Unseen image corruptions can cause a surprisingly large drop in performance.
Image corruption types have different characteristics in the frequency spectrum and would benefit from a targeted type of data augmentation.
We propose a new regularization scheme that minimizes the total variation (TV) of convolution feature-maps to increase high-frequency robustness.
arXiv Detail & Related papers (2021-03-30T10:44:50Z)
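The total-variation regularizer described above admits a compact sketch. The anisotropic L1 form below is a standard formulation of TV on feature maps and may differ in detail from the paper's; the loss weight and choice of layer are placeholders.

```python
import torch

def feature_tv_loss(fmap: torch.Tensor) -> torch.Tensor:
    """Anisotropic total variation of a (B, C, H, W) feature map.

    Penalizing differences between neighboring activations encourages
    smooth feature maps, making the features less sensitive to
    high-frequency input perturbations.
    """
    dh = (fmap[:, :, 1:, :] - fmap[:, :, :-1, :]).abs().mean()
    dw = (fmap[:, :, :, 1:] - fmap[:, :, :, :-1]).abs().mean()
    return dh + dw

# Hypothetical use inside a training step:
# loss = task_loss + tv_weight * feature_tv_loss(intermediate_features)
```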
- On Interaction Between Augmentations and Corruptions in Natural Corruption Robustness [78.6626755563546]
Several new data augmentations have been proposed that significantly improve performance on ImageNet-C.
We develop a new measure of the distance between augmentations and corruptions in this space, called the Minimal Sample Distance, and demonstrate a strong correlation between similarity and performance.
We observe a significant degradation in corruption robustness when the test-time corruptions are sampled to be perceptually dissimilar from ImageNet-C.
Our results suggest that test error can be reduced by training on perceptually similar augmentations, and that data augmentations may not generalize well beyond the existing benchmark.
arXiv Detail & Related papers (2021-02-22T18:58:39Z)
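Read loosely, the Minimal Sample Distance measures how far a corrupted sample sits from the nearest augmented training sample in some feature space. The sketch below assumes precomputed features and a Euclidean metric; both are placeholder assumptions rather than the paper's exact construction.

```python
import numpy as np

def minimal_sample_distance(aug_feats: np.ndarray, corr_feat: np.ndarray) -> float:
    """Distance from one corrupted sample to its nearest augmented sample.

    aug_feats: (N, D) features of augmented training samples.
    corr_feat: (D,)   feature of a single corrupted test sample.
    Feature extractor and metric are illustrative choices.
    """
    return float(np.linalg.norm(aug_feats - corr_feat, axis=1).min())
```

A small distance indicates the corruption is perceptually similar to the training-time augmentations, which the paper correlates with better robustness.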