Related papers: Visual Robustness Benchmark for Visual Question Answering (VQA)

URL: http://arxiv.org/abs/2407.03386v5
Date: Tue, 29 Oct 2024 08:50:27 GMT
Title: Visual Robustness Benchmark for Visual Question Answering (VQA)
Authors: Md Farhan Ishmam, Ishmam Tashdeed, Talukder Asir Saadat, Md Hamjajul Ashmafee, Abu Raihan Mostofa Kamal, Md. Azam Hossain
Abstract summary: We propose the first large-scale benchmark comprising 213,000 augmented images. We challenge the visual robustness of multiple VQA models and assess the strength of realistic visual corruptions.
Score: 0.08246494848934446
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Can Visual Question Answering (VQA) systems perform just as well when deployed in the real world? Or are they susceptible to realistic corruption effects, e.g., image blur, which can be detrimental in sensitive applications such as medical VQA? While linguistic or textual robustness has been thoroughly explored in the VQA literature, there has yet to be any significant work on the visual robustness of VQA models. We propose the first large-scale benchmark comprising 213,000 augmented images, challenging the visual robustness of multiple VQA models and assessing the strength of realistic visual corruptions. Additionally, we have designed several robustness evaluation metrics that can be aggregated into a unified metric and tailored to fit a variety of use cases. Our experiments reveal several insights into the relationships between model size, performance, and robustness with the visual corruptions. Our benchmark highlights the need for a balanced approach in model development that considers model performance without compromising the robustness.

Related papers

WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios [19.156760664417718]
We introduce WearVQA, the first benchmark specifically designed to evaluate the Visual Question Answering capabilities of multi-modal AI assistants on wearable devices like smart glasses. WearVQA reflects the unique challenges of ego-centric interaction, where visual inputs may be occluded, poorly lit, unzoomed, or blurry. The benchmark comprises 2,520 carefully curated image-question-answer triplets, spanning 7 diverse image domains.
arXiv Detail & Related papers (2025-11-27T06:44:49Z)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning [95.89543460132413]
Vision-language models (VLMs) have improved performance by increasing the number of visual tokens. However, most real-world scenarios do not require such an extensive number of visual tokens. We present a new paradigm for visual token compression, namely, VisionThink.
arXiv Detail & Related papers (2025-07-17T17:59:55Z)
Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly. Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness. Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings. This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context in Visual Question Answering [20.35687327831644]
We study the robustness of Visual Question Answering (VQA) models from a novel perspective: visual context. SwapMix perturbs the visual context by swapping features of irrelevant context objects with features from other objects in the dataset. We train the models with perfect sight and find that the context over-reliance highly depends on the quality of visual representations.
arXiv Detail & Related papers (2022-04-05T15:32:25Z)
COIN: Counterfactual Image Generation for VQA Interpretation [5.994412766684842]
We introduce an interpretability approach for VQA models by generating counterfactual images. In addition to interpreting the results of VQA models on single images, the obtained results and the discussion provide an extensive explanation of VQA models' behaviour.
arXiv Detail & Related papers (2022-01-10T13:51:35Z)
Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples. We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
Learning from Lexical Perturbations for Consistent Visual Question Answering [78.21912474223926]
Existing Visual Question Answering (VQA) models are often fragile and sensitive to input variations. We propose a novel approach to address this issue based on modular networks, which creates two questions related by linguistic perturbations. We also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and augmentation pipeline to create controllable linguistic variations.
arXiv Detail & Related papers (2020-11-26T17:38:03Z)
RobustBench: a standardized adversarial robustness benchmark [84.50044645539305]
A key challenge in benchmarking robustness is that its evaluation is often error-prone, leading to robustness overestimation. We evaluate adversarial robustness with AutoAttack, an ensemble of white- and black-box attacks. We analyze the impact of robustness on the performance on distribution shifts, calibration, out-of-distribution detection, fairness, privacy leakage, smoothness, and transferability.
arXiv Detail & Related papers (2020-10-19T17:06:18Z)
Contrast and Classify: Training Robust VQA Models [60.80627814762071]
We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses. We find that optimizing both losses -- either alternately or jointly -- is key to effective training.
arXiv Detail & Related papers (2020-10-13T00:23:59Z)
Regularizing Attention Networks for Anomaly Detection in Visual Question Answering [10.971443035470488]
We evaluate the robustness of state-of-the-art VQA models to five different anomalies. We propose an attention-based method, which uses confidence of reasoning between input images and questions. We show that a maximum entropy regularization of attention networks can significantly improve the attention-based anomaly detection.
arXiv Detail & Related papers (2020-09-21T17:47:49Z)
Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme. CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions. We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)
Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models [39.338304913058685]
We study the trade-off between the model complexity and the performance on the Visual Question Answering task. We focus on the effect of "multi-modal fusion" in VQA models that is typically the most expensive step in a VQA pipeline.
arXiv Detail & Related papers (2020-01-20T11:27:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.