NegVQA: Can Vision Language Models Understand Negation?
- URL: http://arxiv.org/abs/2505.22946v1
- Date: Wed, 28 May 2025 23:58:37 GMT
- Title: NegVQA: Can Vision Language Models Understand Negation?
- Authors: Yuhui Zhang, Yuchang Su, Yiming Liu, Serena Yeung-Levy
- Abstract summary: NegVQA is a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. We evaluate 20 state-of-the-art vision language models across seven model families and find that these models struggle significantly with negation.
- Score: 10.58857445465026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs' negation understanding and offers insights into future VLM development. Project page available at https://yuhui-zh15.github.io/NegVQA/.
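To make the construction recipe concrete, here is a minimal sketch of the pipeline the abstract describes: an LLM rewrites an existing VQA question in negated form, and the original and negated versions are paired as two-choice items. The prompt wording, model name, answer-flipping rule, and item schema are illustrative assumptions, not the authors' exact pipeline.

```python
# Hedged sketch: LLM-based negation of VQA questions (illustrative only).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def negate_question(question: str, model: str = "gpt-4o-mini") -> str:
    """Ask an LLM to rewrite a visual question in negated form."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": "Rewrite this visual question so it is negated, "
                       f"changing nothing else: {question}",
        }],
    )
    return resp.choices[0].message.content.strip()

def make_two_choice_item(image_id: str, question: str, answer: str) -> dict:
    """Pair an original yes/no question with its negated version.

    The gold answer flips under negation: if "Is there a dog?" is "yes",
    then "Is there no dog?" should be "no".
    """
    flipped = "no" if answer == "yes" else "yes"
    return {
        "image": image_id,
        "original": {"question": question, "answer": answer},
        "negated": {"question": negate_question(question), "answer": flipped},
        "choices": ["yes", "no"],
    }
```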
Related papers
- Negation: A Pink Elephant in the Large Language Models' Room?
Negations are key to determining sentence meaning, making them essential for logical reasoning. Despite their importance, negations pose a substantial challenge for large language models (LLMs).
arXiv Detail & Related papers (2025-03-28T13:04:41Z)
- Vision-Language Models Do Not Understand Negation
NegBench is a benchmark designed to evaluate negation understanding across 18 task variations and 79k examples. We show that fine-tuning on large-scale synthetic data with negated captions can result in a 10% increase in recall on negated queries and a 28% boost in accuracy on multiple-choice questions with negated captions.
arXiv Detail & Related papers (2025-01-16T09:55:42Z)
- NeIn: Telling What You Don't Want
This paper presents the first large-scale dataset, Negative Instruction (NeIn), for studying negation within instruction-based image editing. NeIn comprises 366,957 quintuplets (source image, original caption, selected object, negative sentence, and target image), including 342,775 queries for training and 24,182 for benchmarking image editing methods.
arXiv Detail & Related papers (2024-09-09T04:54:34Z)
- What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models
We introduce a novel dataset, C-VQA, specifically designed to test the counterfactual reasoning capabilities of modern multi-modal language models.
This dataset is constructed by infusing original questions with various types of counterfactual presuppositions, such as numerical and counter-language queries.
Our evaluations of contemporary vision-language models using this dataset reveal substantial performance drops, with some models showing decreases of up to 40%.
arXiv Detail & Related papers (2023-10-10T13:45:59Z)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can yield a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and gains of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively; a sketch of the idea is given below.
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
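The summary above describes RepARe only at a high level, so here is a hedged sketch of the general idea: use the VLM itself to surface salient image details, fold them into rephrased questions, and keep the candidate the model is most confident about. `vlm_generate` and `vlm_confidence` are hypothetical stand-ins for whatever VLM interface is available, and the selection rule is an assumption, not necessarily the paper's exact fitness function.

```python
# Hedged sketch of a RepARe-style rephrase-and-select loop; the VLM
# interface and the selection rule are illustrative assumptions.
from typing import Any, Callable

def repare(image: Any, question: str,
           vlm_generate: Callable[[Any, str], str],
           vlm_confidence: Callable[[Any, str], float],
           n_candidates: int = 3) -> str:
    """Return the question variant the VLM is most confident answering."""
    # 1) Extract salient details about the image with the VLM itself.
    details = vlm_generate(image, f"List image details relevant to: {question}")
    # 2) Augment the question with those details in a few rephrasings.
    candidates = [question] + [
        vlm_generate(image, f"Rephrase the question '{question}' "
                            f"using these details: {details}")
        for _ in range(n_candidates)
    ]
    # 3) Gradient-free selection: keep the highest-confidence candidate.
    return max(candidates, key=lambda q: vlm_confidence(image, q))
```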
- Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models
NOPE (Negative Object Presence Evaluation) is a novel benchmark designed to assess object hallucination in vision-language (VL) models.
We extensively investigate the performance of 10 state-of-the-art VL models in discerning the non-existence of objects in visual questions.
arXiv Detail & Related papers (2023-10-09T01:52:27Z)
- Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering
We introduce an interpretable-by-design model that factors model decisions into intermediate human-legible explanations.
We show that our inherently interpretable system can improve by 4.64% over a comparable black-box system on reasoning-focused questions; a rough sketch of the clue-bottleneck idea is given below.
arXiv Detail & Related papers (2023-05-24T08:33:15Z)
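As a rough illustration of the interpretable-by-design idea, the sketch below factors the answer through an explicit bottleneck of human-legible clues: the model first produces short textual clues from the image, then must answer from those clues alone, so every decision comes with an inspectable intermediate. Both callables are hypothetical stand-ins; this is a generic reconstruction, not the paper's architecture.

```python
# Hedged sketch of a clue-bottleneck pipeline; the two model interfaces
# are illustrative stand-ins, not the paper's components.
from typing import Any, Callable, List, Tuple

def clue_bottleneck_vqa(image: Any, question: str,
                        extract_clues: Callable[[Any, str], List[str]],
                        answer_from_clues: Callable[[List[str], str], str],
                        ) -> Tuple[str, List[str]]:
    """Answer a visual question through human-legible intermediate clues."""
    # Bottleneck: all visual evidence must pass through textual clues.
    clues = extract_clues(image, question)
    # The answerer never sees the image, only the clues, so the clues
    # serve as a faithful explanation of the decision.
    answer = answer_from_clues(clues, question)
    return answer, clues
```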
- CONDAQA: A Contrastive Reading Comprehension Dataset for Reasoning about Negation
We present the first English reading comprehension dataset which requires reasoning about the implications of negated statements in paragraphs.
CONDAQA features 14,182 question-answer pairs with over 200 unique negation cues.
The best-performing model on CONDAQA (UnifiedQA-v2-3b) achieves only 42% on our consistency metric, well below human performance of 81%; a sketch of this kind of group-consistency metric is given below.
arXiv Detail & Related papers (2022-11-01T06:10:26Z)
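The consistency number above is easier to interpret with a concrete metric in hand. Below is a minimal sketch of a group-consistency measure in this spirit: a model is credited for a group of related questions (e.g., an original question plus its contrastive edits) only if it answers every question in the group correctly. The grouping and field names are illustrative, not CONDAQA's actual schema.

```python
# Hedged sketch of a group-consistency metric; group structure is assumed.
from collections import defaultdict

def group_consistency(predictions, gold, group_ids):
    """Fraction of groups in which all answers match the gold labels."""
    correct_in_group = defaultdict(list)
    for pred, ans, gid in zip(predictions, gold, group_ids):
        correct_in_group[gid].append(pred == ans)
    return sum(all(v) for v in correct_in_group.values()) / len(correct_in_group)

# Toy example: two groups of contrastive variants.
preds = ["yes", "no", "no", "yes"]
gold  = ["yes", "no", "yes", "yes"]
gids  = [0, 0, 1, 1]
print(group_consistency(preds, gold, gids))  # 0.5: only group 0 is fully correct
```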
- AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss
This work tackles the language prior problem from the viewpoint of feature space learning.
An adapted margin cosine loss is designed to discriminate between the frequent and the sparse answer feature spaces.
Experimental results demonstrate that our adapted margin cosine loss can greatly enhance the baseline models; a generic sketch of this loss family is given below.
arXiv Detail & Related papers (2021-05-05T11:41:38Z)
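For readers unfamiliar with the loss family, here is a minimal PyTorch sketch of a large-margin cosine loss (CosFace-style), the family the "adapted margin cosine loss" builds on; AdaVQA's specific adaptation to frequent versus sparse answers is the paper's contribution and is not reproduced here. Shapes and hyperparameters are illustrative.

```python
# Hedged sketch of a large-margin cosine loss (CosFace-style), not
# AdaVQA's exact adaptation.
import torch
import torch.nn.functional as F

def margin_cosine_loss(features, weights, labels, scale=30.0, margin=0.35):
    """features: (B, D) answer features; weights: (C, D) per-answer class weights."""
    # Cosine similarity between L2-normalized features and class weights.
    cos = F.normalize(features) @ F.normalize(weights).t()        # (B, C)
    # Subtract the margin only from the logit of the true class.
    onehot = F.one_hot(labels, num_classes=weights.size(0)).float()
    logits = scale * (cos - margin * onehot)
    return F.cross_entropy(logits, labels)

# Toy usage.
feats = torch.randn(4, 128)
w = torch.randn(1000, 128)   # e.g., 1000 candidate answers
y = torch.tensor([3, 17, 3, 999])
print(margin_cosine_loss(feats, w, y))
```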
- Loss Re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View
We propose to interpret the language prior problem in VQA from a class-imbalance view.
This view explicitly reveals why a VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class-imbalance interpretation on other computer vision tasks, such as face recognition and image classification; a sketch of frequency-based loss re-scaling is given below.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
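As a concrete instance of the class-imbalance view, the sketch below re-scales a cross-entropy loss by inverse answer frequency so that rare answers are not drowned out by frequent ones; this weighting is a common generic remedy and an assumption here, not the paper's exact re-scaling scheme.

```python
# Hedged sketch of frequency-based loss re-scaling; the weighting scheme
# is an illustrative assumption.
import torch
import torch.nn.functional as F

def frequency_rescaled_loss(logits, labels, answer_counts, eps=1.0):
    # Rarer answers get larger weights, so the model is pushed away from
    # simply emitting the most frequent answer for a question type.
    weights = 1.0 / (answer_counts.float() + eps)
    weights = weights / weights.mean()          # keep the loss scale stable
    return F.cross_entropy(logits, labels, weight=weights)

# Toy usage: answer 0 dominates the training distribution.
logits = torch.randn(8, 4)
labels = torch.randint(0, 4, (8,))
counts = torch.tensor([9000, 500, 300, 200])
print(frequency_rescaled_loss(logits, labels, counts))
```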
- Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder
We propose a novel model-agnostic question encoder, the Visually-Grounded Question Encoder (VGQE), for VQA.
VGQE utilizes both the visual and language modalities equally while encoding the question; a generic sketch of the idea is given below.
We demonstrate the effect of VGQE on three recent VQA models and achieve state-of-the-art results.
arXiv Detail & Related papers (2020-07-13T05:36:36Z)
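To make the "both modalities equally" idea concrete, below is a generic sketch of a visually grounded question encoder: at each word, the encoder attends over image regions and fuses the attended visual feature with the word embedding before the recurrent update, so vision shapes the question representation from the first token onward. This is a plausible reconstruction of the idea, not the paper's exact VGQE architecture.

```python
# Hedged sketch of a visually grounded question encoder; the architecture
# is an illustrative reconstruction, not VGQE itself.
import torch
import torch.nn as nn

class GroundedQuestionEncoder(nn.Module):
    def __init__(self, vocab=10000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.att = nn.Linear(dim, dim)       # scores image regions per word
        self.gru = nn.GRUCell(2 * dim, dim)  # fuses word + attended visual feature

    def forward(self, tokens, regions):
        # tokens: (B, T) word ids; regions: (B, R, D) image-region features.
        B, T = tokens.shape
        h = regions.new_zeros(B, regions.size(-1))
        words = self.embed(tokens)                          # (B, T, D)
        for t in range(T):
            w = words[:, t]                                 # (B, D)
            scores = torch.einsum("bd,brd->br", self.att(w), regions)
            alpha = scores.softmax(dim=-1)                  # attention over regions
            v = torch.einsum("br,brd->bd", alpha, regions)  # attended visual feature
            h = self.gru(torch.cat([w, v], dim=-1), h)      # grounded update
        return h                                            # grounded question code

# Toy usage.
enc = GroundedQuestionEncoder(vocab=10000, dim=512)
q = torch.randint(0, 10000, (2, 7))   # batch of 2 questions, 7 tokens each
v = torch.randn(2, 36, 512)           # 36 region features per image
print(enc(q, v).shape)                # torch.Size([2, 512])
```

The design choice worth noting is that the visual feature enters before the recurrent update rather than after encoding, which is what lets both modalities contribute to the question representation on equal footing.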