Related papers: What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging

URL: http://arxiv.org/abs/2510.13232v1
Date: Wed, 15 Oct 2025 07:36:38 GMT
Title: What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging
Authors: Inha Kang, Youngsun Lim, Seonho Lee, Jiho Choi, Junsuk Choe, Hyunjung Shim
Abstract summary: State-of-the-art vision-language models (VLMs) suffer from a critical failure in understanding negation, often referred to as affirmative bias. We introduce CoVAND, a dataset constructed with a systematic chain-of-thought (CoT) and VQA-based pipeline to generate high-quality, instance-grounded negation data. Second, we propose NegToMe, a novel text token merging module that directly tackles the architectural cause of affirmative bias.
Score: 42.41372222021938
License: http://creativecommons.org/licenses/by/4.0/
Abstract: State-of-the-art vision-language models (VLMs) suffer from a critical failure in understanding negation, often referred to as affirmative bias. This limitation is particularly severe in described object detection (DOD) tasks. To address this, we propose two primary contributions: (1) a new dataset pipeline and (2) a novel, lightweight adaptation recipe. First, we introduce CoVAND, a dataset constructed with a systematic chain-of-thought (CoT) and VQA-based pipeline to generate high-quality, instance-grounded negation data. Second, we propose NegToMe, a novel text token merging module that directly tackles the architectural cause of affirmative bias. NegToMe fundamentally addresses the structural loss of negation cues in tokenization, grouping them with attributes into coherent semantic phrases. It maintains correct polarity at the input level, enabling robust negation understanding even with limited data. For instance, to prevent a model from treating the fragmented tokens "not" and "girl" as simply "girl", NegToMe binds them into a single token whose meaning is correctly distinguished from that of "girl" alone. This module is integrated with a parameter-efficient and strategic LoRA fine-tuning approach. Our method significantly improves performance on challenging negation benchmarks with a lowered false positive rate, boosting NMS-AP by up to +10.8 points on OVDEval and demonstrating generalization to SoTA VLMs. This work marks a crucial step forward in addressing negation understanding for real-world detection applications.
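
The token-binding idea in the abstract (keeping "not" and "girl" together rather than letting the model read "girl" alone) can be illustrated with a minimal sketch. This is not the NegToMe implementation, which the abstract does not specify: the cue list `NEGATION_CUES`, the function `merge_negation_tokens`, the `cue_weight` parameter, and the weighted-average merge rule are all assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

# Hypothetical cue list; the paper's actual set of negation cues is not given here.
NEGATION_CUES = {"not", "no", "without"}


def merge_negation_tokens(
    tokens: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    cue_weight: float = 0.5,
) -> Tuple[List[str], List[List[float]]]:
    """Merge each negation cue with the token that follows it.

    The merged embedding is a weighted average of the cue and the following
    token. This merge rule is an assumption, not the method from the paper.
    """
    if len(tokens) != len(embeddings):
        raise ValueError("tokens and embeddings must have the same length")

    merged_tokens: List[str] = []
    merged_embeddings: List[List[float]] = []
    i = 0
    while i < len(tokens):
        is_cue = tokens[i].lower() in NEGATION_CUES
        if is_cue and i + 1 < len(tokens):
            cue, head = embeddings[i], embeddings[i + 1]
            # Bind the cue and its target into one token so polarity is kept.
            merged_tokens.append(tokens[i] + "_" + tokens[i + 1])
            merged_embeddings.append(
                [cue_weight * c + (1.0 - cue_weight) * h for c, h in zip(cue, head)]
            )
            i += 2
        else:
            # Non-cue tokens (and a trailing cue with no target) pass through.
            merged_tokens.append(tokens[i])
            merged_embeddings.append(list(embeddings[i]))
            i += 1
    return merged_tokens, merged_embeddings


# Toy 2-dimensional embeddings, chosen only to make the arithmetic easy to follow.
toy_tokens = ["person", "not", "girl"]
toy_embeddings = [[1.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
out_tokens, out_embeddings = merge_negation_tokens(toy_tokens, toy_embeddings)
print(out_tokens)
print(out_embeddings)
```

In this toy setup the merged "not_girl" vector differs from the vector for "girl" alone, which is the property the abstract describes; how a real model would compute and consume such a merged token is left open here.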

Related papers

Negation-Aware Test-Time Adaptation for Vision-Language Models [26.043679706381646]
We study a practical but less-touched problem in Vision-Language Models (VLMs). Many real-world applications require models to explicitly identify what is false or non-existent. We propose a Negation-Aware Test-Time Adaptation (NEAT) method to efficiently adjust distribution-related parameters during inference.
arXiv Detail & Related papers (2025-07-25T08:25:48Z)
Learning Robust Negation Text Representations [60.23044940174016]
We propose a strategy to improve negation of text encoders using diverse patterns of negation and hedging. We observe large improvement in negation understanding capabilities while maintaining competitive performance on general benchmarks. Our method can be adapted to LLMs, leading to improved performance on negation benchmarks.
arXiv Detail & Related papers (2025-07-17T04:48:54Z)
Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating a constraint on every token can be prohibitively expensive. LCD can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP [57.33324843049638]
We introduce data generation pipelines that employ a large language model (LLM) and a multimodal LLM to produce negation-inclusive captions. Fine-tuning CLIP with data generated from our pipelines, we develop NegationCLIP, which enhances negation awareness while preserving the generality. Experiments on various CLIP architectures validate the effectiveness of our data generation pipelines in enhancing CLIP's ability to perceive negation accurately.
arXiv Detail & Related papers (2025-01-19T01:17:05Z)
Vision-Language Models Do Not Understand Negation [50.27667000027403]
NegBench is a benchmark designed to evaluate negation understanding across 18 task variations and 79k examples. We show that this approach can result in a 10% increase in recall on negated queries and a 28% boost in accuracy on multiple-choice questions with negated captions.
arXiv Detail & Related papers (2025-01-16T09:55:42Z)
Probing structural constraints of negation in Pretrained Language Models [1.8749305679160366]
We use probes to identify which contextual representations best encode the presence of negation in a sentence. We find that contextual representations of tokens inside the negation scope do allow for a better prediction of the presence of "not" compared to those outside the scope. Yet, further control experiments reveal that the presence of other lexical items is also better captured when using the contextual representation of a token within the same syntactic clause.
arXiv Detail & Related papers (2024-08-06T09:54:49Z)
Negation Triplet Extraction with Syntactic Dependency and Semantic Consistency [37.99421732397288]
SSENE is built based on a generative pretrained language model (PLM) of-Decoder architecture with a multi-task learning framework. We have constructed a high-quality Chinese dataset NegComment based on the users' reviews from the real-world platform of Meituan.
arXiv Detail & Related papers (2024-04-15T14:28:33Z)
AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation [57.8363998797433]
We propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs). Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing for coherent factually inconsistent summaries to be generated with high error-type coverage.
arXiv Detail & Related papers (2023-11-16T02:56:29Z)
