Related papers: SpaceVLM: Sub-Space Modeling of Negation in Vision-Language Models

URL: http://arxiv.org/abs/2511.12331v1
Date: Sat, 15 Nov 2025 19:18:40 GMT
Title: SpaceVLM: Sub-Space Modeling of Negation in Vision-Language Models
Authors: Sepehr Kazemi Ranjbar, Kumail Alhamoud, Marzyeh Ghassemi
Abstract summary: We show that the embedding space of Vision-Language Models can be divided into semantically consistent subspaces. We propose a training-free framework that models negation as a subspace in the joint embedding space rather than a single point. Our method improves negation understanding by about 30% on average over prior methods.
Score: 17.194017001016135
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language Models (VLMs) struggle with negation. Given a prompt like "retrieve (or generate) a street scene without pedestrians," they often fail to respect the "not." Existing methods address this limitation by fine-tuning on large negation datasets, but such retraining often compromises the model's zero-shot performance on affirmative prompts. We show that the embedding space of VLMs, such as CLIP, can be divided into semantically consistent subspaces. Based on this property, we propose a training-free framework that models negation as a subspace in the joint embedding space rather than a single point (Figure 1). To find the matching image for a caption such as "A but not N," we construct two spherical caps around the embeddings of A and N, and we score images by the central direction of the region that is close to A and far from N. Across retrieval, MCQ, and text-to-image tasks, our method improves negation understanding by about 30% on average over prior methods. It closes the gap between affirmative and negated prompts while preserving the zero-shot performance that fine-tuned models fail to maintain. Code will be released upon publication.
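The scoring idea in the abstract (two spherical caps, then the central direction of the region close to A and far from N) can be illustrated with a rough sketch. This is not the authors' implementation, which had not been released at the time of the abstract. It swaps in a simple Monte Carlo estimate: sample unit vectors near A, keep those inside A's cap and outside N's cap, and average them. The function names, the cap thresholds `cos_a` and `cos_n`, and the sampling parameters are all hypothetical choices made for illustration.

```python
import numpy as np


def normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def negation_direction(a, n, cos_a=0.8, cos_n=0.5,
                       num_samples=20000, noise=0.5, seed=0):
    """Monte Carlo estimate of the central direction of the region
    {x on the unit sphere : x.a >= cos_a and x.n <= cos_n}.

    cos_a and cos_n are assumed cap thresholds (cosines of the cap
    half-angles); they are illustrative, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    a, n = normalize(a), normalize(n)
    # Sample unit vectors concentrated around a.
    samples = normalize(a + noise * rng.standard_normal((num_samples, a.shape[0])))
    # Keep samples inside the cap around a and outside the cap around n.
    keep = (samples @ a >= cos_a) & (samples @ n <= cos_n)
    if not keep.any():
        return a  # fall back to the affirmative direction if the region is empty
    return normalize(samples[keep].mean(axis=0))


def score_images(image_embs, a, n, **kwargs):
    """Cosine similarity of each image embedding to the estimated direction."""
    direction = negation_direction(a, n, **kwargs)
    return normalize(image_embs) @ direction


# Toy 3-D example with made-up embeddings for "A but not N".
a = np.array([1.0, 0.0, 0.0])
n = np.array([1.0, 1.0, 0.0])
images = np.array([[0.9, -0.4, 0.1],   # close to A, far from N
                   [0.7, 0.7, 0.0]])   # close to N
scores = score_images(images, a, n)
```

In this toy setup the first image should score higher than the second, because the estimated direction is pushed away from N. In a realistic high-dimensional embedding space, naive sampling like this becomes inefficient, so a closed-form or optimization-based construction would likely be preferable.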

Related papers

Towards Effective Negation Modeling in Joint Audio-Text Models for Music [3.7723788828505125]
Joint audio-text models struggle with semantic phenomena such as negation. We introduce negation through text augmentation and a dissimilarity-based contrastive loss. We propose two protocols that frame negation modeling as retrieval and binary classification tasks.
arXiv Detail & Related papers (2026-01-20T13:06:48Z)
What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging [42.41372222021938]
State-of-the-art vision-language models (VLMs) suffer from a critical failure in understanding negation, often referred to as affirmative bias. We introduce CoVAND, a dataset constructed with a systematic chain-of-thought (CoT) and VQA-based pipeline to generate high-quality, instance-grounded negation data. We also propose NegToMe, a novel text token merging module that directly tackles the architectural cause of affirmative bias.
arXiv Detail & Related papers (2025-10-15T07:36:38Z)
Diffusion Models with Adaptive Negative Sampling Without External Resources [54.84368884047812]
ANSWER is a training-free technique, applicable to any model that supports CFG, that allows negative grounding of image concepts without explicit negative prompts. Experiments show that adding ANSWER to existing DMs outperforms the baselines on multiple benchmarks and is preferred by humans 2x more than the other methods.
arXiv Detail & Related papers (2025-08-05T00:45:54Z)
Negation-Aware Test-Time Adaptation for Vision-Language Models [26.043679706381646]
We study a practical but less-touched problem in Vision-Language Models (VLMs). Many real-world applications require models to explicitly identify what is false or non-existent. We propose a Negation-Aware Test-Time Adaptation (NEAT) method to efficiently adjust distribution-related parameters during inference.
arXiv Detail & Related papers (2025-07-25T08:25:48Z)
Vision-Language Models Do Not Understand Negation [50.27667000027403]
NegBench is a benchmark designed to evaluate negation understanding across 18 task variations and 79k examples. We show that this approach can result in a 10% increase in recall on negated queries and a 28% boost in accuracy on multiple-choice questions with negated captions.
arXiv Detail & Related papers (2025-01-16T09:55:42Z)
SeqZero: Few-shot Compositional Semantic Parsing with Sequential Prompts and Zero-shot Models [57.29358388475983]
Recent research showed promising results on combining pretrained language models with canonical utterance. We propose a novel few-shot semantic parsing method -- SeqZero. In particular, SeqZero brings out the merits from both models via ensemble equipped with our proposed constrained rescaling.
arXiv Detail & Related papers (2022-05-15T21:13:15Z)
Debiased Contrastive Learning of Unsupervised Sentence Representations [88.58117410398759]
Contrastive learning is effective in improving pre-trained language models (PLM) to derive high-quality sentence representations. Previous works mostly adopt in-batch negatives or sample from training data at random. We present a new framework, DCLR, to alleviate the influence of these improper negatives.
arXiv Detail & Related papers (2022-05-02T05:07:43Z)
Unsupervised Deep Learning Meets Chan-Vese Model [77.24463525356566]
We propose an unsupervised image segmentation approach that integrates the Chan-Vese (CV) model with deep neural networks. Our basic idea is to apply a deep neural network that maps the image into a latent space to alleviate the violation of the piecewise constant assumption in image space.
arXiv Detail & Related papers (2022-04-14T13:23:57Z)
Contrastive Neighborhood Alignment [81.65103777329874]
We present Contrastive Neighborhood Alignment (CNA), a manifold learning approach to maintain the topology of learned features. The target model aims to mimic the local structure of the source representation space using a contrastive loss. CNA is illustrated in three scenarios: manifold learning, where the model maintains the local topology of the original data in a dimension-reduced space; model distillation, where a small student model is trained to mimic a larger teacher; and legacy model update, where an older model is replaced by a more powerful one.
arXiv Detail & Related papers (2022-01-06T04:58:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.