SpaceVLM: Sub-Space Modeling of Negation in Vision-Language Models
- URL: http://arxiv.org/abs/2511.12331v1
- Date: Sat, 15 Nov 2025 19:18:40 GMT
- Title: SpaceVLM: Sub-Space Modeling of Negation in Vision-Language Models
- Authors: Sepehr Kazemi Ranjbar, Kumail Alhamoud, Marzyeh Ghassemi
- Abstract summary: We show that the embedding space of Vision-Language Models can be divided into semantically consistent subspaces. We propose a training-free framework that models negation as a subspace in the joint embedding space rather than a single point. Our method improves negation understanding by about 30% on average over prior methods.
- Score: 17.194017001016135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Models (VLMs) struggle with negation. Given a prompt like "retrieve (or generate) a street scene without pedestrians," they often fail to respect the "not." Existing methods address this limitation by fine-tuning on large negation datasets, but such retraining often compromises the model's zero-shot performance on affirmative prompts. We show that the embedding space of VLMs, such as CLIP, can be divided into semantically consistent subspaces. Based on this property, we propose a training-free framework that models negation as a subspace in the joint embedding space rather than a single point (Figure 1). To find the matching image for a caption such as "A but not N," we construct two spherical caps around the embeddings of A and N, and we score images by the central direction of the region that is close to A and far from N. Across retrieval, MCQ, and text-to-image tasks, our method improves negation understanding by about 30% on average over prior methods. It closes the gap between affirmative and negated prompts while preserving the zero-shot performance that fine-tuned models fail to maintain. Code will be released upon publication.
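The spherical-cap construction described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code, which has not been released: the function names, the cap angles, and the Monte Carlo estimate of the "central direction" are all assumptions made for illustration, and the toy vectors stand in for real CLIP embeddings. The sketch samples points on the unit sphere near A, keeps those inside the cap around A and outside the cap around N, averages them to get a direction, and scores images by cosine similarity with that direction.

```python
import numpy as np


def normalize(x):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def negation_direction(a, n, cap_a_deg=45.0, cap_n_deg=45.0,
                       num_samples=20000, seed=0):
    """Estimate the central direction of {close to a} minus {close to n}.

    Hypothetical helper: the cap half-angles (in degrees, below 90) and the
    sampling scheme are illustrative choices, not values from the paper.
    """
    a, n = normalize(a), normalize(n)
    rng = np.random.default_rng(seed)
    dim = a.shape[-1]

    # Candidate points: perturb `a` with isotropic noise of varying size,
    # then project back onto the unit sphere.
    radius = np.tan(np.radians(cap_a_deg)) * rng.uniform(size=(num_samples, 1))
    noise = rng.standard_normal((num_samples, dim)) / np.sqrt(dim)
    points = normalize(a + radius * noise)

    # Keep points inside the cap around A and outside the cap around N.
    in_cap_a = points @ a >= np.cos(np.radians(cap_a_deg))
    out_cap_n = points @ n < np.cos(np.radians(cap_n_deg))
    region = points[in_cap_a & out_cap_n]
    if len(region) == 0:
        raise ValueError("empty region: widen the cap around A or narrow the cap around N")

    # The normalized mean of the surviving points approximates the central direction.
    return normalize(region.mean(axis=0))


def score_images(image_embeddings, direction):
    """Cosine similarity between each image embedding and the direction."""
    return normalize(image_embeddings) @ direction


# Toy 8-dimensional example with made-up vectors (not real CLIP embeddings).
basis = np.eye(8)
a = basis[0]                               # concept A
n = normalize(basis[0] + 0.5 * basis[1])   # concept N, correlated with A
images = np.stack([
    basis[0] - 0.5 * basis[1],             # stands in for "A without N"
    basis[0] + 0.5 * basis[1],             # stands in for "A with N"
])
scores = score_images(images, negation_direction(a, n))
print(scores)
```

In this toy setup the two images are equally similar to A itself, so any difference in their scores comes from the estimated direction being pushed away from N. With real embeddings, the cap angles would need tuning, and a closed-form direction may be preferable to sampling.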
Related papers
- Towards Effective Negation Modeling in Joint Audio-Text Models for Music [3.7723788828505125]
Joint audio-text models struggle with semantic phenomena such as negation. We introduce negation through text augmentation and a dissimilarity-based contrastive loss. We propose two protocols that frame negation modeling as retrieval and binary classification tasks.
arXiv Detail & Related papers (2026-01-20T13:06:48Z)
- What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging [42.41372222021938]
State-of-the-art vision-language models (VLMs) suffer from a critical failure in understanding negation, often referred to as affirmative bias. We introduce CoVAND, a dataset constructed with a systematic chain-of-thought (CoT) and VQA-based pipeline to generate high-quality, instance-grounded negation data. Second, we propose NegToMe, a novel text token merging module that directly tackles the architectural cause of affirmative bias.
arXiv Detail & Related papers (2025-10-15T07:36:38Z)
- Diffusion Models with Adaptive Negative Sampling Without External Resources [54.84368884047812]
ANSWER is a training-free technique, applicable to any model that supports CFG, that allows for negative grounding of image concepts without explicit negative prompts. Experiments show that adding ANSWER to existing DMs outperforms the baselines on multiple benchmarks and is preferred by humans 2x more often than the other methods.
arXiv Detail & Related papers (2025-08-05T00:45:54Z)
- Negation-Aware Test-Time Adaptation for Vision-Language Models [26.043679706381646]
We study a practical but less-touched problem in Vision-Language Models (VLMs). Many real-world applications require models to explicitly identify what is false or non-existent. We propose a Negation-Aware Test-Time Adaptation (NEAT) method to efficiently adjust distribution-related parameters during inference.
arXiv Detail & Related papers (2025-07-25T08:25:48Z)
- Vision-Language Models Do Not Understand Negation [50.27667000027403]
NegBench is a benchmark designed to evaluate negation understanding across 18 task variations and 79k examples. We show that this approach can result in a 10% increase in recall on negated queries and a 28% boost in accuracy on multiple-choice questions with negated captions.
arXiv Detail & Related papers (2025-01-16T09:55:42Z)
- SeqZero: Few-shot Compositional Semantic Parsing with Sequential Prompts and Zero-shot Models [57.29358388475983]
Recent research showed promising results on combining pretrained language models with canonical utterance.
We propose a novel few-shot semantic parsing method -- SeqZero.
In particular, SeqZero brings out the merits of both models via an ensemble equipped with our proposed constrained rescaling.
arXiv Detail & Related papers (2022-05-15T21:13:15Z)
- Debiased Contrastive Learning of Unsupervised Sentence Representations [88.58117410398759]
Contrastive learning is effective in improving pre-trained language models (PLM) to derive high-quality sentence representations.
Previous works mostly adopt in-batch negatives or sample from training data at random.
We present a new framework, DCLR, to alleviate the influence of these improper negatives.
arXiv Detail & Related papers (2022-05-02T05:07:43Z)
- Unsupervised Deep Learning Meets Chan-Vese Model [77.24463525356566]
We propose an unsupervised image segmentation approach that integrates the Chan-Vese (CV) model with deep neural networks.
Our basic idea is to apply a deep neural network that maps the image into a latent space to alleviate the violation of the piecewise constant assumption in image space.
arXiv Detail & Related papers (2022-04-14T13:23:57Z)
- Contrastive Neighborhood Alignment [81.65103777329874]
We present Contrastive Neighborhood Alignment (CNA), a manifold learning approach to maintain the topology of learned features.
The target model aims to mimic the local structure of the source representation space using a contrastive loss.
CNA is illustrated in three scenarios: manifold learning, where the model maintains the local topology of the original data in a dimension-reduced space; model distillation, where a small student model is trained to mimic a larger teacher; and legacy model update, where an older model is replaced by a more powerful one.
arXiv Detail & Related papers (2022-01-06T04:58:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.