Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens
- URL: http://arxiv.org/abs/2510.26302v1
- Date: Thu, 30 Oct 2025 09:41:21 GMT
- Authors: Ziliang Chen, Tianang Xiao, Jusheng Zhang, Yongsen Zheng, Xipeng Chen,
- Abstract summary: Contrastive Language-Image Pre-training delivers strong cross-modal generalization, yet it persistently fails at compositional reasoning over objects, attributes, and relations. We show the existence of pseudo-optimal text encoders that achieve perfect modal-invariant alignment.
- Score: 12.946160260124378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) delivers strong cross-modal generalization by aligning images and texts in a shared embedding space, yet it persistently fails at compositional reasoning over objects, attributes, and relations, often behaving like a bag-of-words matcher. Prior causal accounts typically model text as a single vector, obscuring token-level structure and leaving core phenomena, such as prompt sensitivity and failures on hard negatives, unexplained. We address this gap with a token-aware causal representation learning (CRL) framework grounded in a sequential, language-token SCM. Our theory extends block identifiability to tokenized text, proving that CLIP's contrastive objective can recover the modal-invariant latent variable under both sentence-level and token-level SCMs. Crucially, token granularity yields the first principled explanation of CLIP's compositional brittleness: composition nonidentifiability. We show the existence of pseudo-optimal text encoders that achieve perfect modal-invariant alignment yet are provably insensitive to SWAP, REPLACE, and ADD operations over atomic concepts, thereby failing to distinguish correct captions from hard negatives despite optimizing the same training objective as true-optimal encoders. The analysis further links language-side nonidentifiability to visual-side failures via the modality gap and shows how iterated composition operators compound hardness, motivating improved negative mining strategies.
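The SWAP, REPLACE, and ADD operations over atomic concepts can be illustrated with a toy sketch. The function names and the word-level "atomic concept" representation below are illustrative assumptions, not the paper's formalism; the point is only how each operation turns a correct caption into a hard negative.

```python
def swap(tokens, i, j):
    """SWAP: exchange two atomic concepts (e.g. subject and object)."""
    out = list(tokens)
    out[i], out[j] = out[j], out[i]
    return out

def replace(tokens, i, new):
    """REPLACE: substitute one atomic concept with a distractor."""
    out = list(tokens)
    out[i] = new
    return out

def add(tokens, i, new):
    """ADD: insert an extra, unsupported atomic concept."""
    return tokens[:i] + [new] + tokens[i:]

caption = ["black", "dog", "chases", "white", "cat"]
hard_negatives = [
    swap(caption, 1, 4),           # "black cat chases white dog"
    replace(caption, 0, "brown"),  # "brown dog chases white cat"
    add(caption, 3, "small"),      # "black dog chases small white cat"
]
```

A pseudo-optimal text encoder in the paper's sense would map the original caption and each of these hard negatives to (nearly) the same embedding, which is exactly why such negatives are hard for CLIP.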
Related papers
- Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning [23.10421006625293]
Vision-Language Models (VLMs) like CLIP struggle to understand negation. Existing methods refine negation understanding by fine-tuning CLIP's text encoder, risking overfitting. We propose CLIPGlasses, a plug-and-play framework that enhances CLIP's ability to comprehend negated visual descriptions.
arXiv Detail & Related papers (2026-02-24T15:55:39Z) - CORE: Context-Robust Remasking for Diffusion Language Models [51.59514489363897]
We propose Context-Robust Remasking (CORE), a training-free framework for inference-time revision. Rather than trusting static token probabilities, CORE identifies context-brittle tokens by probing their sensitivity to targeted masked-context perturbations. On LLaDA-8B-Base, CORE delivers consistent improvements across reasoning and code benchmarks, outperforming compute-matched baselines and improving MBPP by up to 9.2 percentage points.
arXiv Detail & Related papers (2026-02-04T00:12:30Z) - ImgCoT: Compressing Long Chain of Thought into Compact Visual Tokens for Efficient Reasoning of Large Language Model [34.90582960625524]
Compressing long chains of thought (CoT) into compact latent tokens is crucial for efficient reasoning with large language models (LLMs). We propose ImgCoT, which replaces the textual-CoT reconstruction target with a visual CoT obtained by rendering the CoT into images. This substitutes linguistic bias with spatial inductive bias, enabling latent tokens to better capture global reasoning structure.
arXiv Detail & Related papers (2026-01-30T09:06:45Z) - CoT-Seg: Rethinking Segmentation with Chain-of-Thought Reasoning and Self-Correction [50.67483317563736]
This paper explores a system that can think step by step, look up information if needed, generate results, self-evaluate those results, and refine them. We introduce CoT-Seg, a training-free framework that rethinks reasoning segmentation by combining chain-of-thought reasoning with self-correction.
arXiv Detail & Related papers (2026-01-24T11:41:54Z) - TokenSwap: Backdoor Attack on the Compositional Understanding of Large Vision-Language Models [57.32952956674526]
We introduce TokenSwap, a more evasive and stealthy backdoor attack on large vision-language models (LVLMs). Instead of enforcing fixed targeted content, TokenSwap subtly disrupts the understanding of object relationships in text. TokenSwap achieves high attack success rates while maintaining superior evasiveness and stealthiness.
arXiv Detail & Related papers (2025-09-29T10:19:22Z) - VITRIX-CLIPIN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions [16.90061119174727]
We present CLIP-IN, a novel framework that bolsters CLIP's fine-grained perception through two core innovations. First, we leverage instruction-editing datasets, originally designed for image manipulation, as a unique source of hard negative image-text pairs. Second, CLIP-IN incorporates long captions, using rotary positional encodings to capture rich semantic context often missed by standard CLIP.
arXiv Detail & Related papers (2025-08-04T11:57:10Z) - Tokenization Constraints in LLMs: A Study of Symbolic and Arithmetic Reasoning Limits [15.941209553757274]
Tokenization is the first, and often underappreciated, layer of computation in language models. We show that the success of such reasoning is fundamentally bounded by the structure of tokenized inputs.
arXiv Detail & Related papers (2025-05-20T10:32:30Z) - "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables better interpretability, aligning with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - Enhancing LLM Character-Level Manipulation via Divide and Conquer [74.55804812450164]
Large Language Models (LLMs) have demonstrated strong generalization capabilities across a wide range of natural language processing (NLP) tasks. However, they exhibit notable weaknesses in character-level string manipulation, struggling with fundamental operations such as character deletion, insertion, and substitution. We propose Character-Level Manipulation via Divide and Conquer, a novel approach designed to bridge the gap between token-level processing and character-level manipulation.
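The three character-level operations named in the summary can be sketched directly. This is a minimal illustration of the target operations themselves, not the paper's divide-and-conquer prompting method; all function names are assumed for illustration.

```python
def char_delete(s, ch):
    """Delete every occurrence of a character from the string."""
    return "".join(c for c in s if c != ch)

def char_insert(s, i, ch):
    """Insert a character at position i."""
    return s[:i] + ch + s[i:]

def char_substitute(s, old, new):
    """Substitute every occurrence of one character with another."""
    return "".join(new if c == old else c for c in s)
```

The divide-and-conquer idea is that a model which fails at these edits on whole tokens can succeed after the string is first decomposed into individual characters, edited, and recomposed.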
arXiv Detail & Related papers (2025-02-12T07:37:39Z) - Byte BPE Tokenization as an Inverse string Homomorphism [12.885921620444272]
We show that tokenization acts as an inverse homomorphism between strings and tokens. This suggests that the character space of the source language and the token space of the tokenized language are homomorphic. We also explore the concept of proper tokenization, which refers to an unambiguous tokenization returned by the tokenizer.
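The homomorphism view can be made concrete with a toy vocabulary (the vocabulary and the greedy longest-match strategy below are illustrative assumptions, not the paper's construction): detokenization maps each token to a string and concatenates, so it is a monoid homomorphism from token sequences to strings, and tokenization is one of its inverses.

```python
# Toy vocabulary: token id -> string (assumed for illustration).
vocab = {0: "lo", 1: "w", 2: "low", 3: "er", 4: "lower"}

def detokenize(token_ids):
    # Homomorphism: detokenize(s + t) == detokenize(s) + detokenize(t).
    return "".join(vocab[t] for t in token_ids)

def tokenize_greedy(text):
    # One "proper" (unambiguous) inverse of detokenize: longest match wins.
    by_len = sorted(vocab.items(), key=lambda kv: -len(kv[1]))
    ids, i = [], 0
    while i < len(text):
        for tid, s in by_len:
            if text.startswith(s, i):
                ids.append(tid)
                i += len(s)
                break
        else:
            raise ValueError("untokenizable text")
    return ids
```

Round-tripping recovers the source string (`detokenize(tokenize_greedy(s)) == s` for tokenizable `s`), which is exactly the inverse-homomorphism property the summary describes.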
arXiv Detail & Related papers (2024-12-04T09:38:11Z) - Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations [43.484570564890866]
Existing vision-language models (VLMs) treat text descriptions as a unit, confusing individual concepts in a prompt.
We present CC-Neg, a dataset containing 228,246 images, true captions and their corresponding negated captions.
Using CC-Neg along with modifications to the contrastive loss of CLIP, our proposed CoN-CLIP framework, has an improved understanding of negations.
arXiv Detail & Related papers (2024-03-29T17:33:42Z) - Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation [59.37587762543934]
This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS).
Existing methods suffer from a granularity inconsistency regarding the usage of group tokens.
We propose the prototypical guidance network (PGSeg) that incorporates multi-modal regularization.
arXiv Detail & Related papers (2023-10-29T13:18:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.