Rethinking Causal Mask Attention for Vision-Language Inference
- URL: http://arxiv.org/abs/2505.18605v1
- Date: Sat, 24 May 2025 08:59:28 GMT
- Title: Rethinking Causal Mask Attention for Vision-Language Inference
- Authors: Xiaohuan Pei, Tao Huang, YanXiang Ma, Chang Xu
- Abstract summary: We investigate how different causal masking strategies affect vision-language inference. We propose a family of future-aware attentions tailored for this setting. We show that selectively compressing future semantic context into past representations benefits inference.
- Score: 17.450072268270773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Causal attention has become a foundational mechanism in autoregressive vision-language models (VLMs), unifying textual and visual inputs under a single generative framework. However, existing causal mask-based strategies are inherited from large language models (LLMs), where they are tailored for text-only decoding, and their adaptation to vision tokens is insufficiently addressed in the prefill stage. Strictly masking future positions for vision queries introduces overly rigid constraints, which hinder the model's ability to leverage future context that often contains essential semantic cues for accurate inference. In this work, we empirically investigate how different causal masking strategies affect vision-language inference and then propose a family of future-aware attentions tailored for this setting. We first empirically analyze the effect of previewing future tokens for vision queries and demonstrate that rigid masking undermines the model's capacity to capture useful contextual semantic representations. Based on these findings, we propose a lightweight attention family that aggregates future visual context into past representations via pooling, effectively preserving the autoregressive structure while enhancing cross-token dependencies. We evaluate a range of causal masks across diverse vision-language inference settings and show that selectively compressing future semantic context into past representations benefits inference.
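The pooling idea sketched in the abstract can be illustrated with a short example. The following is a minimal PyTorch sketch, not the paper's actual implementation: a mean-pooled summary of the future vision tokens is blended into each earlier vision-token representation during the prefill, and ordinary causal attention is then applied, so the autoregressive mask itself is left untouched. The function names, the blending weight `alpha`, and the single-head attention without Q/K/V projections are illustrative assumptions.

```python
# Minimal sketch: fold a pooled summary of *future* vision tokens into past
# vision-token representations before standard causal attention (assumed
# design for illustration; names and alpha are not from the paper).
import torch
import torch.nn.functional as F


def pool_future_into_past(hidden, vision_mask, alpha=0.5):
    """Blend a mean-pooled summary of future vision tokens into past vision tokens.

    hidden:      (seq_len, d) token representations from the prefill stage
    vision_mask: (seq_len,) bool, True where the token is a vision token
    alpha:       blending weight for the pooled future context (assumed)
    """
    out = hidden.clone()
    vis_idx = vision_mask.nonzero(as_tuple=True)[0]
    for i, pos in enumerate(vis_idx):
        future = vis_idx[i + 1:]                 # vision tokens after this position
        if len(future) == 0:
            continue
        pooled = hidden[future].mean(dim=0)      # compress future visual semantics
        out[pos] = (1 - alpha) * hidden[pos] + alpha * pooled
    return out


def causal_prefill_attention(hidden, vision_mask):
    """Standard lower-triangular causal attention over future-aware representations."""
    h = pool_future_into_past(hidden, vision_mask)
    scores = h @ h.T / h.shape[-1] ** 0.5
    causal = torch.tril(torch.ones_like(scores, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ h


# toy usage: 4 text tokens followed by 4 vision tokens
hidden = torch.randn(8, 16)
vision_mask = torch.tensor([False] * 4 + [True] * 4)
out = causal_prefill_attention(hidden, vision_mask)
print(out.shape)  # torch.Size([8, 16])
```

Because the future context enters only through the token representations rather than through extra unmasked attention edges, the causal mask used at decode time does not need to change.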
Related papers
- Vision and Intention Boost Large Language Model in Long-Term Action Anticipation [39.66216219048517]
Long-term action anticipation aims to predict future actions over an extended period. Recent work leverages large language models (LLMs) through text-based inputs, which suffer severe information loss. We propose a novel Intention-Conditioned Vision-Language (ICVL) model that fully leverages the rich semantic information of visual data.
arXiv Detail & Related papers (2025-05-03T06:33:54Z) - Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models [93.46875303598577]
Vision-language models (VLMs) have advanced rapidly in processing multimodal information, but their ability to reconcile conflicting signals remains underexplored. This work investigates how VLMs process ASCII art, a unique medium where textual elements collectively form visual patterns, potentially creating semantic-visual conflicts.
arXiv Detail & Related papers (2025-04-02T10:47:07Z) - FutureVision: A methodology for the investigation of future cognition [0.5644620681963636]
We conduct a pilot study examining how visual fixation patterns vary during the evaluation of futuristic scenarios. Preliminary results show that far-future and pessimistic scenarios are associated with longer fixations and more erratic saccades.
arXiv Detail & Related papers (2025-02-03T18:29:06Z) - Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models [58.936893810674896]
Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. We introduce a multimodal large language model framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS). We propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images.
arXiv Detail & Related papers (2025-01-03T09:25:04Z) - Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment [57.0121616203175]
We propose FiSAO, a novel self-alignment method that utilizes the model's own visual encoder as a fine-grained verifier to improve vision-language alignment. By leveraging token-level feedback from the vision encoder, FiSAO significantly improves vision-language alignment, even surpassing traditional preference tuning methods that require additional data.
arXiv Detail & Related papers (2024-10-18T03:34:32Z) - Towards Interpreting Visual Information Processing in Vision-Language Models [24.51408101801313]
Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images. We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM.
arXiv Detail & Related papers (2024-10-09T17:55:02Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better than vision-only models at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)