Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs
- URL: http://arxiv.org/abs/2506.09047v2
- Date: Wed, 11 Jun 2025 11:56:44 GMT
- Title: Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs
- Authors: Yaniv Nikankin, Dana Arad, Yossi Gandelsman, Yonatan Belinkov
- Abstract summary: Vision-Language models (VLMs) show impressive abilities to answer questions on visual inputs, yet demonstrate higher accuracies when performing an analogous task on text. We investigate this accuracy gap by identifying and comparing the \textit{circuits} in different modalities. To close this gap, we patch the representations of visual data tokens from later layers back into earlier layers.
- Score: 43.94713826224876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language models (VLMs) show impressive abilities to answer questions on visual inputs (e.g., counting objects in an image), yet demonstrate higher accuracies when performing an analogous task on text (e.g., counting words in a text). We investigate this accuracy gap by identifying and comparing the \textit{circuits} - the task-specific computational sub-graphs - in different modalities. We show that while circuits are largely disjoint between modalities, they implement relatively similar functionalities: the differences lie primarily in processing modality-specific data positions (an image or a text sequence). Zooming in on the image data representations, we observe they become aligned with the higher-performing analogous textual representations only towards later layers, too late in processing to effectively influence subsequent positions. To overcome this, we patch the representations of visual data tokens from later layers back into earlier layers. In experiments with multiple tasks and models, this simple intervention closes a third of the performance gap between the modalities, on average. Our analysis sheds light on the multi-modal performance gap in VLMs and suggests a training-free approach for reducing it.
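As a rough illustration of the intervention described in the abstract, below is a minimal back-patching sketch, not the authors' released code: it assumes a PyTorch VLM whose language-model decoder blocks are available as a list `layers`, and the function name `back_patch`, the layer indices `src_layer`/`dst_layer`, and the image-token positions `image_pos` are all hypothetical placeholders. It runs one clean forward pass to cache the later-layer hidden states at the image-token positions, then re-runs the model while overwriting an earlier layer's output at those positions with the cached states.

```python
# Minimal sketch of the "back-patching" idea (an illustration, not the paper's code).
# Assumptions: `layers` is a list of decoder blocks whose forward output is either a
# hidden-state tensor of shape (batch, seq, hidden) or a tuple starting with one;
# `image_pos` holds the sequence positions of the visual data tokens.
import torch

def back_patch(model, layers, inputs, image_pos, src_layer, dst_layer):
    cache = {}

    def save_hook(module, args, output):
        # Record the late-layer hidden states at the image-token positions.
        hidden = output[0] if isinstance(output, tuple) else output
        cache["h"] = hidden[:, image_pos, :].detach().clone()

    def patch_hook(module, args, output):
        # Overwrite the early layer's output at the same positions with the cached states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, image_pos, :] = cache["h"]
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    # Pass 1: clean run to collect later-layer representations of the image tokens.
    handle = layers[src_layer].register_forward_hook(save_hook)
    with torch.no_grad():
        model(**inputs)
    handle.remove()

    # Pass 2: run again, patching those representations back into an earlier layer.
    handle = layers[dst_layer].register_forward_hook(patch_hook)
    with torch.no_grad():
        outputs = model(**inputs)
    handle.remove()
    return outputs
```

Which source and destination layers work best would have to be chosen per model and task; per the abstract, this kind of intervention closes about a third of the modality gap on average across the tasks and models tested.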
Related papers
- Adding simple structure at inference improves Vision-Language Compositionality [15.785274903236663]
In this paper, we propose to add simple structure at inference, where, given an image and a caption, we divide the image into different smaller crops. We find that our approach consistently improves the performance of evaluated Vision-Language Models without any training.
arXiv Detail & Related papers (2025-06-11T13:06:25Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models [32.33892531885448]
Multimodal large language models (MLLMs) demonstrate strong performance across visual tasks, but their efficiency is hindered by significant computational and memory demands from processing long contexts in multimodal inputs. We introduce PAR (Prompt-Aware Token Reduction), a novel and plug-and-play approach that reduces visual tokens efficiently without compromising model performance.
arXiv Detail & Related papers (2024-10-09T07:13:22Z)
- Sparsity Meets Similarity: Leveraging Long-Tail Distribution for Dynamic Optimized Token Representation in Multimodal Large Language Models [6.467840081978855]
Multimodal large language models (MM-LLMs) have achieved significant success in various tasks. The main computational burden arises from processing text and visual tokens. We propose a dynamic pruning algorithm that identifies the inflection point in the visual CLS token similarity curve.
arXiv Detail & Related papers (2024-09-02T10:49:10Z)
- Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment [40.63340635482609]
Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner.
We advocate for assigning distinct contributions for each text token based on its visual correlation.
We introduce Contrastive ALignment (CAL), a simple yet effective re-weighting strategy that prioritizes training visually correlated tokens.
arXiv Detail & Related papers (2024-05-28T06:44:13Z)
- Improved Baselines for Data-efficient Perceptual Augmentation of LLMs [66.05826802808177]
In computer vision, large language models (LLMs) can be used to prime vision-language tasks such as image captioning and visual question answering.
We present an experimental evaluation of different interfacing mechanisms, across multiple tasks.
We identify a new interfacing mechanism that yields (near) optimal results across different tasks, while obtaining a 4x reduction in training time.
arXiv Detail & Related papers (2024-03-20T10:57:17Z)
- Distribution-Aware Prompt Tuning for Vision-Language Models [20.02599087680773]
A key to prompt tuning is the feature space alignment between two modalities via learnable vectors with model parameters fixed.
Inspired by this observation, we propose distribution-aware prompt tuning (DAPT) for vision-language models.
Our experiments on 11 benchmark datasets demonstrate that our method significantly improves generalizability.
arXiv Detail & Related papers (2023-09-06T23:49:11Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z)
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)