Multimodal Function Vectors for Spatial Relations
- URL: http://arxiv.org/abs/2510.02528v1
- Date: Thu, 02 Oct 2025 19:55:56 GMT
- Title: Multimodal Function Vectors for Spatial Relations
- Authors: Shuhao Fu, Esther Goldberg, Ying Nian Wu, Hongjing Lu
- Abstract summary: We show that a small subset of attention heads in the vision-language model OpenFlamingo-4B is responsible for transmitting representations of spatial relations. The activations of these attention heads, termed function vectors, can be extracted and manipulated to alter an LMM's performance on relational tasks.
- Score: 33.20813174218433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Multimodal Models (LMMs) demonstrate impressive in-context learning abilities from limited multimodal demonstrations, yet the internal mechanisms supporting such task learning remain opaque. Building on prior work on large language models, we show that a small subset of attention heads in the vision-language model OpenFlamingo-4B is responsible for transmitting representations of spatial relations. The activations of these attention heads, termed function vectors, can be extracted and manipulated to alter an LMM's performance on relational tasks. First, using both synthetic and real image datasets, we apply causal mediation analysis to identify attention heads that strongly influence relational predictions, and extract multimodal function vectors that improve zero-shot accuracy at inference time. We further demonstrate that these multimodal function vectors can be fine-tuned with a modest amount of training data, while keeping LMM parameters frozen, to significantly outperform in-context learning baselines. Finally, we show that relation-specific function vectors can be linearly combined to solve analogy problems involving novel and untrained spatial relations, highlighting the strong generalization ability of this approach. Our results show that LMMs encode spatial relational knowledge within localized internal structures, which can be systematically extracted and optimized, thereby advancing our understanding of model modularity and enhancing control over relational reasoning in LMMs.
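The abstract describes three operations on function vectors: extracting them from attention-head activations, adding them to the model at inference time, and linearly combining relation-specific vectors. The NumPy sketch below illustrates one plausible reading of those steps on random stand-in data. It is not the authors' implementation: the array layout, the helper names (`extract_function_vector`, `inject`, `combine`), the `scale` parameter, and the hard-coded `selected_heads` are all assumptions made for illustration, and the causal mediation analysis that would actually choose the heads is not shown.

```python
import numpy as np

def extract_function_vector(head_outputs, selected_heads):
    """Sum the prompt-averaged outputs of the selected attention heads.

    head_outputs: hypothetical array of shape (n_prompts, n_layers, n_heads, d_model)
        holding each head's contribution to the hidden state at the final token.
    selected_heads: list of (layer, head) pairs, assumed to come from a separate
        head-selection step such as causal mediation analysis.
    """
    mean_outputs = head_outputs.mean(axis=0)  # (n_layers, n_heads, d_model)
    return np.stack([mean_outputs[l, h] for l, h in selected_heads]).sum(axis=0)

def inject(hidden_state, function_vector, scale=1.0):
    """Add a (scaled) function vector to a hidden state of shape (d_model,)."""
    return hidden_state + scale * function_vector

def combine(function_vectors, weights):
    """Linearly combine relation-specific function vectors."""
    return sum(w * v for w, v in zip(weights, function_vectors))

# Random stand-in activations for two hypothetical relations.
rng = np.random.default_rng(0)
n_prompts, n_layers, n_heads, d_model = 8, 4, 6, 16
head_outputs_left = rng.normal(size=(n_prompts, n_layers, n_heads, d_model))
head_outputs_above = rng.normal(size=(n_prompts, n_layers, n_heads, d_model))
selected_heads = [(1, 2), (3, 0)]  # illustrative choice, not from the paper

fv_left = extract_function_vector(head_outputs_left, selected_heads)
fv_above = extract_function_vector(head_outputs_above, selected_heads)

# A composite vector for an unseen relation, e.g. "above and to the left".
fv_combined = combine([fv_left, fv_above], [1.0, 1.0])

hidden_state = rng.normal(size=d_model)
steered_state = inject(hidden_state, fv_combined, scale=1.0)
```

In a real model the injection would happen inside a forward pass, for example through a hook on one layer's hidden state, and the fine-tuning described in the abstract would treat the function vector as the only trainable parameter while the LMM weights stay frozen.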
Related papers
- Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer [65.72553715508691]
We show that large vision-language models (LVLMs) lag behind strong text-only large language models (LLMs) on tasks that require multi-step inference and compositional decision-making.
We propose Shared Neuron Low-Rank Fusion (SNRF), a parameter-efficient framework that transfers mature inference circuitry from LLMs to LVLMs.
Our results demonstrate that shared neurons form an interpretable bridge between LLMs and LVLMs, enabling low-cost transfer of inference ability into multimodal models.
arXiv Detail & Related papers (2026-02-22T06:04:05Z)
- Relational Knowledge Distillation Using Fine-tuned Function Vectors [36.277498272417965]
Representing relations between concepts is a core prerequisite for intelligent systems to make sense of the world.
Recent work using causal mediation analysis has shown that a small set of attention heads encodes task representation in in-context learning.
We show that fine-tuning function vectors with only a small set of examples yields better performance on relation-based word-completion tasks.
arXiv Detail & Related papers (2026-01-13T03:02:18Z)
- Foundation Model for Skeleton-Based Human Action Understanding [56.89025287217221]
This paper presents a Unified Skeleton-based Dense Representation Learning (USDRL) framework.
USDRL consists of a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT).
arXiv Detail & Related papers (2025-08-18T02:42:16Z)
- Incentivizing Multimodal Reasoning in Large Models for Direct Robot Manipulation [89.5123417007126]
We show how to make Large Multimodal Models (LMMs) understand the spatial action space.
We also show how to fully exploit the reasoning capacity of LMMs in solving these tasks.
Our resulting reasoning model built upon a 7B backbone, named ReasonManip, demonstrates three notable advantages.
arXiv Detail & Related papers (2025-05-19T06:00:14Z)
- LLM Enhancers for GNNs: An Analysis from the Perspective of Causal Mechanism Identification [19.389891710579022]
We study the use of large language models (LLMs) as feature enhancers to optimize node representations, which are then used as inputs for graph neural networks (GNNs).
Building on the analytical results, we design a plug-and-play optimization module to improve the information transfer between LLM enhancers and GNNs.
arXiv Detail & Related papers (2025-05-13T06:29:25Z)
- Large Multi-modal Models Can Interpret Features in Large Multi-modal Models [51.485491249693155]
We first apply a Sparse Autoencoder (SAE) to disentangle the representations into human-understandable features.
We then present an automatic interpretation framework in which the LMMs themselves interpret the open-semantic features learned by the SAE.
arXiv Detail & Related papers (2024-11-22T14:41:36Z)
- Interpreting and Improving Large Language Models in Arithmetic Calculation [72.19753146621429]
Large language models (LLMs) have demonstrated remarkable potential across numerous applications.
In this work, we delve into uncovering a specific mechanism by which LLMs execute calculations.
We investigate the potential benefits of selectively fine-tuning these essential heads/MLPs to boost the LLMs' computational performance.
arXiv Detail & Related papers (2024-09-03T07:01:46Z)
- Graph-based Unsupervised Disentangled Representation Learning via Multimodal Large Language Models [42.17166746027585]
We introduce a bidirectional weighted graph-based framework to learn factorized attributes and their interrelations within complex data.
Specifically, we propose a $\beta$-VAE based module to extract factors as the initial nodes of the graph.
By integrating these complementary modules, our model successfully achieves fine-grained, practical and unsupervised disentanglement.
arXiv Detail & Related papers (2024-07-26T15:32:21Z)
- F-LMM: Grounding Frozen Large Multimodal Models [53.8059045627934]
We present F-LMM -- grounding frozen off-the-shelf LMMs in human-AI conversations.
It is based on the fact that word-pixel correspondences conducive to visual grounding inherently exist in the attention mechanism of well-trained LMMs.
It achieves competitive performance on referring expression segmentation and panoptic narrative grounding benchmarks.
arXiv Detail & Related papers (2024-06-09T15:14:26Z)
- Towards Modeling Learner Performance with Large Language Models [7.002923425715133]
This paper investigates whether the pattern recognition and sequence modeling capabilities of LLMs can be extended to the domain of knowledge tracing.
We compare two approaches to using LLMs for this task, zero-shot prompting and model fine-tuning, with existing, non-LLM approaches to knowledge tracing.
While LLM-based approaches do not achieve state-of-the-art performance, fine-tuned LLMs surpass the performance of naive baseline models and perform on par with standard Bayesian Knowledge Tracing approaches.
arXiv Detail & Related papers (2024-02-29T14:06:34Z)
- Chain-of-Thought Prompt Distillation for Multimodal Named Entity Recognition and Multimodal Relation Extraction [8.169359626365619]
We generate a chain of thought (CoT) -- a sequence of intermediate reasoning steps.
We present a novel conditional prompt distillation method to assimilate the commonsense reasoning ability from large language models.
Our approach attains state-of-the-art accuracy and manifests a plethora of advantages concerning interpretability, data efficiency, and cross-domain generalization.
arXiv Detail & Related papers (2023-06-25T04:33:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.