Related papers: Cross-modal Information Flow in Multimodal Large Language Models

Cross-modal Information Flow in Multimodal Large Language Models

URL: http://arxiv.org/abs/2411.18620v2
Date: Tue, 25 Mar 2025 18:59:50 GMT
Title: Cross-modal Information Flow in Multimodal Large Language Models
Authors: Zhi Zhang, Srishti Yadav, Fengze Han, Ekaterina Shutova,
Abstract summary: We investigate the information flow between different modalities -- language and vision -- in large language models (MLLMs)<n>We find that there are two distinct stages in the process of integration of the two modalities.<n>Our findings provide a new and comprehensive perspective on the spatial and functional aspects of image and language processing in the MLLMs.
Score: 14.853197288189579
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The recent advancements in auto-regressive multimodal large language models (MLLMs) have demonstrated promising progress for vision-language tasks. While there exists a variety of studies investigating the processing of linguistic information within large language models, little is currently known about the inner working mechanism of MLLMs and how linguistic and visual information interact within these models. In this study, we aim to fill this gap by examining the information flow between different modalities -- language and vision -- in MLLMs, focusing on visual question answering. Specifically, given an image-question pair as input, we investigate where in the model and how the visual and linguistic information are combined to generate the final prediction. Conducting experiments with a series of models from the LLaVA series, we find that there are two distinct stages in the process of integration of the two modalities. In the lower layers, the model first transfers the more general visual features of the whole image into the representations of (linguistic) question tokens. In the middle layers, it once again transfers visual information about specific objects relevant to the question to the respective token positions of the question. Finally, in the higher layers, the resulting multimodal representation is propagated to the last position of the input sequence for the final prediction. Overall, our findings provide a new and comprehensive perspective on the spatial and functional aspects of image and language processing in the MLLMs, thereby facilitating future research into multimodal information localization and editing. Our code and collected dataset are released here: https://github.com/FightingFighting/cross-modal-information-flow-in-MLLM.git.

Related papers

Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models [58.91911788912665]
We propose Latent Visual Reconstruction (LaVer), a novel training framework that facilitates MLLMs in learning more discrimi visual representations.<n>Our method offers direct visual activation to MLLMs, which exhibit increased visual attention allocation, indicating enhanced utilization of visual information.
arXiv Detail & Related papers (2025-12-06T04:20:13Z)
Visual Representations inside the Language Model [36.35124375782294]
We study flow of visual information through the language model, finding that image value tokens encode sufficient information to perform several perception-heavy tasks.<n>We find that while the language model does augment the visual information received from the projection of input visual encodings, it contains less visual information on several tasks than the equivalent visual encoder (SigLIP)<n>Next, we discuss controlling visual information in the language model, showing that adding a text prefix to the image input improves perception capabilities of visual representations.
arXiv Detail & Related papers (2025-10-06T14:01:39Z)
Exploring Vision Language Models for Multimodal and Multilingual Stance Detection [9.079302402271491]
Social media's global reach amplifies the spread of information, highlighting the need for robust Natural Language Processing tasks. Prior research predominantly focuses on text-only inputs, leaving multimodal scenarios relatively underexplored. This paper evaluates state-of-the-art Vision-Language Models (VLMs) on multimodal and multilingual stance detection tasks.
arXiv Detail & Related papers (2025-01-29T13:39:53Z)
Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges [15.850548556536538]
Large Language Models (LLMs) represent a class of deep learning models adept at understanding natural language. An advanced subset of these models, known as Multimodal Large Language Models (MLLMs), extends LLM capabilities to process and interpret multiple data modalities. This survey provides a comprehensive overview of the recent advancements in LLMs.
arXiv Detail & Related papers (2024-12-04T11:14:06Z)
RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks. Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs. In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
Towards Interpreting Visual Information Processing in Vision-Language Models [24.51408101801313]
Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images. We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM.
arXiv Detail & Related papers (2024-10-09T17:55:02Z)
Semantic Alignment for Multimodal Large Language Models [72.10272479476161]
We introduce Semantic Alignment for Multi-modal large language models (SAM) By involving the bidirectional semantic guidance between different images in the visual-token extraction process, SAM aims to enhance the preservation of linking information for coherent analysis. By involving the bidirectional semantic guidance between different images in the visual-token extraction process, SAM aims to enhance the preservation of linking information for coherent analysis.
arXiv Detail & Related papers (2024-08-23T06:48:46Z)
Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets. However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs. This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [89.24440488456405]
VisionLLM v2 is an end-to-end generalist multimodal large model (MLLM) It unifies visual perception, understanding, and generation within a single framework.
arXiv Detail & Related papers (2024-06-12T16:44:50Z)
Understanding Information Storage and Transfer in Multi-modal Large Language Models [51.20840103605018]
We study how Multi-modal Large Language Models process information in a factual visual question answering task. Key findings show that these MLLMs rely on self-attention blocks in much earlier layers for information storage. We introduce MultEdit, a model-editing algorithm that can correct errors and insert new long-tailed information into MLLMs.
arXiv Detail & Related papers (2024-06-06T16:35:36Z)
From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks [33.476693301050275]
We conduct experiments with truncation strategies across various LVLMs for visual question answering and image captioning tasks. By exploring the information flow from the perspective of visual representation contribution, we observe that it tends to converge in shallow layers but diversify in deeper layers.
arXiv Detail & Related papers (2024-06-04T13:52:54Z)
Probing Multimodal Large Language Models for Global and Local Semantic Representations [57.25949445963422]
We study which layers of Multimodal Large Language Models make the most effort to the global image information. In this study, we find that the intermediate layers of models can encode more global semantic information. We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information.
arXiv Detail & Related papers (2024-02-27T08:27:15Z)
Veagle: Advancements in Multimodal Representation Learning [0.0]
This paper introduces a novel approach to enhance the multimodal capabilities of existing models. Our proposed model Veagle, incorporates a unique mechanism inspired by the successes and insights of previous works. Our results indicate a improvement of 5-6 % in performance, with Veagle outperforming existing models by a notable margin.
arXiv Detail & Related papers (2024-01-18T12:45:25Z)
Language Is Not All You Need: Aligning Perception with Language Models [110.51362453720458]
We introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context, and follow instructions. We train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP. We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language
arXiv Detail & Related papers (2023-02-27T18:55:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.