Rethinking Visual Information Processing in Multimodal LLMs
- URL: http://arxiv.org/abs/2511.10301v1
- Date: Fri, 14 Nov 2025 01:44:22 GMT
- Title: Rethinking Visual Information Processing in Multimodal LLMs
- Authors: Dongwan Kim, Viresh Ranjan, Takashi Nagata, Arnab Dhua, Amit Kumar K C
- Abstract summary: We present LLaViT - Large Language Models as extended Vision Transformers. We show that LLaViT significantly outperforms the baseline LLaVA method on a multitude of benchmarks.
- Score: 9.660144531857933
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the remarkable success of the LLaVA architecture for vision-language tasks, its design struggles to integrate visual features effectively due to the inherent mismatch between the text and vision modalities. We tackle this issue from a novel perspective in which the LLM serves not only as a language model but also as a powerful vision encoder. To this end, we present LLaViT - Large Language Models as extended Vision Transformers - which enables the LLM to simultaneously function as a vision encoder through three key modifications: (1) learning separate QKV projections for the vision modality, (2) enabling bidirectional attention on visual tokens, and (3) incorporating both global and local visual representations. Through extensive controlled experiments on a wide range of LLMs, we demonstrate that LLaViT significantly outperforms the baseline LLaVA method on a multitude of benchmarks, even surpassing models with double its parameter count, establishing a more effective approach to vision-language modeling.
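The abstract does not include code, so the following is a minimal PyTorch sketch of how the first two modifications could look inside a single attention layer: a second QKV projection applied only to visual tokens, and an attention mask that stays causal for text but is bidirectional among visual tokens. All names here (`VisionAwareAttention`, `is_visual`) are hypothetical, this is not the authors' implementation, and the third modification (combining global and local visual representations) is not shown.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionAwareAttention(nn.Module):
    """Sketch of one self-attention layer with modality-specific QKV and a hybrid mask."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv_text = nn.Linear(dim, 3 * dim)    # the LLM's original projections
        self.qkv_vision = nn.Linear(dim, 3 * dim)  # (1) separate QKV for visual tokens
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, is_visual: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) hidden states; is_visual: (B, T) bool, True at image tokens.
        B, T, _ = x.shape
        # Route each token through the projection for its modality.
        qkv = torch.where(is_visual.unsqueeze(-1), self.qkv_vision(x), self.qkv_text(x))
        q, k, v = (t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
                   for t in qkv.chunk(3, dim=-1))

        # (2) causal mask for text, but full bidirectional attention among visual tokens.
        causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        vis_pair = is_visual.unsqueeze(2) & is_visual.unsqueeze(1)  # (B, T, T)
        mask = causal.unsqueeze(0) | vis_pair                       # True = may attend

        y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask.unsqueeze(1))
        return self.out(y.transpose(1, 2).reshape(B, T, -1))

Computing both projections and selecting with torch.where keeps the sketch short at the cost of redundant work; a real implementation would index the two token groups separately.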
Related papers
- From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion [91.35078719566472]
Vision-Language Models (VLMs) suffer a severe visual feature bottleneck because the two modalities are joined by a crude, asymmetric connection. We introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities.
arXiv Detail & Related papers (2026-01-15T18:59:10Z)
- Attention Guided Alignment in Efficient Vision-Language Models [56.20286899428444]
Large Vision-Language Models (VLMs) rely on effective multimodal alignment between pre-trained vision encoders and Large Language Models (LLMs). This paper presents a comprehensive analysis of attention patterns in efficient VLMs. We introduce Attention-Guided Efficient Vision-Language Models (AGE-VLM), a novel framework that enhances visual grounding through interleaved cross-attention layers.
arXiv Detail & Related papers (2025-11-21T21:36:48Z)
- Visual Representation Alignment for Multimodal Large Language Models [38.319869213758686]
Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, but they remain limited in vision-centric tasks such as object counting or spatial reasoning. We present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models; a minimal sketch of this kind of regularizer appears after this list.
arXiv Detail & Related papers (2025-09-09T17:59:14Z)
- MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models [0.09895793818721334]
We introduce Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models (MASSV), which transforms existing small language models into effective multimodal drafters through a two-phase approach. Experiments across the Qwen2.5-VL and Gemma3 model families demonstrate that MASSV increases accepted length by up to 30% and delivers end-to-end inference speedups of up to 1.46x on visually grounded tasks (a toy illustration of accepted length appears after this list).
arXiv Detail & Related papers (2025-05-15T17:37:00Z)
- LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models [9.660892239615364]
This work explores fusion strategies of visual tokens for hybrid MLLMs, leading to the design of LEO, a novel MLLM with a dual-branch vision encoder framework that incorporates a post-adaptation fusion strategy and adaptive tiling. We show that LEO can be adapted to the specialized domain of autonomous driving without altering the model architecture or training recipe.
arXiv Detail & Related papers (2025-01-13T00:29:55Z)
- Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation [109.5893580175657]
In recent times, the standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. This approach often causes models to lean towards language comprehension and undermine the rich visual perception signals present in the data. We propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the LLM's hidden representations.
arXiv Detail & Related papers (2024-12-12T18:55:18Z)
- Enhancing Perception Capabilities of Multimodal LLMs with Training-Free Fusion [40.56646959926701]
Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision encoders with language models. Existing methods to enhance the visual perception of MLLMs often involve designing more powerful vision encoders. We introduce VisionFuse, a novel integration framework that efficiently utilizes multiple vision encoders from off-the-shelf MLLMs.
arXiv Detail & Related papers (2024-12-02T09:02:28Z)
- LLaVA-OneVision: Easy Visual Task Transfer [79.36225099277112]
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations.
Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video.
arXiv Detail & Related papers (2024-08-06T17:59:44Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most existing MLLMs adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge at two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)
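For the VIRAL entry above, the described regularizer (aligning the MLLM's internal visual representations with those of a frozen vision foundation model) can be pictured as a small auxiliary loss. The following is a generic cosine-similarity version under assumed names (`proj`, `mllm_feats`, `vfm_feats`); the paper's exact formulation may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

def alignment_loss(mllm_feats: torch.Tensor,  # (B, N, d_llm) visual-token hidden states
                   vfm_feats: torch.Tensor,   # (B, N, d_vfm) frozen VFM features
                   proj: nn.Module) -> torch.Tensor:
    # Learned projection maps the MLLM's hidden states into the VFM's feature space.
    pred = proj(mllm_feats)
    # Penalize angular disagreement with the (detached) foundation-model features.
    sim = F.cosine_similarity(pred, vfm_feats.detach(), dim=-1)
    return (1.0 - sim).mean()

# Assumed usage: total = lm_loss + lambda_align * alignment_loss(h_vis, vfm_out, proj_head)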
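Similarly, the "accepted length" metric in the MASSV entry can be illustrated with a toy greedy verification step: the drafter proposes k tokens and the target keeps the longest agreeing prefix. Real speculative decoding uses a probabilistic accept/reject rule, and MASSV's contribution is how the drafter is adapted and trained, not this check, so treat the snippet as context only. `draft_tokens` and `target_logits` are placeholder names.

import torch

@torch.no_grad()
def accepted_length(draft_tokens: torch.Tensor, target_logits: torch.Tensor) -> int:
    # draft_tokens: (k,) ids proposed by the small multimodal drafter
    # target_logits: (k, vocab) the target VLM's scores at those positions
    agree = draft_tokens == target_logits.argmax(dim=-1)
    # Longest prefix on which drafter and target agree (greedy variant).
    misses = (~agree).nonzero()
    return int(misses[0]) if misses.numel() else agree.numel()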
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.