Enhancing Perception Capabilities of Multimodal LLMs with Training-Free Fusion
- URL: http://arxiv.org/abs/2412.01289v2
- Date: Wed, 04 Dec 2024 09:51:16 GMT
- Title: Enhancing Perception Capabilities of Multimodal LLMs with Training-Free Fusion
- Authors: Zhuokun Chen, Jinwu Hu, Zeshuai Deng, Yufeng Wang, Bohan Zhuang, Mingkui Tan,
- Abstract summary: Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision encoders with language models.<n>Existing methods to enhance the visual perception of MLLMs often involve designing more powerful vision encoders.<n>We introduce VisionFuse, a novel integration framework that efficiently utilizes multiple vision encoders from off-the-shelf MLLMs.
- Score: 40.56646959926701
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision encoders with language models. Existing methods to enhance the visual perception of MLLMs often involve designing more powerful vision encoders, which requires exploring a vast design space and re-aligning each potential encoder with the language model, resulting in prohibitively high training costs. In this paper, we introduce VisionFuse, a novel integration framework that efficiently utilizes multiple vision encoders from off-the-shelf MLLMs to enhance visual perception without requiring additional training. Our approach is motivated by the observation that different MLLMs tend to focus on distinct regions given the same query and image. Moreover, we find that the feature distributions of vision encoders within an MLLM family, a group of MLLMs sharing the same pretrained LLM, are highly aligned. Building on these insights, VisionFuse enriches the visual context by concatenating the tokens generated by the vision encoders of selected MLLMs within a family. By merging the parameters of language models from these MLLMs, VisionFuse allows a single language model to align with various vision encoders, significantly reducing deployment overhead. We conduct comprehensive evaluations across multiple multimodal benchmarks using various MLLM combinations, demonstrating substantial improvements in multimodal tasks. Notably, when integrating MiniGemini-8B and SLIME-8B, VisionFuse achieves an average performance increase of over 4%.
Related papers
- Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models [53.06230963851451]
JARVIS is a JEPA-inspired framework for self-supervised visual enhancement in MLLMs.<n>We introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs.
arXiv Detail & Related papers (2025-12-17T19:01:34Z) - Rethinking Visual Information Processing in Multimodal LLMs [9.660144531857933]
We present LLaViT - Large Language Models as extended Vision Transformers.<n>We show that LLaViT significantly outperforms the baseline LLaVA method on a multitude of benchmarks.
arXiv Detail & Related papers (2025-11-13T13:36:30Z) - Vision-Centric Activation and Coordination for Multimodal Large Language Models [42.26911585599856]
Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities.<n>However, mainstream MLLMs are solely supervised by the next-token prediction of textual tokens, neglecting critical vision-centric information.<n>We introduce VaCo, which optimize MLLM representations through Vision-Centric activation and Coordination.
arXiv Detail & Related papers (2025-10-16T06:38:39Z) - LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models [9.660892239615364]
This work explores fusion strategies of visual tokens for hybrid MLLMs, leading to the design of LEO.
Leo is a novel MLLM with a dual-branch vision encoder framework that incorporates a post-adaptation fusion strategy and adaptive tiling.
We show that LEO can be adapted to the specialized domain of autonomous driving without altering the model architecture or training recipe.
arXiv Detail & Related papers (2025-01-13T00:29:55Z) - Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation [109.5893580175657]
In recent times, the standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision.<n>This approach often causes models to lean towards language comprehension and undermine the rich visual perception signals present in the data.<n>We propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the LLM's hidden representations.
arXiv Detail & Related papers (2024-12-12T18:55:18Z) - Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders [89.38717274524681]
This study explores the design space for multimodal large language models (MLLMs) using a mixture of vision encoders and resolutions.
Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach.
The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
arXiv Detail & Related papers (2024-08-28T17:59:31Z) - SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs [40.74693126923826]
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities.
Training adapters with image-level supervision often results in significant misalignment.
We introduce Supervised Embedding Alignment (SEA), a token-level alignment method that leverages vision-language pre-trained models.
arXiv Detail & Related papers (2024-08-21T17:58:02Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z) - Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short to comprehend context involving multiple images.
We propose a two phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z) - Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage
and Sharing in LLMs [72.49064988035126]
We propose an approach called MKS2, aimed at enhancing multimodal large language models (MLLMs)
Specifically, we introduce the Modular Visual Memory, a component integrated into the internal blocks of LLMs, designed to store open-world visual information efficiently.
Our experiments demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs in contexts necessitating physical or commonsense knowledge.
arXiv Detail & Related papers (2023-11-27T12:29:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.