Dense Connector for MLLMs
- URL: http://arxiv.org/abs/2405.13800v2
- Date: Thu, 14 Nov 2024 23:53:05 GMT
- Title: Dense Connector for MLLMs
- Authors: Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang
- Abstract summary: We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
- Score: 89.50595155217108
- Abstract: Do we fully leverage the potential of visual encoder in Multimodal Large Language Models (MLLMs)? The recent outstanding performance of MLLMs in multimodal understanding has garnered broad attention from both academia and industry. In the current MLLM rat race, the focus seems to be predominantly on the linguistic side. We witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger-sized LLMs. Yet, scant attention has been directed towards the visual signals utilized by MLLMs, often assumed to be the final high-level features extracted by a frozen visual encoder. In this paper, we introduce the Dense Connector - a simple, effective, and plug-and-play vision-language connector that significantly enhances existing MLLMs by leveraging multi-layer visual features, with minimal additional computational overhead. Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens. Furthermore, our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well. Experimental results across various vision encoders, image resolutions, training dataset scales, varying sizes of LLMs (2.7B->70B), and diverse architectures of MLLMs (e.g., LLaVA-v1.5, LLaVA-NeXT and Mini-Gemini) validate the versatility and scalability of our approach, achieving state-of-the-art performance across 19 image and video benchmarks. We hope that this work will provide valuable experience and serve as a basic module for future MLLM development. Code is available at https://github.com/HJYao00/DenseConnector .
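The abstract describes two ideas without giving formulas: feeding the LLM features from several visual-encoder layers rather than only the last one, and cutting the visual token count to 25%. The following is a minimal sketch of one way these could fit together, not the authors' implementation. It assumes channel-wise concatenation as the fusion operator and 2x2 average pooling as the token reduction (for a 24x24 grid of 576 tokens, this would leave 144). The function names `concat_layers`, `avg_pool_tokens`, and `linear` are hypothetical, and plain Python lists stand in for tensors.

```python
import random


def concat_layers(layer_feats):
    """Fuse per-layer features by channel-wise concatenation (an assumed operator).

    layer_feats: list of feature maps, each shaped [num_tokens][channels].
    Returns one map shaped [num_tokens][channels * num_layers].
    """
    num_tokens = len(layer_feats[0])
    return [
        [value for feats in layer_feats for value in feats[i]]
        for i in range(num_tokens)
    ]


def avg_pool_tokens(tokens, grid, window=2):
    """Average-pool a grid x grid token map with a window x window kernel.

    With window=2 the output keeps 1/4 of the input tokens.
    """
    channels = len(tokens[0])
    out_grid = grid // window
    pooled = []
    for r in range(out_grid):
        for c in range(out_grid):
            # Gather the tokens covered by this pooling window (row-major layout).
            cell = [
                tokens[(r * window + dr) * grid + (c * window + dc)]
                for dr in range(window)
                for dc in range(window)
            ]
            pooled.append(
                [sum(tok[k] for tok in cell) / len(cell) for k in range(channels)]
            )
    return pooled


def linear(tokens, weight):
    """Project tokens: [N][C_in] times weight [C_in][C_out] gives [N][C_out]."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(tok, col)) for col in columns] for tok in tokens]


# Toy sizes: a 4x4 token grid, 3 channels, 3 selected encoder layers.
random.seed(0)
grid, channels, num_layers, llm_dim = 4, 3, 3, 5
layers = [
    [[random.random() for _ in range(channels)] for _ in range(grid * grid)]
    for _ in range(num_layers)
]
fused = concat_layers(layers)            # 16 tokens, 9 channels each
reduced = avg_pool_tokens(fused, grid)   # 4 tokens, 9 channels each
weight = [[random.random() for _ in range(llm_dim)] for _ in range(channels * num_layers)]
visual_tokens = linear(reduced, weight)  # 4 tokens in the (toy) LLM embedding size
print(len(fused), len(reduced), len(visual_tokens[0]))
```

Concatenating along the channel axis leaves the token count unchanged, so the extra cost falls on the projection layer rather than on the LLM's sequence length. That would be consistent with the "minimal additional computational overhead" the abstract claims, though the actual design is given in the paper and the linked repository.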
Related papers
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling [56.130911402831906]
This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling.
We develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs' ability to perceive fine-grained details in videos.
Experimental results demonstrate that this unique design of LRC greatly improves the results of video MLLMs on mainstream understanding benchmarks.
arXiv Detail & Related papers (2025-01-21T18:59:00Z)
- LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models [9.660892239615364]
This work explores fusion strategies of visual tokens for hybrid MLLMs, leading to the design of LEO.
LEO is a novel MLLM with a dual-branch vision encoder framework that incorporates a post-adaptation fusion strategy and adaptive tiling.
We show that LEO can be adapted to the specialized domain of autonomous driving without altering the model architecture or training recipe.
arXiv Detail & Related papers (2025-01-13T00:29:55Z)
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation [95.78870389271832]
The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision.
We propose OLA-VLM, the first approach distilling knowledge into the LLM's hidden representations from a set of target visual representations.
We show that OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
arXiv Detail & Related papers (2024-12-12T18:55:18Z)
- AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning [19.68349294206012]
We propose a training-free adaptive inference method for multi-modal LLMs.
With a minimalist design, our method can be applied to both video and image LLMs.
Under a similar computational cost, our method outperforms the state-of-the-art methods in long video understanding.
arXiv Detail & Related papers (2024-12-04T11:47:57Z)
- Enhancing Perception Capabilities of Multimodal LLMs with Training-Free Fusion [40.56646959926701]
Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision encoders with language models.
Existing methods to enhance the visual perception of MLLMs often involve designing more powerful vision encoders.
We introduce VisionFuse, a novel integration framework that efficiently utilizes multiple vision encoders from off-the-shelf MLLMs.
arXiv Detail & Related papers (2024-12-02T09:02:28Z)
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders [89.38717274524681]
This study explores the design space for multimodal large language models (MLLMs) using a mixture of vision encoders and resolutions.
Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach.
The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
arXiv Detail & Related papers (2024-08-28T17:59:31Z)
- OneLLM: One Framework to Align All Modalities with Language [86.8818857465443]
We present OneLLM, an MLLM that aligns eight modalities to language using a unified framework.
OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning.
arXiv Detail & Related papers (2023-12-06T18:59:19Z)
- LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge at two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)
- InfMLLM: A Unified Framework for Visual-Language Tasks [44.29407348046122]
Multimodal large language models (MLLMs) have attracted growing interest.
This work delves into enabling LLMs to tackle more vision-language-related tasks.
InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
arXiv Detail & Related papers (2023-11-12T09:58:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.