Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model
- URL: http://arxiv.org/abs/2405.17815v2
- Date: Mon, 04 Nov 2024 09:03:31 GMT
- Title: Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model
- Authors: Haogeng Liu, Quanzeng You, Xiaotian Han, Yongfei Liu, Huaibo Huang, Ran He, Hongxia Yang
- Abstract summary: We propose a vision-language connector that enables MLLMs to achieve high accuracy while maintaining low cost.
We first reveal the existence of visual anchors in the Vision Transformer and propose a cost-effective search algorithm to extract them.
We introduce the Anchor Former (AcFormer), a novel vision-language connector designed to leverage the rich prior knowledge obtained from these visual anchors during pretraining.
- Score: 82.93634081255942
- Abstract: In the realm of Multimodal Large Language Models (MLLMs), the vision-language connector plays a crucial role in linking pre-trained vision encoders with Large Language Models (LLMs). Despite its importance, the vision-language connector has been relatively underexplored. In this study, we propose a strong vision-language connector that enables MLLMs to achieve high accuracy while maintaining low computational cost. We first reveal the existence of visual anchors in the Vision Transformer and propose a cost-effective search algorithm to extract them. Building on these findings, we introduce the Anchor Former (AcFormer), a novel vision-language connector designed to leverage the rich prior knowledge obtained from these visual anchors during pretraining, guiding the aggregation of information. Through extensive experimentation, we demonstrate that the proposed method reduces computational cost by nearly two-thirds compared with the baseline while simultaneously outperforming baseline methods, highlighting the effectiveness and efficiency of AcFormer. Code is available at https://github.com/liuhaogeng/Anchor-Former.
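The abstract describes the mechanism only at a high level, so the following is a minimal sketch of how anchor-guided aggregation could look in PyTorch. It assumes the visual anchors are the patch tokens receiving the highest [CLS] attention and that they act as cross-attention queries over all patch tokens; the module name, the scoring rule, and all hyperparameters are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of an AcFormer-style connector (assumptions, not the authors' code).
import torch
import torch.nn as nn

class AnchorConnector(nn.Module):
    def __init__(self, dim: int, llm_dim: int, num_anchors: int = 64, num_heads: int = 8):
        super().__init__()
        self.num_anchors = num_anchors
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, llm_dim)  # project aggregated tokens into the LLM embedding space

    def forward(self, patch_tokens: torch.Tensor, cls_attn: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim) ViT patch embeddings
        # cls_attn:     (B, N) attention weights from the [CLS] token to each patch
        # 1) "Search" step: take the top-k most attended patches as visual anchors.
        idx = cls_attn.topk(self.num_anchors, dim=-1).indices                      # (B, K)
        anchors = torch.gather(
            patch_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1))
        )                                                                           # (B, K, dim)
        # 2) Anchors act as queries and aggregate information from all patch tokens.
        out, _ = self.cross_attn(anchors, patch_tokens, patch_tokens)               # (B, K, dim)
        return self.proj(out)                                                       # (B, K, llm_dim)
```

Because only K anchor tokens reach the LLM instead of all N patch tokens, a connector of this shape would reduce the number of visual tokens the LLM must process, which is the plausible source of the reported compute savings.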
Related papers
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by an average of 0.77% in accuracy on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
- Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension [21.500920290909843]
We propose a new pretraining paradigm for Large Language Models (LLMs) to enhance their visual comprehension capabilities.
Specifically, we design a dynamically learnable prompt token pool and employ the Hungarian algorithm to replace part of the original visual tokens with the most relevant prompt tokens (a hedged sketch of this matching step follows the list below).
We present a new foundation model called Croc, which achieves new state-of-the-art performance on massive vision-language benchmarks.
arXiv Detail & Related papers (2024-10-18T09:44:25Z)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- Improving Multi-modal Large Language Model through Boosting Vision Capabilities [54.344077285545005]
We focus on improving visual understanding capabilities to boost vision-language models.
We propose Arcana, a multimodal language model, which introduces two crucial techniques.
arXiv Detail & Related papers (2024-10-17T16:36:38Z)
- EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- Towards Multimodal In-Context Learning for Vision & Language Models [21.69457980865084]
State-of-the-art Vision-Language Models (VLMs) ground the vision and language modalities.
We propose a simple yet surprisingly effective multi-turn curriculum-based learning methodology with effective data mixes.
arXiv Detail & Related papers (2024-03-19T13:53:37Z)
- PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter [21.45490901191175]
PaLM2-VAdapter employs a progressively aligned language model as the vision-language adapter.
Our method achieves these advancements with 30-70% fewer parameters than the state-of-the-art large vision-language models.
arXiv Detail & Related papers (2024-02-16T18:54:47Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
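The Croc entry above mentions replacing visual tokens with prompt tokens via the Hungarian algorithm. The sketch below illustrates, under stated assumptions, how such a matching-and-replacement step could be written; the cosine-similarity cost, the function name, and the `num_replace` parameter are hypothetical choices, not the paper's actual procedure.

```python
# Hedged sketch of a Hungarian-matching token-replacement step (not the Croc authors' code).
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def replace_with_prompts(visual: torch.Tensor, prompts: torch.Tensor, num_replace: int) -> torch.Tensor:
    """visual: (N, D) visual tokens; prompts: (P, D) learnable prompt-token pool."""
    # Cosine similarity between every prompt token and every visual token.
    sim = F.normalize(prompts, dim=-1) @ F.normalize(visual, dim=-1).T   # (P, N)
    # Hungarian algorithm on the negated similarity = maximum-similarity one-to-one matching.
    rows, cols = linear_sum_assignment(-sim.detach().cpu().numpy())
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
    # Keep the num_replace best-matched pairs and swap those visual tokens
    # for their assigned prompt tokens.
    best = sim[rows, cols].argsort(descending=True)[:num_replace]
    out = visual.clone()
    out[cols[best]] = prompts[rows[best]]
    return out
```

Running `scipy.optimize.linear_sum_assignment` on the negated similarity matrix yields an assignment that maximizes total similarity, which is the standard way to cast a matching problem of this kind as the Hungarian algorithm.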