MouSi: Poly-Visual-Expert Vision-Language Models
- URL: http://arxiv.org/abs/2401.17221v1
- Date: Tue, 30 Jan 2024 18:09:11 GMT
- Title: MouSi: Poly-Visual-Expert Vision-Language Models
- Authors: Xiaoran Fan, Tao Ji, Changhao Jiang, Shuo Li, Senjie Jin, Sirui Song,
Junke Wang, Boyang Hong, Lu Chen, Guodong Zheng, Ming Zhang, Caishuang Huang,
Rui Zheng, Zhiheng Xi, Yuhao Zhou, Shihan Dou, Junjie Ye, Hang Yan, Tao Gui,
Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang
- Abstract summary: This paper proposes an ensemble-of-experts technique that synergizes the capabilities of individual visual encoders.
This technique introduces a fusion network to unify the processing of outputs from different visual experts.
In our implementation, this technique significantly reduces the positional occupancy of encoders such as SAM from 4096 positions to 64, or even to a single position.
- Score: 132.58949014605477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current large vision-language models (VLMs) often encounter challenges such
as insufficient capabilities of a single visual component and excessively long
visual tokens. These issues can limit the model's effectiveness in accurately
interpreting complex visual information and overly long contextual
information. Addressing these challenges is crucial for enhancing the
performance and applicability of VLMs. This paper proposes an ensemble-of-experts
technique that synergizes the capabilities of individual visual encoders,
including those skilled in image-text matching, OCR, image segmentation, etc.
This technique introduces a fusion network to unify the processing of outputs
from different visual experts, while bridging the gap between image encoders
and pre-trained LLMs. In addition, we explore different positional encoding
schemes to alleviate the waste of positional encoding caused by lengthy image
feature sequences, effectively addressing the issue of position overflow and
length limitations. For instance, in our implementation this technique
significantly reduces the positional occupancy of encoders such as SAM from
4096 positions to 64, or even to a single position.
Experimental results demonstrate that VLMs with multiple experts exhibit
consistently superior performance over isolated visual encoders and mark a
significant performance boost as more experts are integrated. We have
open-sourced the training code used in this report. All of these resources can
be found on our project website.
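As a rough illustration of the core idea, the sketch below projects each visual expert's tokens into the LLM embedding space and pools long sequences down to a small number of positions (e.g., SAM-style 4096 patch tokens to 64). This is a minimal PyTorch sketch, not the released MouSi code; the expert dimensions, the pooling-based compression, and all class and variable names are assumptions.

```python
# Minimal sketch of a poly-visual-expert fusion layer.
# Not the authors' released code: expert dimensions, the pooling-based
# token compression, and all names are assumptions.
import torch
import torch.nn as nn


class PolyExpertFusion(nn.Module):
    """Projects each expert's tokens into the LLM embedding space and
    pools long sequences so they occupy far fewer positions."""

    def __init__(self, expert_dims, llm_dim, tokens_per_expert):
        super().__init__()
        # One projection per visual expert (e.g. CLIP, SAM, an OCR encoder).
        self.projections = nn.ModuleList(
            [nn.Linear(d, llm_dim) for d in expert_dims]
        )
        # Target token count per expert, e.g. compress SAM's 64 x 64 = 4096
        # patch tokens down to 64 (or even 1) positions.
        self.tokens_per_expert = tokens_per_expert

    def forward(self, expert_outputs):
        fused = []
        for feats, proj, n_tokens in zip(
            expert_outputs, self.projections, self.tokens_per_expert
        ):
            x = proj(feats)                    # (B, L, llm_dim)
            x = x.transpose(1, 2)              # (B, llm_dim, L)
            x = nn.functional.adaptive_avg_pool1d(x, n_tokens)
            fused.append(x.transpose(1, 2))    # (B, n_tokens, llm_dim)
        # Concatenate along the sequence axis; the result is prepended to
        # the text tokens fed to the pre-trained LLM.
        return torch.cat(fused, dim=1)


# Hypothetical example: CLIP-like tokens (577 x 1024) plus SAM-like tokens
# (4096 x 256), each compressed to 64 positions of a 4096-dim LLM.
fusion = PolyExpertFusion([1024, 256], llm_dim=4096, tokens_per_expert=[64, 64])
visual_prefix = fusion([torch.randn(1, 577, 1024), torch.randn(1, 4096, 256)])
print(visual_prefix.shape)  # torch.Size([1, 128, 4096])
```

The released fusion network and positional-encoding scheme may differ; the point of the sketch is only how multi-expert outputs can share a single, much shorter visual prefix.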
Related papers
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders [89.38717274524681]
This study explores the design space for multimodal large language models (MLLMs) using a mixture of vision encoders and resolutions.
Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach.
The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
arXiv Detail & Related papers (2024-08-28T17:59:31Z)
- MoVA: Adapting Mixture of Vision Experts to Multimodal Context [38.8308841469793]
We propose MoVA, a powerful and novel MLLM that adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism.
In the coarse-grained stage, we design a context-aware expert routing strategy to dynamically select the most suitable vision experts.
In the fine-grained stage, we design a mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge.
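A minimal sketch of such coarse-to-fine routing is shown below: a linear router scores the experts from a pooled context embedding, the top-k experts are selected, and their features are fused with softmax gates. The router input, the top-k gating, and all names are assumptions rather than the paper's exact MoV-Adapter.

```python
# Hedged sketch of coarse-to-fine expert routing; not the MoVA implementation.
import torch
import torch.nn as nn


class CoarseToFineRouter(nn.Module):
    def __init__(self, ctx_dim, num_experts, top_k=2):
        super().__init__()
        self.router = nn.Linear(ctx_dim, num_experts)  # coarse routing scores
        self.top_k = top_k

    def forward(self, context, expert_feats):
        # context:      (B, ctx_dim) pooled instruction/image embedding
        # expert_feats: (B, E, L, D) pre-projected tokens from E experts
        scores = self.router(context)                      # (B, E)
        top_val, top_idx = scores.topk(self.top_k, dim=-1)
        gates = torch.softmax(top_val, dim=-1)             # fine-grained weights
        B, E, L, D = expert_feats.shape
        idx = top_idx[:, :, None, None].expand(-1, -1, L, D)
        selected = expert_feats.gather(1, idx)             # (B, k, L, D)
        # Gate-weighted fusion of the selected experts' features.
        return (gates[:, :, None, None] * selected).sum(dim=1)  # (B, L, D)


router = CoarseToFineRouter(ctx_dim=4096, num_experts=4, top_k=2)
fused = router(torch.randn(2, 4096), torch.randn(2, 4, 64, 4096))  # (2, 64, 4096)
```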
arXiv Detail & Related papers (2024-04-19T17:59:48Z)
- BRAVE: Broadening the visual encoding of vision-language models [48.41146184575914]
Vision-language models (VLMs) are composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks.
Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders.
We introduce BRAVE, which consolidates features from multiple frozen encoders into a more versatile representation that can be directly fed as the input to a frozen LM.
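One way to realize this kind of consolidation is a learnable-query cross-attention resampler over the concatenated encoder features, sketched below; the query count, attention setup, and dimensions are assumptions, not BRAVE's exact architecture.

```python
# Hedged sketch: consolidate several frozen vision encoders' features into a
# fixed-length sequence for a frozen LM. All sizes and names are assumptions.
import torch
import torch.nn as nn


class MultiEncoderResampler(nn.Module):
    def __init__(self, encoder_dims, hidden_dim, num_queries=32, num_heads=8):
        super().__init__()
        # Project each frozen encoder's features to a shared width.
        self.proj = nn.ModuleList([nn.Linear(d, hidden_dim) for d in encoder_dims])
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, encoder_feats):
        # encoder_feats: list of (B, L_i, D_i) tensors from the frozen encoders.
        tokens = torch.cat([p(f) for p, f in zip(self.proj, encoder_feats)], dim=1)
        q = self.queries.unsqueeze(0).expand(tokens.shape[0], -1, -1)  # (B, Q, H)
        out, _ = self.cross_attn(q, tokens, tokens)                    # (B, Q, H)
        return out  # fixed-length visual prefix for the frozen LM


resampler = MultiEncoderResampler([1024, 768, 1152], hidden_dim=2048)
feats = [torch.randn(1, 577, 1024), torch.randn(1, 196, 768), torch.randn(1, 729, 1152)]
prefix = resampler(feats)  # (1, 32, 2048)
```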
arXiv Detail & Related papers (2024-04-10T17:59:45Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
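The sketch below shows one hedged way such a connection could look: image patch tokens and encoded visual prompts (here, simple normalized boxes passed through an MLP) are projected into the LLM embedding space and concatenated. The prompt-encoder design and all names are assumptions, not the paper's released model.

```python
# Hedged sketch of wiring a vision encoder, a visual prompt encoder, and an
# LLM input sequence together. The MLP prompt encoder is an assumption.
import torch
import torch.nn as nn


class VisualPromptInputs(nn.Module):
    def __init__(self, vis_dim, llm_dim, prompt_dim=4):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, llm_dim)
        # Encodes a user-drawn prompt (e.g. a normalized box) into one token.
        self.prompt_encoder = nn.Sequential(
            nn.Linear(prompt_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, patch_tokens, boxes):
        # patch_tokens: (B, P, vis_dim); boxes: (B, N, 4) normalized xyxy.
        vis = self.vis_proj(patch_tokens)      # (B, P, llm_dim)
        prompts = self.prompt_encoder(boxes)   # (B, N, llm_dim)
        # The LLM receives image tokens followed by visual-prompt tokens.
        return torch.cat([vis, prompts], dim=1)


inputs = VisualPromptInputs(vis_dim=1024, llm_dim=4096)
tokens = inputs(torch.randn(1, 577, 1024), torch.rand(1, 2, 4))  # (1, 579, 4096)
```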
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
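A hedged sketch of the underlying two-pass idea: after a first pass proposes a region of interest, that region is cropped from the full-resolution image and re-encoded, so the model sees both the global context and the zoomed crop. The two-pass flow and the encode() interface are assumptions, not the paper's exact procedure.

```python
# Hedged sketch of a region-of-interest second pass. encode() is a
# hypothetical callable mapping an image tensor to visual tokens (1, L, D).
import torch


def region_second_pass(image: torch.Tensor, bbox, encode):
    """image: (C, H, W) full-resolution tensor; bbox = (left, top, right, bottom)
    in pixels, proposed by the first reasoning pass."""
    left, top, right, bottom = bbox
    full_tokens = encode(image)
    roi_tokens = encode(image[:, top:bottom, left:right])  # crop, no resizing
    # Concatenate so the model attends to both global context and the region.
    return torch.cat([full_tokens, roi_tokens], dim=1)
```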
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them with the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs).
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
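A generic way to express such an alignment objective is a symmetric InfoNCE loss between pooled features from the two encoders, sketched below; the pooling, temperature, and symmetric form are assumptions rather than the exact DoCo loss.

```python
# Hedged sketch of a contrastive alignment step: pull the vision encoder's
# pooled features toward the auxiliary document-object encoder's features.
import torch
import torch.nn.functional as F


def doc_object_contrastive_loss(vision_feats, doc_obj_feats, temperature=0.07):
    # vision_feats, doc_obj_feats: (B, D) pooled features for the same images,
    # from the LVLM's vision encoder and the auxiliary multimodal encoder.
    v = F.normalize(vision_feats, dim=-1)
    d = F.normalize(doc_obj_feats, dim=-1)
    logits = v @ d.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric InfoNCE: each image matches its own document-object features.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


loss = doc_object_contrastive_loss(torch.randn(8, 1024), torch.randn(8, 1024))
```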
arXiv Detail & Related papers (2024-02-29T10:17:27Z)
- Question Aware Vision Transformer for Multimodal Reasoning [14.188369270753347]
We introduce QA-ViT, a Question Aware Vision Transformer approach for multimodal reasoning.
It embeds question awareness directly within the vision encoder.
This integration results in dynamic visual features that focus on the image aspects relevant to the posed question.
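One simple way to embed question awareness in a vision backbone is a cross-attention block in which patch tokens attend to encoded question tokens, as sketched below; the single-block design and dimensions are assumptions, not the paper's exact integration.

```python
# Hedged illustration of question-aware visual features: patch tokens attend
# to encoded question tokens inside the vision backbone.
import torch
import torch.nn as nn


class QuestionAwareBlock(nn.Module):
    def __init__(self, vis_dim, txt_dim, num_heads=8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, patch_tokens, question_tokens):
        # patch_tokens: (B, P, vis_dim), question_tokens: (B, T, txt_dim)
        q_kv = self.txt_proj(question_tokens)
        attended, _ = self.cross_attn(patch_tokens, q_kv, q_kv)
        # Residual update keeps the original visual content while biasing it
        # toward regions relevant to the posed question.
        return self.norm(patch_tokens + attended)


block = QuestionAwareBlock(vis_dim=1024, txt_dim=768)
qa_patches = block(torch.randn(1, 577, 1024), torch.randn(1, 16, 768))  # (1, 577, 1024)
```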
arXiv Detail & Related papers (2024-02-08T08:03:39Z)
- Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z)