From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language
Models
- URL: http://arxiv.org/abs/2310.08825v3
- Date: Fri, 8 Mar 2024 02:49:12 GMT
- Title: From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language
Models
- Authors: Dongsheng Jiang, Yuchen Liu, Songlin Liu, Jin'e Zhao, Hao Zhang, Zhen
Gao, Xiaopeng Zhang, Jin Li, Hongkai Xiong
- Abstract summary: We investigate the effectiveness of different vision encoders within Large Language Models (MLLMs)
Our findings reveal that the shallow layer features of CLIP offer particular advantages for fine-grained tasks such as grounding and region understanding.
We propose a simple yet effective feature merging strategy, named COMM, that integrates CLIP and DINO with Multi-level features Merging.
- Score: 36.41816380074965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal Large Language Models (MLLMs) have made significant strides in
expanding the capabilities of Large Language Models (LLMs) through the
incorporation of visual perception interfaces. Despite the emergence of
exciting applications and the availability of diverse instruction tuning data,
existing approaches often rely on CLIP or its variants as the visual branch,
and merely extract features from the deep layers. However, these methods lack a
comprehensive analysis of the visual encoders in MLLMs. In this paper, we
conduct an extensive investigation into the effectiveness of different vision
encoders within MLLMs. Our findings reveal that the shallow layer features of
CLIP offer particular advantages for fine-grained tasks such as grounding and
region understanding. Surprisingly, the vision-only model DINO, which is not
pretrained with text-image alignment, demonstrates promising performance as a
visual branch within MLLMs. By simply equipping it with an MLP layer for
alignment, DINO surpasses CLIP in fine-grained related perception tasks.
Building upon these observations, we propose a simple yet effective feature
merging strategy, named COMM, that integrates CLIP and DINO with Multi-level
features Merging, to enhance the visual capabilities of MLLMs. We evaluate COMM
through comprehensive experiments on a wide range of benchmarks, including
image captioning, visual question answering, visual grounding, and object
hallucination. Experimental results demonstrate the superior performance of
COMM compared to existing methods, showcasing its enhanced visual capabilities
within MLLMs.
Related papers
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - Visualization Literacy of Multimodal Large Language Models: A Comparative Study [12.367399155606162]
multimodal large language models (MLLMs) combine the inherent power of large language models (LLMs) with the renewed capabilities to reason about the multimodal context.
Many recent works in visualization have demonstrated MLLMs' capability to understand and interpret visualization results and explain the content of the visualization to users in natural language.
In this work, we aim to fill the gap by utilizing the concept of visualization literacy to evaluate MLLMs.
arXiv Detail & Related papers (2024-06-24T17:52:16Z) - Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z) - Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [50.77984109941538]
Our research reveals that the visual capabilities in recent multimodal LLMs still exhibit systematic shortcomings.
We identify ''CLIP-blind pairs'' - images that CLIP perceives as similar despite their clear visual differences.
We evaluate various CLIP-based vision-and-language models and found a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs.
arXiv Detail & Related papers (2024-01-11T18:58:36Z) - Incorporating Visual Experts to Resolve the Information Loss in
Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z) - Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks.
MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z) - InfMLLM: A Unified Framework for Visual-Language Tasks [44.29407348046122]
multimodal large language models (MLLMs) have attracted growing interest.
This work delves into enabling LLMs to tackle more vision-language-related tasks.
InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
arXiv Detail & Related papers (2023-11-12T09:58:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.