EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
- URL: http://arxiv.org/abs/2502.06788v1
- Date: Mon, 10 Feb 2025 18:59:58 GMT
- Title: EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
- Authors: Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, Xinlong Wang
- Abstract summary: Existing encoder-free vision-language models (VLMs) are narrowing the performance gap with their encoder-based counterparts.
We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones.
We show that properly decomposing and hierarchically associating vision and language within a unified model reduces interference between modalities.
- Score: 72.07868838411474
- License:
- Abstract: Existing encoder-free vision-language models (VLMs) are rapidly narrowing the performance gap with their encoder-based counterparts, highlighting the promising potential for unified multimodal systems with structural simplicity and efficient deployment. We systematically clarify the performance gap between VLMs using pre-trained vision encoders, discrete tokenizers, and minimalist visual layers from scratch, deeply excavating the under-examined characteristics of encoder-free VLMs. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones. After an in-depth investigation, we launch EVEv2.0, a new and improved family of encoder-free VLMs. We show that: (i) Properly decomposing and hierarchically associating vision and language within a unified model reduces interference between modalities. (ii) A well-designed training strategy enables effective optimization for encoder-free VLMs. Through extensive evaluation, our EVEv2.0 represents a thorough study for developing a decoder-only architecture across modalities, demonstrating superior data efficiency and strong vision-reasoning capability. Code is publicly available at: https://github.com/baaivision/EVE.
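The abstract's point (i) is architectural: instead of a separate vision encoder, vision and language share one decoder-only model but keep decomposed, modality-specific parameters. Below is a minimal PyTorch sketch of that idea, assuming shared self-attention with per-modality norms and feed-forward weights; the sizes, routing, and omitted causal mask are illustrative assumptions, not the released EVEv2 implementation (see the GitHub link above for the real code).

```python
# Minimal sketch: a decoder block where vision and text tokens share global
# self-attention but use separate norms and feed-forward weights, illustrating
# the "decompose, then associate" idea. Illustrative only, not EVEv2's code.
import torch
import torch.nn as nn

class ModalityDecomposedBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Index 0: text parameters, index 1: vision parameters.
        self.norm1 = nn.ModuleList(nn.LayerNorm(dim) for _ in range(2))
        self.norm2 = nn.ModuleList(nn.LayerNorm(dim) for _ in range(2))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(2))

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); modality: (batch, seq) with 0=text, 1=vision.
        h = torch.empty_like(x)
        for m in (0, 1):  # modality-specific pre-norms
            mask = modality == m
            h[mask] = self.norm1[m](x[mask])
        # Shared self-attention associates the two modalities
        # (causal masking omitted for brevity).
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        out = torch.empty_like(x)
        for m in (0, 1):  # modality-specific feed-forward paths
            mask = modality == m
            out[mask] = x[mask] + self.ffn[m](self.norm2[m](x[mask]))
        return out

tokens = torch.randn(2, 16, 512)                        # mixed sequence
modality = (torch.arange(16) >= 6).long().expand(2, -1)  # first 6 = text
print(ModalityDecomposedBlock()(tokens, modality).shape)  # (2, 16, 512)
```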
Related papers
- MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders [28.22099619211775]
Visual encoders are fundamental components in vision-language models (VLMs).
Recent studies incorporate multiple encoders within a single VLM, leading to a considerable increase in computational cost.
We present a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder model.
arXiv Detail & Related papers (2025-01-03T09:10:34Z)
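The MoVE-KD entry above describes distilling several vision encoders into one. A minimal multi-teacher feature-distillation sketch in that spirit follows, assuming per-teacher projection heads and a simple weighted MSE objective; MoVE-KD's actual mixture and weighting scheme is not reproduced here.

```python
# Generic multi-teacher feature distillation: project the student's features
# into each frozen teacher's space and match them. Shapes and the uniform
# weighting are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherDistillLoss(nn.Module):
    def __init__(self, student_dim: int, teacher_dims: list[int]):
        super().__init__()
        # One projection head per teacher.
        self.heads = nn.ModuleList(
            nn.Linear(student_dim, d) for d in teacher_dims)

    def forward(self, student_feats, teacher_feats, weights=None):
        # student_feats: (B, N, student_dim); teacher_feats: list of (B, N, d_i).
        if weights is None:
            weights = [1.0 / len(teacher_feats)] * len(teacher_feats)
        loss = student_feats.new_zeros(())
        for head, t, w in zip(self.heads, teacher_feats, weights):
            loss = loss + w * F.mse_loss(head(student_feats), t.detach())
        return loss

student = torch.randn(2, 196, 768, requires_grad=True)
teachers = [torch.randn(2, 196, 1024), torch.randn(2, 196, 512)]
crit = MultiTeacherDistillLoss(768, [1024, 512])
print(crit(student, teachers))  # scalar distillation loss
```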
- SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding [66.74446220401296]
We propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation.
We introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding.
Our code and models shall be released.
arXiv Detail & Related papers (2024-12-12T18:59:26Z)
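SynerGen-VL's summary names a token folding mechanism for high-resolution understanding. A common way to realize folding is to merge each k x k neighborhood of patch tokens into a single token, shortening the visual sequence by k^2; the sketch below assumes that reading, with illustrative dimensions, and is not the paper's exact module.

```python
# Token folding, sketched: concatenate each k x k neighborhood of patch
# tokens and project it back to the model width. Illustrative assumptions.
import torch
import torch.nn as nn

class TokenFolding(nn.Module):
    def __init__(self, dim: int, k: int = 2):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(dim * k * k, dim)  # fuse k*k tokens into one

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H*W, C) patch tokens laid out on a square H x W grid.
        B, N, C = x.shape
        H = W = int(N ** 0.5)
        k = self.k
        x = x.view(B, H // k, k, W // k, k, C)          # carve out k x k cells
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(
            B, (H // k) * (W // k), k * k * C)           # one row per cell
        return self.proj(x)                              # (B, N/k^2, C)

tokens = torch.randn(1, 1024, 256)       # 32 x 32 grid of patch tokens
print(TokenFolding(256)(tokens).shape)   # torch.Size([1, 256, 256])
```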
- Multimodal Autoregressive Pre-training of Large Vision Encoders [85.39154488397931]
We present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process.
Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification.
arXiv Detail & Related papers (2024-11-21T18:31:25Z)
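AIMV2's multimodal autoregressive pre-training pairs next-token prediction on text with next-patch regression on images. A hedged sketch of such a joint objective, with assumed shapes and loss balance rather than the paper's exact heads:

```python
# Joint autoregressive objective, sketched: cross-entropy on next text
# tokens plus MSE on next image patches. Names and alpha are assumptions.
import torch
import torch.nn.functional as F

def multimodal_ar_loss(text_logits, text_targets, patch_preds, patch_targets,
                       alpha: float = 1.0):
    # text_logits: (B, T, V) next-token predictions; text_targets: (B, T).
    # patch_preds / patch_targets: (B, P, D) next-patch regression pairs.
    ce = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    mse = F.mse_loss(patch_preds, patch_targets)
    return ce + alpha * mse

loss = multimodal_ar_loss(
    torch.randn(2, 8, 1000), torch.randint(0, 1000, (2, 8)),
    torch.randn(2, 16, 768), torch.randn(2, 16, 768))
print(loss)  # scalar joint pre-training loss
```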
- Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also to lack some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
- Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models [26.88977803220915]
We propose an efficient and robust method for updating vision encoders within vision language models.
Our approach selectively and locally updates encoders, leading to substantial performance improvements on data where previous mistakes occurred.
arXiv Detail & Related papers (2024-07-23T14:39:40Z)
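The summary above only states that encoder updates are "selective and local". One generic way to realize that, sketched below, is to rank encoder blocks by gradient magnitude on previously-failed examples and fine-tune only the top-k; this recipe is an illustration under stated assumptions, not the paper's actual procedure.

```python
# Illustrative selective update: freeze everything except the blocks whose
# parameters receive the largest gradients on a set of past mistakes.
import torch
import torch.nn as nn

def select_blocks_to_update(encoder: nn.Module, loss: torch.Tensor, k: int = 2):
    loss.backward()  # gradients w.r.t. encoder parameters on the failure set
    scores: dict[str, float] = {}
    for name, p in encoder.named_parameters():
        block = name.split(".")[0]  # group parameters by top-level block
        if p.grad is not None:
            scores[block] = scores.get(block, 0.0) + p.grad.norm().item()
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    for name, p in encoder.named_parameters():
        p.requires_grad_(name.split(".")[0] in top)  # freeze the rest
    return top

encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
loss = encoder(torch.randn(4, 16)).pow(2).mean()  # stand-in "mistake" loss
print(select_blocks_to_update(encoder, loss))     # e.g. ['2', '0']
```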
- Unveiling Encoder-Free Vision-Language Models [62.52803514667452]
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features, followed by large language models (LLMs) for visual-language tasks.
We bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs.
We launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently.
arXiv Detail & Related papers (2024-06-17T17:59:44Z)
- BRAVE: Broadening the visual encoding of vision-language models [48.41146184575914]
Vision-language models (VLMs) are composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks.
Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders.
We introduce BRAVE, which consolidates features from multiple frozen encoders into a more versatile representation that can be directly fed as the input to a frozen LM.
arXiv Detail & Related papers (2024-04-10T17:59:45Z)
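BRAVE consolidates features from multiple frozen encoders into one representation for a frozen LM. A minimal resampler sketch in that spirit: project each encoder's features to a shared width, concatenate them, and compress with learnable queries via cross-attention. The single-layer design and all dimensions are assumptions, not BRAVE's architecture verbatim.

```python
# Sketch: fuse features from several frozen vision encoders into a
# fixed-length sequence suitable as input to a frozen language model.
import torch
import torch.nn as nn

class MultiEncoderResampler(nn.Module):
    def __init__(self, encoder_dims: list[int], lm_dim: int = 768,
                 num_queries: int = 32):
        super().__init__()
        # Per-encoder projections into the LM's embedding width.
        self.projs = nn.ModuleList(nn.Linear(d, lm_dim) for d in encoder_dims)
        self.queries = nn.Parameter(torch.randn(num_queries, lm_dim) * 0.02)
        self.xattn = nn.MultiheadAttention(lm_dim, 8, batch_first=True)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats[i]: (B, N_i, encoder_dims[i]) from the i-th frozen encoder.
        mixed = torch.cat([p(f) for p, f in zip(self.projs, feats)], dim=1)
        q = self.queries.expand(mixed.size(0), -1, -1)
        out, _ = self.xattn(q, mixed, mixed, need_weights=False)
        return out  # (B, num_queries, lm_dim), fed to the frozen LM

feats = [torch.randn(2, 196, 1024), torch.randn(2, 256, 1152)]
print(MultiEncoderResampler([1024, 1152])(feats).shape)  # (2, 32, 768)
```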
- Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks [118.49566068398642]
Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully curated vision-language datasets.
Unimodal encoders are pretrained with simpler, less costly annotations, achieving scales of hundreds of millions to billions of examples.
We propose Multimodal Adaptive Distillation (MAD), which adaptively distills useful knowledge from pretrained encoders to cross-modal VL encoders.
arXiv Detail & Related papers (2022-04-22T04:41:04Z)
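MAD's summary hinges on distilling "adaptively" from pretrained teachers. Below is a minimal sketch of one such scheme, weighting the per-example distillation term by teacher confidence; this gating heuristic is an assumption for illustration, not MAD's actual criterion.

```python
# Adaptive distillation, sketched: the KL term is scaled per example by how
# confident the teacher is, so unhelpful teacher signals are downweighted.
import torch
import torch.nn.functional as F

def adaptive_distill_loss(student_logits, teacher_logits, targets, T=2.0):
    # Per-example task loss for the student.
    task = F.cross_entropy(student_logits, targets, reduction="none")
    # Temperature-softened KL between teacher and student distributions.
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="none").sum(-1) * T * T
    # Adaptive weight: trust the teacher more where it is confident.
    w = F.softmax(teacher_logits, dim=-1).max(dim=-1).values.detach()
    return (task + w * kl).mean()

s = torch.randn(4, 10, requires_grad=True)   # student logits
t = torch.randn(4, 10)                       # frozen teacher logits
print(adaptive_distill_loss(s, t, torch.randint(0, 10, (4,))))
```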
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.