Related papers: Unveiling Encoder-Free Vision-Language Models

URL: http://arxiv.org/abs/2406.11832v1
Date: Mon, 17 Jun 2024 17:59:44 GMT
Title: Unveiling Encoder-Free Vision-Language Models
Authors: Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang
Abstract summary: Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks. We bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs. We launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently.
Score: 62.52803514667452
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks. However, the vision encoders set a strong inductive bias in abstracting visual representation, e.g., resolution, aspect ratio, and semantic priors, which could impede the flexibility and efficiency of the VLMs. Training pure VLMs that accept the seamless vision and language inputs, i.e., without vision encoders, remains challenging and rarely explored. Empirical observations reveal that direct training without encoders results in slow convergence and large performance gaps. In this work, we bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs. Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough experiments: (1) Bridging vision-language representation inside one unified decoder; (2) Enhancing visual recognition capability via extra supervision. With these strategies, we launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently. Notably, solely utilizing 35M publicly accessible data, EVE can impressively rival the encoder-based VLMs of similar capacities across multiple vision-language benchmarks. It significantly outperforms the counterpart Fuyu-8B with mysterious training procedures and undisclosed training data. We believe that EVE provides a transparent and efficient route for developing a pure decoder-only architecture across modalities. Our code and models are publicly available at: https://github.com/baaivision/EVE.

Related papers

Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities [54.94982467313341]
Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. We set out to understand the limitations of SoTA VLMs on fundamental visual tasks by constructing a series of tests that probe which components of design, specifically, may be lacking.
arXiv Detail & Related papers (2025-07-10T15:26:41Z)
Zero-Shot Vision Encoder Grafting via LLM Surrogates [65.37227522413689]
Vision language models (VLMs) typically pair a modestly sized vision encoder with a large language model (LLM). We construct small "surrogate models" that share the same embedding space and representation language as the large target LLM. Vision encoders trained on the surrogate can then be directly transferred to the larger model.
arXiv Detail & Related papers (2025-05-28T17:59:59Z)
BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries [37.37905881898424]
Encoder-free multimodal large language models (MLLMs) eliminate the need for a well-trained vision encoder by directly processing image tokens before the language model. The absence of a vision encoder implies that the model is likely to rely on substantial data to learn the necessary visual-semantic alignments. We present BREEN, a data-efficient encoder-free multimodal architecture that mitigates this issue.
arXiv Detail & Related papers (2025-03-16T10:43:14Z)
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models [72.07868838411474]
Existing encoder-free vision-language models (VLMs) are narrowing the performance gap with their encoder-based counterparts. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones. We show that properly and hierarchically associating vision and language within a unified model reduces interference between modalities.
arXiv Detail & Related papers (2025-02-10T18:59:58Z)
Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills. To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding. We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems. These models have been shown to be highly capable, but also lacking some basic visual understanding skills. This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models [26.88977803220915]
We propose an efficient and robust method for updating vision encoders within vision language models. Our approach selectively and locally updates encoders, leading to substantial performance improvements on data where previous mistakes occurred.
arXiv Detail & Related papers (2024-07-23T14:39:40Z)
Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm. We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information. We introduce a novel approach, wherein visual prompts are concatenated with the weights of FFN for visual knowledge injection.
arXiv Detail & Related papers (2024-05-09T08:23:20Z)
EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE [66.48689706116808]
EVE (Efficient Vision-languagE) is one unified multimodal Transformer pre-trained solely by one unified pre-training task. EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts. EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.
arXiv Detail & Related papers (2023-08-23T07:36:30Z)
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [81.32968995346775]
VisionLLM is a framework for vision-centric tasks that can be flexibly defined and managed using language instructions. Our model can achieve over 60% mAP on COCO, on par with detection-specific models.
arXiv Detail & Related papers (2023-05-18T17:59:42Z)
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks [118.49566068398642]
Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully curated vision-language datasets. Unimodal encoders are pretrained with simpler annotations that are less cost-prohibitive, achieving scales of hundreds of millions to billions. We propose Multimodal Adaptive Distillation (MAD), which adaptively distills useful knowledge from pretrained encoders to cross-modal VL encoders.
arXiv Detail & Related papers (2022-04-22T04:41:04Z)
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts [46.55920956687346]
We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Because of the modeling flexibility of the Mixture-of-Modality-Experts (MoME) Transformer, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks. We propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data besides image-text pairs.
arXiv Detail & Related papers (2021-11-03T17:20:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.