Bootstrapping Vision-Language Learning with Decoupled Language
Pre-training
- URL: http://arxiv.org/abs/2307.07063v4
- Date: Tue, 19 Dec 2023 22:13:30 GMT
- Title: Bootstrapping Vision-Language Learning with Decoupled Language
Pre-training
- Authors: Yiren Jian, Chongyang Gao, Soroush Vosoughi
- Abstract summary: We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language pre-training.
Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features.
Our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task.
- Score: 46.570154746311935
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a novel methodology aimed at optimizing the application of frozen
large language models (LLMs) for resource-intensive vision-language (VL)
pre-training. The current paradigm uses visual features as prompts to guide
language models, with a focus on determining the most relevant visual features
for corresponding text. Our approach diverges by concentrating on the language
component, specifically identifying the optimal prompts to align with visual
features. We introduce the Prompt-Transformer (P-Former), a model that predicts
these ideal prompts, which is trained exclusively on linguistic data, bypassing
the need for image-text pairings. This strategy subtly bifurcates the
end-to-end VL training process into an additional, separate stage. Our
experiments reveal that our framework significantly enhances the performance of
a robust image-to-text baseline (BLIP-2), and effectively narrows the
performance gap between models trained with either 4M or 129M image-text pairs.
Importantly, our framework is modality-agnostic and flexible in terms of
architectural design, as validated by its successful application in a video
learning task using varied base modules. The code will be made available at
https://github.com/yiren-jian/BLIText.
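
To make the decoupling described in the abstract concrete, the sketch below shows one way a "P-Former"-style prompt predictor could be trained on text alone and then reused as an alignment target for visual features. This is a minimal, hypothetical PyTorch sketch: the class name PFormer, the MLP predictor, the hidden sizes, and the MSE alignment loss are all assumptions for illustration and do not reproduce the authors' actual architecture or objectives.

```python
# Minimal sketch (assumptions, not the paper's code): a prompt predictor
# trained from text features only, whose predicted prompts later serve as
# regression targets when aligning visual features to a frozen LLM's
# prompt space.
import torch
import torch.nn as nn

class PFormer(nn.Module):
    """Predicts a fixed number of soft prompt vectors from a text embedding
    (illustrative stand-in for the paper's Prompt-Transformer)."""
    def __init__(self, text_dim=768, prompt_len=32, prompt_dim=768):
        super().__init__()
        self.prompt_len = prompt_len
        self.prompt_dim = prompt_dim
        self.proj = nn.Sequential(
            nn.Linear(text_dim, prompt_dim),
            nn.GELU(),
            nn.Linear(prompt_dim, prompt_len * prompt_dim),
        )

    def forward(self, text_emb):                  # (B, text_dim)
        prompts = self.proj(text_emb)             # (B, prompt_len * prompt_dim)
        return prompts.view(-1, self.prompt_len, self.prompt_dim)

def alignment_loss(visual_prompts, target_prompts):
    """Stage-2 objective (assumed): pull visual prompts toward the
    text-predicted prompts, which are treated as fixed targets."""
    return nn.functional.mse_loss(visual_prompts, target_prompts.detach())

if __name__ == "__main__":
    B, text_dim, prompt_len, prompt_dim = 4, 768, 32, 768
    pformer = PFormer(text_dim, prompt_len, prompt_dim)

    # Stage 1 (text only): in the paper, the predicted prompts would be fed
    # to a frozen LLM and trained with a language objective; here we only
    # show the prediction step on pooled caption features.
    text_emb = torch.randn(B, text_dim)
    target_prompts = pformer(text_emb)

    # Stage 2 (image-text): a vision encoder / adapter produces visual
    # prompts that are regressed toward the text-predicted targets.
    visual_prompts = torch.randn(B, prompt_len, prompt_dim, requires_grad=True)
    loss = alignment_loss(visual_prompts, target_prompts)
    loss.backward()
    print("alignment loss:", loss.item())
```

In the paper itself, the predicted prompts act as alignment targets within a BLIP-2-style image-to-text pipeline; the sketch above only illustrates the two-stage decoupling (text-only prompt prediction, then visual alignment), not the actual model or training losses.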
Related papers
- A Chain-of-Thought Subspace Meta-Learning for Few-shot Image Captioning with Large Vision and Language Models [17.144311122664508]
A large-scale vision and language model pretrained on massive data encodes visual and linguistic priors.
We propose a chain-of-thought (CoT) meta-learning scheme as a multi-step image captioning procedure to better imitate how humans describe images.
arXiv Detail & Related papers (2025-02-19T18:35:43Z)
- ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
Multiple instance learning (MIL)-based frameworks have become the mainstream approach for processing whole slide images (WSIs).
We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
- DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment [20.953645420787527]
We train a CLIP-like model at only a fraction of CLIP's computational cost.
We achieve state-of-the-art results in zero-shot classification and open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2024-12-20T20:46:48Z)
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- UNIMO-2: End-to-End Unified Vision-Language Grounded Learning [46.914284894632]
We propose an end-to-end unified-modal pre-training framework, namely UNIMO-2.
We build a unified Transformer model to jointly learn visual representations, textual representations and semantic alignment between images and texts.
Our code and models are public at the UNIMO project page.
arXiv Detail & Related papers (2022-03-17T03:53:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented here and is not responsible for any consequences of its use.