Bootstrapping Vision-Language Learning with Decoupled Language
Pre-training
- URL: http://arxiv.org/abs/2307.07063v4
- Date: Tue, 19 Dec 2023 22:13:30 GMT
- Title: Bootstrapping Vision-Language Learning with Decoupled Language
Pre-training
- Authors: Yiren Jian, Chongyang Gao, Soroush Vosoughi
- Abstract summary: We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language pre-training.
Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features.
Our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task.
- Score: 46.570154746311935
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a novel methodology aimed at optimizing the application of frozen
large language models (LLMs) for resource-intensive vision-language (VL)
pre-training. The current paradigm uses visual features as prompts to guide
language models, with a focus on determining the most relevant visual features
for corresponding text. Our approach diverges by concentrating on the language
component, specifically identifying the optimal prompts to align with visual
features. We introduce the Prompt-Transformer (P-Former), a model that predicts
these ideal prompts, which is trained exclusively on linguistic data, bypassing
the need for image-text pairings. This strategy subtly bifurcates the
end-to-end VL training process into an additional, separate stage. Our
experiments reveal that our framework significantly enhances the performance of
a robust image-to-text baseline (BLIP-2), and effectively narrows the
performance gap between models trained with either 4M or 129M image-text pairs.
Importantly, our framework is modality-agnostic and flexible in terms of
architectural design, as validated by its successful application in a video
learning task using varied base modules. The code will be made available at
https://github.com/yiren-jian/BLIText.
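The abstract describes a two-stage recipe: the Prompt-Transformer (P-Former) is first trained on text alone to predict the prompts a frozen LLM should receive, and the vision side is then trained so its features match those predicted prompts. The sketch below illustrates that decoupling only; the learned-query design, the module sizes, and the MSE alignment loss are assumptions for illustration, not the authors' released P-Former/BLIP-2 code.

```python
# Minimal sketch of decoupled language pre-training (illustrative, not the official BLIText code).
import torch
import torch.nn as nn

class PFormerSketch(nn.Module):
    """Predicts a fixed set of prompt vectors from text tokens alone (no images)."""
    def __init__(self, vocab_size=30522, d_model=768, n_prompts=32, n_layers=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Learned queries, one per prompt slot, in the spirit of query-based adapters.
        self.queries = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, text_ids):                      # text_ids: (B, T)
        memory = self.tok_emb(text_ids)               # text-only conditioning
        q = self.queries.unsqueeze(0).expand(text_ids.size(0), -1, -1)
        return self.decoder(q, memory)                # (B, n_prompts, d_model)

# Stage 1 (language only): train the P-Former so that, given a caption, its predicted
# prompts let a frozen LLM reconstruct that caption; no image-text pairs are needed.
# Stage 2 (vision-language): train a vision encoder/adapter so its output prompts match
# what the P-Former predicts for the paired caption, e.g. with a simple alignment loss:
def prompt_alignment_loss(visual_prompts, predicted_prompts):
    return nn.functional.mse_loss(visual_prompts, predicted_prompts)

pformer = PFormerSketch()
dummy_ids = torch.randint(0, 30522, (2, 16))          # a toy batch of caption token ids
print(pformer(dummy_ids).shape)                       # torch.Size([2, 32, 768])
```

Under this reading, Stage 1 consumes only captions, which is what lets the extra pre-training stage be run without image-text pairings before the usual VL training begins.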
Related papers
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation [35.05755930636518]
We propose ViLTA, comprising two components that further help the model learn fine-grained representations of image-text pairs.
For Masked Language Modeling (MLM), we propose a cross-distillation method to generate soft labels that enhance the robustness of the model.
For Image-Text Matching (ITM), we leverage the current language encoder to synthesize hard negatives based on the context of the language input (see the hedged sketch after this list).
arXiv Detail & Related papers (2023-08-31T12:46:36Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- UNIMO-2: End-to-End Unified Vision-Language Grounded Learning [46.914284894632]
We propose an end-to-end unified-modal pre-training framework, namely UNIMO-2.
We build a unified Transformer model to jointly learn visual representations, textual representations and semantic alignment between images and texts.
Our code and models are public at the UNIMO project page.
arXiv Detail & Related papers (2022-03-17T03:53:11Z)
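As referenced in the ViLTA entry above: that abstract mentions synthesizing hard negatives for image-text matching from the language input itself. The following is a hedged sketch of one way such hard negatives could be produced with an off-the-shelf masked language model; the model choice ("bert-base-uncased"), the single-token replacement strategy, and the helper name are illustrative assumptions, not ViLTA's exact procedure.

```python
# Illustrative hard-negative caption synthesis via an MLM (not ViLTA's exact method).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def synthesize_hard_negative(caption: str, position: int) -> str:
    """Mask one token and replace it with a plausible but different MLM prediction."""
    enc = tokenizer(caption, return_tensors="pt")
    original_id = enc["input_ids"][0, position].item()
    enc["input_ids"][0, position] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = mlm(**enc).logits[0, position]
    logits[original_id] = float("-inf")   # forbid copying the original token back
    enc["input_ids"][0, position] = int(logits.argmax())
    return tokenizer.decode(enc["input_ids"][0], skip_special_tokens=True)

# e.g. "a man riding a red bicycle" -> the color word is swapped, giving a caption
# that still reads fluently but no longer matches the image, which forces the ITM
# head to learn fine-grained grounding rather than topic-level matching.
print(synthesize_hard_negative("a man riding a red bicycle", position=5))
```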