Expedited Training of Visual Conditioned Language Generation via
Redundancy Reduction
- URL: http://arxiv.org/abs/2310.03291v3
- Date: Wed, 21 Feb 2024 09:36:15 GMT
- Title: Expedited Training of Visual Conditioned Language Generation via
Redundancy Reduction
- Authors: Yiren Jian, Tingkai Liu, Yunzhe Tao, Chunhui Zhang, Soroush Vosoughi,
Hongxia Yang
- Abstract summary: $\text{EVL}_{\text{Gen}}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
- Score: 61.16125290912494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce $\text{EVL}_{\text{Gen}}$, a streamlined
framework designed for the pre-training of visually conditioned language
generation models with high computational demands, utilizing frozen pre-trained
large language models (LLMs). The conventional approach in vision-language
pre-training (VLP) typically involves a two-stage optimization process: an
initial, resource-intensive phase dedicated to general-purpose vision-language
representation learning, focused on extracting and consolidating relevant
visual features, followed by a second phase that emphasizes
end-to-end alignment between the visual and linguistic modalities. Our novel
one-stage, single-loss framework bypasses the computationally demanding first
training stage by gradually merging similar visual tokens during training,
while avoiding model collapse caused by single-stage training of BLIP-2 type
models. The gradual merging process effectively condenses visual information
while preserving semantic richness, resulting in rapid convergence without
compromising performance. Our experimental findings demonstrate that our
approach accelerates the training of vision-language models by a factor of 5
without a noticeable impact on overall performance. Furthermore, we illustrate
that our models significantly narrow the performance gap to current
vision-language models using only 1/10 of the data. Finally, we showcase how
our image-text models can seamlessly adapt to video-conditioned language
generation tasks through novel soft attentive temporal token contextualizing
modules. Code is available at \url{https://github.com/yiren-jian/EVLGen}.
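The central efficiency mechanism in the abstract is the gradual merging of similar visual tokens during training. The snippet below is a minimal sketch written under my own assumptions (the function name `merge_similar_tokens`, the per-step merge count `r`, and the averaging rule are illustrative, not taken from the EVLGen code); it only shows how redundant tokens might be condensed, in the spirit of bipartite token merging, while keeping the sequence differentiable.

```python
# Minimal sketch (an assumption, not the authors' code): condense a sequence of
# visual tokens by averaging the r most similar token pairs, in the spirit of
# bipartite token merging.
import torch
import torch.nn.functional as F


def merge_similar_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """x: (batch, num_tokens, dim) visual tokens -> (batch, num_tokens - r, dim)."""
    b, n, d = x.shape
    # Split tokens into two alternating sets A and B, and match every A token to
    # its most similar B token (bipartite matching keeps this step cheap).
    a, bset = x[:, ::2], x[:, 1::2]
    sim = F.normalize(a, dim=-1) @ F.normalize(bset, dim=-1).transpose(1, 2)
    best_sim, best_idx = sim.max(dim=-1)          # best B match for each A token
    order = best_sim.argsort(dim=-1, descending=True)
    src_idx = order[:, :r]                        # r most redundant A tokens: merge away
    keep_idx = order[:, r:]                       # remaining A tokens: keep as-is
    dst_idx = best_idx.gather(1, src_idx)         # where the merged tokens land in B

    a_keep = a.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    a_src = a.gather(1, src_idx.unsqueeze(-1).expand(-1, -1, d))
    merged_b = bset.scatter_reduce(1, dst_idx.unsqueeze(-1).expand(-1, -1, d),
                                   a_src, reduce="mean", include_self=True)
    return torch.cat([a_keep, merged_b], dim=1)   # (b, n - r, d), still differentiable


# Example: shrink 256 ViT patch tokens by 32 in one merging step.
tokens = torch.randn(2, 256, 768)
print(merge_similar_tokens(tokens, r=32).shape)   # torch.Size([2, 224, 768])
```

In an actual training loop, a step of this kind would presumably be applied between transformer blocks, with the merge count scheduled so that the visual sequence shrinks gradually rather than all at once.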
Related papers
- Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension [21.500920290909843]
We propose a new pretraining paradigm for Large Language Models (LLMs) to enhance their visual comprehension capabilities.
Specifically, we design a dynamically learnable prompt token pool and employ the Hungarian algorithm to replace part of the original visual tokens with the most relevant prompt tokens.
We present a new foundation model called Croc, which achieves new state-of-the-art performance on massive vision-language benchmarks.
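As a rough illustration of the matching step summarized above (my own sketch with assumed shapes, not code from Croc), SciPy's Hungarian solver can pair visual tokens with a learnable prompt pool, after which the k best-matched visual tokens are swapped out for their prompts.

```python
# Hypothetical sketch of the matching step (my reading of the summary, not Croc's
# code): Hungarian matching between visual tokens and a learnable prompt pool,
# then the k best-matched visual tokens are replaced by their matched prompts.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

num_visual, num_prompts, dim, k = 196, 64, 768, 32               # assumed sizes
visual_tokens = torch.randn(num_visual, dim)
prompt_pool = torch.nn.Parameter(torch.randn(num_prompts, dim))  # learnable pool

# Cost = negative cosine similarity; the solver minimizes total cost.
v = F.normalize(visual_tokens, dim=-1)
p = F.normalize(prompt_pool, dim=-1)
cost = (-(v @ p.t())).detach().cpu().numpy()              # (num_visual, num_prompts)

row_idx, col_idx = linear_sum_assignment(cost)            # optimal one-to-one pairs
order = cost[row_idx, col_idx].argsort()[:k]              # k strongest matches
replaced = visual_tokens.clone()
replaced[row_idx[order]] = prompt_pool[col_idx[order]]    # swap in the prompt tokens
print(replaced.shape)                                     # torch.Size([196, 768])
```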
arXiv Detail & Related papers (2024-10-18T09:44:25Z)
- Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning [78.19528555505961]
We propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data.
The training objective can be decomposed into two basic tasks: 1) contrastive learning between visual representation and preceding context, and 2) generating subsequent text based on visual representation.
Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets, but can also leverage interleaved pre-training data.
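The two-part objective summarized above can be sketched as a contrastive term plus a captioning-style term. The snippet below uses random placeholder embeddings and logits purely for illustration; the temperature, weighting, and tensor shapes are my assumptions, not the LCL implementation.

```python
# Illustrative sketch of a two-part objective of this kind (placeholder tensors and
# an assumed weighting/temperature, not the LCL implementation).
import torch
import torch.nn.functional as F

batch, dim, vocab, seq = 8, 512, 32000, 16
img_emb = F.normalize(torch.randn(batch, dim, requires_grad=True), dim=-1)
prev_ctx_emb = F.normalize(torch.randn(batch, dim, requires_grad=True), dim=-1)
next_text_logits = torch.randn(batch, seq, vocab, requires_grad=True)  # from the LM head
next_text_ids = torch.randint(0, vocab, (batch, seq))

# 1) Contrastive term: align each image with the text that precedes it.
logits = img_emb @ prev_ctx_emb.t() / 0.07
targets = torch.arange(batch)
contrastive = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# 2) Generative term: predict the text that follows, conditioned on the image.
generation = F.cross_entropy(next_text_logits.reshape(-1, vocab), next_text_ids.reshape(-1))

loss = contrastive + generation   # relative weighting is a free hyperparameter here
print(float(loss))
```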
arXiv Detail & Related papers (2024-06-11T17:59:35Z)
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation [35.05755930636518]
We propose ViLTA, comprising two components that further help the model learn fine-grained representations from image-text pairs.
For Masked Language Modeling (MLM), we propose a cross-distillation method to generate soft labels that enhance the robustness of the model.
For Image-Text Matching (ITM), we leverage the current language encoder to synthesize hard negatives based on the context of language input.
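A minimal sketch of the first component, a cross-distilled soft-label MLM loss, is given below (assumed shapes and temperature; not ViLTA's implementation): the teacher's softened vocabulary distribution replaces the usual one-hot target at each masked position.

```python
# Rough sketch of a cross-distilled soft-label MLM term (assumed shapes and
# temperature; not ViLTA's implementation).
import torch
import torch.nn.functional as F

vocab, masked_positions, tau = 30522, 12, 2.0
student_logits = torch.randn(masked_positions, vocab, requires_grad=True)
teacher_logits = torch.randn(masked_positions, vocab)     # e.g. from a frozen or EMA teacher

soft_targets = F.softmax(teacher_logits / tau, dim=-1)    # soft labels
log_probs = F.log_softmax(student_logits / tau, dim=-1)
mlm_soft_loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * tau ** 2
print(float(mlm_soft_loss))
```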
arXiv Detail & Related papers (2023-08-31T12:46:36Z)
- Bootstrapping Vision-Language Learning with Decoupled Language Pre-training [46.570154746311935]
We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language pre-training.
Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features.
Our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task.
arXiv Detail & Related papers (2023-07-13T21:08:15Z)
- Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models [39.479912987123214]
Self-supervised learning has exhibited a notable ability to solve a wide range of visual or language understanding tasks.
We introduce Fusioner, which uses a lightweight, transformer-based fusion module to pair frozen visual representations with language concepts.
We show that the proposed fusion approach is effective for any pair of visual and language models, even those pre-trained on a corpus of uni-modal data.
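For intuition only, a lightweight fusion module of this kind could be a single cross-attention block in which language concept embeddings query frozen visual features. The class below is a hypothetical sketch; the names, dimensions, and residual feed-forward are my assumptions, not the Fusioner architecture.

```python
# Hypothetical sketch, not the Fusioner architecture: language concept embeddings
# attend over frozen visual features through one lightweight cross-attention block.
import torch
import torch.nn as nn


class LightweightFusion(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                 nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, concepts: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # concepts: (b, num_classes, dim) text embeddings; visual: (b, num_patches, dim),
        # both produced by frozen upstream encoders.
        fused, _ = self.attn(query=concepts, key=visual, value=visual)
        return fused + self.ffn(fused)


fusion = LightweightFusion()
out = fusion(torch.randn(2, 20, 512), torch.randn(2, 196, 512))
print(out.shape)   # torch.Size([2, 20, 512])
```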
arXiv Detail & Related papers (2022-10-27T02:57:26Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- A Simple Long-Tailed Recognition Baseline via Vision-Language Model [92.2866546058082]
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
arXiv Detail & Related papers (2021-11-29T17:49:24Z)
- Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observation: pre-trained models exhibit a propensity for attending to text rather than to images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)