Language Grounded QFormer for Efficient Vision Language Understanding
- URL: http://arxiv.org/abs/2311.07449v1
- Date: Mon, 13 Nov 2023 16:30:49 GMT
- Title: Language Grounded QFormer for Efficient Vision Language Understanding
- Authors: Moulik Choraria, Nitesh Sekhar, Yue Wu, Xu Zhang, Prateek Singhal, Lav
R. Varshney
- Abstract summary: We take inspiration from the Query Transformer (QFormer) approach proposed in BLIP-2 models for bridging frozen modalities.
We propose a more efficient method for QFormer-based vision-language alignment.
- Score: 25.432918254523344
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale pretraining and instruction tuning have been successful for
training general-purpose language models with broad competencies. However,
extending to general-purpose vision-language models is challenging due to the
distributional diversity in visual inputs. A recent line of work explores
vision-language instruction tuning, taking inspiration from the Query
Transformer (QFormer) approach proposed in BLIP-2 models for bridging frozen
modalities. However, these approaches rely heavily on large-scale multi-modal
pretraining for representation learning before eventual finetuning, incurring a
huge computational overhead, poor scaling, and limited accessibility. To that
end, we propose a more efficient method for QFormer-based vision-language
alignment and demonstrate the effectiveness of our strategy compared to
existing baselines in improving the efficiency of vision-language pretraining.
Related papers
- MLAN: Language-Based Instruction Tuning Improves Zero-Shot Generalization of Multimodal Large Language Models [79.0546136194314]
We present a novel instruction tuning recipe to improve the zero-shot task generalization of multimodal large language models.
We evaluate the performance of the proposed approach on 9 unseen datasets across both language and vision modalities.
arXiv Detail & Related papers (2024-11-15T20:09:59Z) - PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter [21.45490901191175]
PaLM2-VAdapter employs a progressively aligned language model as the vision-language adapter.
Our method achieves these advancements with 3070% fewer parameters than the state-of-the-art large vision-language models.
arXiv Detail & Related papers (2024-02-16T18:54:47Z) - Concept-Guided Prompt Learning for Generalization in Vision-Language
Models [33.361744437967126]
We propose Concept-Guided Prompt Learning for vision-language models.
We leverage the well-learned knowledge of Contrastive Language-Image Pretraining to create a visual concept cache.
In order to refine the text features, we develop a projector that transforms multi-level visual features into text features.
arXiv Detail & Related papers (2024-01-15T04:04:47Z) - Expedited Training of Visual Conditioned Language Generation via
Redundancy Reduction [61.16125290912494]
$textEVL_textGen$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z) - Bootstrapping Vision-Language Learning with Decoupled Language
Pre-training [46.570154746311935]
We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language pre-training.
Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features.
Our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task.
arXiv Detail & Related papers (2023-07-13T21:08:15Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for
Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z) - Pre-training image-language transformers for open-vocabulary tasks [53.446599611203474]
We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks.
We explore both the use of image-text captioning data in pre-training, which does not need additional supervision, as well as object-aware strategies to pre-train the model.
We evaluate the method on a number of textgenerative vision+language tasks, such as Visual Question Answering, visual entailment and captioning, and demonstrate large gains over standard pre-training methods.
arXiv Detail & Related papers (2022-09-09T16:11:11Z) - A Simple Long-Tailed Recognition Baseline via Vision-Language Model [92.2866546058082]
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
arXiv Detail & Related papers (2021-11-29T17:49:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.