Declaration-based Prompt Tuning for Visual Question Answering
- URL: http://arxiv.org/abs/2205.02456v1
- Date: Thu, 5 May 2022 05:56:55 GMT
- Title: Declaration-based Prompt Tuning for Visual Question Answering
- Authors: Yuhang Liu, Wei Wei, Daowan Peng and Feida Zhu
- Abstract summary: We propose an innovative visual-language (VL) fine-tuning paradigm (named Declaration-based Prompt Tuning, abbreviated as DPT).
DPT jointly optimizes the objectives of pre-training and fine-tuning of the VQA model, boosting the effective adaptation of pre-trained VL models to the downstream task.
Experimental results on GQA dataset show that DPT outperforms the fine-tuned counterpart by a large margin regarding accuracy in both fully-supervised (2.68%) and zero-shot/few-shot (over 31%) settings.
- Score: 16.688288454811016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the pre-training-then-fine-tuning paradigm has yielded
immense success on a wide spectrum of cross-modal tasks, such as visual
question answering (VQA), in which a visual-language (VL) model is first
optimized via self-supervised task objectives, e.g., masked language modeling
(MLM) and image-text matching (ITM), and then fine-tuned to adapt to the
downstream task (e.g., VQA) via a brand-new objective function, e.g., answer prediction.
The inconsistency of the objective forms not only severely limits the
generalization of pre-trained VL models to downstream tasks, but also requires
a large amount of labeled data for fine-tuning. To alleviate the problem, we
propose an innovative VL fine-tuning paradigm (named Declaration-based Prompt
Tuning, abbreviated as DPT), which jointly optimizes the objectives of
pre-training and fine-tuning of the VQA model, boosting the effective adaptation
of pre-trained VL models to the downstream task. Specifically, DPT reformulates
the objective form of the VQA task via (1) textual adaptation, which converts the
given questions into declarative sentence form for prompt tuning, and (2) task
adaptation, which optimizes the objective function of the VQA problem in the
manner of the pre-training phase. Experimental results on the GQA dataset show that DPT
outperforms the fine-tuned counterpart by a large margin regarding accuracy in
both fully-supervised (2.68%) and zero-shot/few-shot (over 31%) settings. All
the data and code will be made available to facilitate future research.
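The abstract describes the two adaptation steps only at a high level, so the following is a minimal, hypothetical sketch of the idea rather than the authors' method: (1) textual adaptation rewrites a question as a declarative sentence containing a [MASK] slot, and (2) task adaptation scores candidate answers with the pre-trained MLM head instead of a new classification head. A text-only BERT masked LM stands in for the visual-language encoder here (the image input is omitted), and the question-to-declaration rule is a hard-coded example.

```python
# Hedged sketch of the DPT idea (not the authors' released code).
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def to_declaration(question: str) -> str:
    """Stand-in for textual adaptation. A real implementation would rewrite
    the question by rule or by a learned model; one example is hard-coded:
    'What color is the car?' -> 'the color of the car is [MASK].'"""
    return "the color of the car is [MASK]."

def score_answers(declaration: str, candidates: list[str]) -> dict[str, float]:
    """Task adaptation: rank single-token candidate answers by the MLM
    probability assigned to them at the [MASK] position."""
    inputs = tokenizer(declaration, return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    probs = logits.softmax(dim=-1)
    return {a: probs[tokenizer.convert_tokens_to_ids(a)].item() for a in candidates}

declaration = to_declaration("What color is the car?")
print(score_answers(declaration, ["red", "blue", "green"]))
```

In the paper's setting the encoder would be a pre-trained VL model that also consumes image features; the point of the sketch is only that answer prediction is cast as the same masked-token prediction used during pre-training.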
Related papers
- One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering [31.025439143093585]
Vision-Language Models (VLMs) have shown significant promise in Visual Question Answering (VQA) tasks by leveraging web-scale multimodal datasets.
These models often struggle with continual learning due to catastrophic forgetting when adapting to new tasks.
We propose the first data-free method that leverages the language generation capability of a VLM, instead of relying on external models.
arXiv Detail & Related papers (2024-11-04T16:04:59Z) - RAVEN: Multitask Retrieval Augmented Vision-Language Learning [5.1583788731239455]
The scaling of large language models to encode all the world's knowledge is unsustainable and has exacerbated resource barriers.
Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is underexplored.
This paper introduces RAVEN, a retrieval-augmented VLM framework that enhances base VLMs through efficient, task-specific fine-tuning.
arXiv Detail & Related papers (2024-06-27T13:08:35Z) - Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z) - Task Residual for Tuning Vision-Language Models [69.22958802711017]
We propose a new efficient tuning approach for vision-language models (VLMs) named Task Residual Tuning (TaskRes).
TaskRes explicitly decouples the prior knowledge of the pre-trained models and new knowledge regarding a target task.
The proposed TaskRes is simple yet effective, significantly outperforming previous methods on 11 benchmark datasets.
arXiv Detail & Related papers (2022-11-18T15:09:03Z) - An Empirical Study of Training End-to-End Vision-and-Language
Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
arXiv Detail & Related papers (2021-11-03T17:55:36Z) - CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models [101.5066760592534]
We present Cross-modal Prompt Tuning (CPT), a novel paradigm for tuning pre-trained vision-language models (VL-PTMs).
CPT reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap between pre-training and fine-tuning.
Comprehensive experimental results show that prompt tuned VL-PTMs outperform their fine-tuned counterparts by a large margin.
arXiv Detail & Related papers (2021-09-24T08:07:29Z) - SimVLM: Simple Visual Language Model Pretraining with Weak Supervision [48.98275876458666]
We present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM).
SimVLM reduces the training complexity by exploiting large-scale weak supervision.
It achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks.
arXiv Detail & Related papers (2021-08-24T18:14:00Z) - Contrast and Classify: Training Robust VQA Models [60.80627814762071]
We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses.
We find that optimizing both losses -- either alternately or jointly -- is key to effective training; a rough sketch of the joint form appears after this list.
arXiv Detail & Related papers (2020-10-13T00:23:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.