Zero-shot Visual Question Answering with Language Model Feedback
- URL: http://arxiv.org/abs/2305.17006v1
- Date: Fri, 26 May 2023 15:04:20 GMT
- Title: Zero-shot Visual Question Answering with Language Model Feedback
- Authors: Yifan Du, Junyi Li, Tianyi Tang, Wayne Xin Zhao, Ji-Rong Wen
- Abstract summary: We propose a language model guided captioning approach, LAMOC, for knowledge-based visual question answering (VQA).
Our approach employs the captions generated by a captioning model as the context for an answer prediction model, which is a pre-trained language model (PLM).
- Score: 83.65140324876536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel language model guided captioning approach,
LAMOC, for knowledge-based visual question answering (VQA). Our approach
employs the captions generated by a captioning model as the context for an
answer prediction model, which is a pre-trained language model (PLM). As the
major contribution, we leverage the guidance and feedback of the prediction
model to improve the capability of the captioning model. In this way, the
captioning model can become aware of the task goal and the information needs of
the PLM. To develop our approach, we design two specific training stages, where
the first stage adapts the captioning model to the prediction model (selecting
more suitable caption propositions for training) and the second stage tunes the
captioning model according to the task goal (learning from feedback of the
PLM). Extensive experiments demonstrate the effectiveness of the proposed
approach on the knowledge-based VQA task. Specifically, on the challenging
A-OKVQA dataset, LAMOC outperforms several competitive zero-shot methods and
even achieves comparable results to a fine-tuned VLP model. Our code is
publicly available at https://github.com/RUCAIBox/LAMOC.
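
The abstract outlines the pipeline only at a high level: a captioning model supplies captions as context, a frozen PLM predicts the answer, and the PLM's feedback is used to tune the captioner. The sketch below is a minimal, hypothetical illustration of how such a loop could be wired together; the class names, method signatures, and the use of PLM confidence as the reward are assumptions made for illustration, not the released LAMOC implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ScoredCandidate:
    caption: str
    answer: str
    score: float  # PLM confidence, reused here as the feedback/reward signal


class CaptioningModel:
    """Hypothetical captioner interface (placeholder, not the LAMOC release)."""

    def generate(self, image, num_captions: int = 5) -> List[str]:
        """Sample several candidate captions for the image."""
        raise NotImplementedError

    def reinforce(self, image, caption: str, reward: float) -> None:
        """Nudge the captioner toward captions that earned higher reward
        (stage 2 in the abstract: learning from the PLM's feedback)."""
        raise NotImplementedError


class AnswerPLM:
    """Hypothetical frozen answer-prediction PLM interface."""

    def answer_with_confidence(self, question: str, context: str) -> Tuple[str, float]:
        """Predict an answer given a caption as context, with a confidence score."""
        raise NotImplementedError


def zero_shot_vqa(image, question: str,
                  captioner: CaptioningModel, plm: AnswerPLM) -> str:
    """Zero-shot inference: answer using the caption the PLM is most confident about."""
    candidates = []
    for caption in captioner.generate(image):
        answer, score = plm.answer_with_confidence(question, caption)
        candidates.append(ScoredCandidate(caption, answer, score))
    return max(candidates, key=lambda c: c.score).answer


def feedback_tuning_step(image, question: str,
                         captioner: CaptioningModel, plm: AnswerPLM) -> None:
    """One illustrative update: captions the PLM finds useful receive a higher reward."""
    for caption in captioner.generate(image):
        _, score = plm.answer_with_confidence(question, caption)
        captioner.reinforce(image, caption, reward=score)
```

Under these assumptions, the abstract's first training stage could be read as filtering the sampled captions by PLM score and fine-tuning the captioner on the best ones, while the second stage corresponds to feedback_tuning_step above, where the PLM's confidence acts as a reward.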
Related papers
- Zero-shot Translation of Attention Patterns in VQA Models to Natural Language [65.94419474119162]
ZS-A2T is a framework that translates the transformer attention of a given model into natural language without requiring any training.
We consider this in the context of Visual Question Answering (VQA).
Our framework does not require any training and allows the drop-in replacement of different guiding sources.
arXiv Detail & Related papers (2023-11-08T22:18:53Z)
- Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which heuristically optimizes the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z)
- How to Adapt Pre-trained Vision-and-Language Models to a Text-only Input? [0.13706331473063876]
We focus on pre-trained multimodal vision-and-language (VL) models for which there already are some results on their language understanding capabilities.
An unresolved issue with evaluating the linguistic skills of these models is that there is no established method for adapting them to text-only input without out-of-distribution uncertainty.
Our evaluations on both GLUE and Visual Property Norms (VPN) show that care should be put into adapting VL models to zero-shot text-only tasks, while the models are less sensitive to how we adapt them to non-zero-shot tasks.
arXiv Detail & Related papers (2022-09-19T13:00:12Z)
- Enhancing Pre-trained Models with Text Structure Knowledge for Question Generation [2.526624977753083]
We model text structure as answer position and syntactic dependency, and propose answer localness modeling and syntactic mask attention to address these limitations.
Experiments on the SQuAD dataset show that our proposed two modules improve performance over the strong pre-trained model ProphetNet.
arXiv Detail & Related papers (2022-09-09T08:33:47Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- CaMEL: Mean Teacher Learning for Image Captioning [47.9708610052655]
We present CaMEL, a novel Transformer-based architecture for image captioning.
Our proposed approach leverages the interaction of two interconnected language models that learn from each other during the training phase.
Experimentally, we assess the effectiveness of the proposed solution on the COCO dataset and in conjunction with different visual feature extractors.
arXiv Detail & Related papers (2022-02-21T19:04:46Z)
- SLM: Learning a Discourse Language Representation with Sentence Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this feature of our model improves the performance of the original BERT by large margins.
arXiv Detail & Related papers (2020-10-30T13:33:41Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.