A Thousand Words Are Worth More Than a Picture: Natural Language-Centric
Outside-Knowledge Visual Question Answering
- URL: http://arxiv.org/abs/2201.05299v1
- Date: Fri, 14 Jan 2022 04:12:46 GMT
- Title: A Thousand Words Are Worth More Than a Picture: Natural Language-Centric
Outside-Knowledge Visual Question Answering
- Authors: Feng Gao, Qing Ping, Govind Thattai, Aishwarya Reganti, Ying Nian Wu,
Prem Natarajan
- Abstract summary: We call for a paradigm shift for the OK-VQA task, which transforms the image into plain text.
A Transform-Retrieve-Generate framework (TRiG) is proposed, which can be plug-and-played with alternative image-to-text models.
Experimental results show that our TRiG framework outperforms all state-of-the-art supervised methods by at least 11.1% absolute margin.
- Score: 47.1063091195119
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Outside-knowledge visual question answering (OK-VQA) requires the agent to
comprehend the image, make use of relevant knowledge from the entire web, and
digest all the information to answer the question. Most previous works address
the problem by first fusing the image and question in the multi-modal space,
which is inflexible for further fusion with a vast amount of external
knowledge. In this paper, we call for a paradigm shift for the OK-VQA task,
which transforms the image into plain text, so that we can enable knowledge
passage retrieval and generative question-answering in the natural language
space. This paradigm takes advantage of the sheer volume of gigantic knowledge
bases and the richness of pre-trained language models. A
Transform-Retrieve-Generate framework (TRiG) is proposed, which can
be plug-and-played with alternative image-to-text models and textual knowledge
bases. Experimental results show that our TRiG framework outperforms all
state-of-the-art supervised methods by at least 11.1% absolute margin.
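As a rough illustration of the Transform-Retrieve-Generate idea, the sketch below chains an off-the-shelf image captioner, a dense text retriever over a toy knowledge base, and a generative reader. The specific checkpoints (BLIP captioner, MiniLM sentence encoder, FLAN-T5 reader) and the three-passage knowledge base are illustrative assumptions, not the components evaluated in the paper.

```python
# Minimal sketch of a Transform-Retrieve-Generate (TRiG) style pipeline.
# Checkpoints and the toy knowledge base are illustrative choices only.
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

# 1) Transform: turn the image into plain text (e.g. a caption).
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# 2) Retrieve: rank knowledge passages against the caption + question.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
knowledge_base = [
    "Bananas are rich in potassium.",
    "The Eiffel Tower is located in Paris, France.",
    "Golden retrievers are a breed of dog originally from Scotland.",
]

# 3) Generate: answer the question in natural-language space.
generator = pipeline("text2text-generation", model="google/flan-t5-base")


def answer(image_path: str, question: str, top_k: int = 2) -> str:
    caption = captioner(image_path)[0]["generated_text"]

    query_emb = encoder.encode(f"{caption} {question}", convert_to_tensor=True)
    passage_emb = encoder.encode(knowledge_base, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, passage_emb)[0]
    k = min(top_k, len(knowledge_base))
    top_passages = [knowledge_base[int(i)] for i in scores.topk(k).indices]

    prompt = f"question: {question} context: {caption} " + " ".join(top_passages)
    return generator(prompt, max_new_tokens=20)[0]["generated_text"]


if __name__ == "__main__":
    print(answer("dog.jpg", "Which country does this breed come from?"))
```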
Related papers
- Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs).
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval in an autoregressive generation way and propose an autonomous decision module to choose the best-matched one between generated and retrieved images.
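The abstract does not spell out how the decision module chooses; as one hedged reading, the snippet below scores the generated and the retrieved image against the prompt with an off-the-shelf CLIP model and keeps the higher-scoring one. The CLIP-based scoring is a stand-in assumption, not necessarily the paper's mechanism.

```python
# Toy stand-in for a "choose between generated and retrieved" decision step,
# using CLIP image-text similarity; the paper's actual module may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def pick_best(prompt: str, generated: Image.Image, retrieved: Image.Image) -> Image.Image:
    inputs = processor(text=[prompt], images=[generated, retrieved],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text has shape (1, 2): similarity of the prompt to each image.
    scores = out.logits_per_text[0]
    return generated if scores[0] >= scores[1] else retrieved
```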
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
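A minimal sketch of such a zoom-in chain of thought is given below, assuming a user-supplied `ask_lmm` callable as a stand-in for whatever multimodal model is plugged in; the prompts and the two-stage flow are illustrative rather than TextCoT's exact procedure.

```python
# Sketch of a zoom-in style chain of thought for text-rich images.
# `ask_lmm` is a hypothetical callable (image, prompt) -> str supplied by the user.
from typing import Callable
from PIL import Image


def zoom_in_cot(image: Image.Image, question: str,
                ask_lmm: Callable[[Image.Image, str], str]) -> str:
    # Stage 1: coarse pass -- describe the image and localize the relevant region.
    overview = ask_lmm(image, "Briefly describe this image and the text it contains.")
    box_reply = ask_lmm(
        image,
        f"{overview}\nQuestion: {question}\n"
        "Return the bounding box x1,y1,x2,y2 of the region needed to answer.",
    )
    x1, y1, x2, y2 = (int(v) for v in box_reply.split(","))

    # Stage 2: fine pass -- crop and enlarge the region, then answer.
    crop = image.crop((x1, y1, x2, y2)).resize((image.width, image.height))
    return ask_lmm(crop, f"{overview}\nQuestion: {question}\nAnswer concisely.")
```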
arXiv Detail & Related papers (2024-04-15T13:54:35Z)
- Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge [10.074327344317116]
We propose Q&A Prompts to equip AI models with robust cross-modality reasoning ability.
We first use the image-answer pairs and the corresponding questions in a training set as inputs and outputs to train a visual question generation model.
We then use an image tagging model to identify various instances and send packaged image-tag pairs into the visual question generation model to generate relevant questions with the extracted image tags as answers.
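The mining loop described above can be sketched as follows, with `tag_image` and `generate_question` as hypothetical stand-ins for the image tagging model and the trained visual question generation model.

```python
# Sketch of the Q&A-prompt mining loop: tag the image, then ask a trained
# visual-question-generation (VQG) model to produce a question whose answer is
# each tag. The two callables are placeholders, not the paper's actual models.
from typing import Callable, List, Tuple
from PIL import Image


def mine_qa_prompts(
    image: Image.Image,
    tag_image: Callable[[Image.Image], List[str]],
    generate_question: Callable[[Image.Image, str], str],
) -> List[Tuple[str, str]]:
    qa_prompts = []
    for tag in tag_image(image):                  # e.g. ["dog", "frisbee", "park"]
        question = generate_question(image, tag)  # a question whose answer is `tag`
        qa_prompts.append((question, tag))
    return qa_prompts
```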
arXiv Detail & Related papers (2024-01-19T14:22:29Z)
- UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding [84.83494254263138]
We propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning.
Our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR.
arXiv Detail & Related papers (2023-07-03T09:03:12Z)
- Combo of Thinking and Observing for Outside-Knowledge VQA [13.838435454270014]
Outside-knowledge visual question answering is a challenging task that requires both the acquisition and the use of open-ended real-world knowledge.
In this paper, we are inspired to constrain the cross-modality space into the same space as the natural-language space.
We propose a novel framework consisting of a multimodal encoder, a textual encoder and an answer decoder.
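A schematic of how these three components might be wired together is sketched below; the callables and the concatenation-based fusion are assumptions for illustration, not the paper's actual design.

```python
# Schematic wiring of the three named components (multimodal encoder, textual
# encoder, answer decoder); all callables are hypothetical stand-ins.
from typing import Callable, Sequence


def combo_answer(
    image, question: str, knowledge: Sequence[str],
    multimodal_encode: Callable,   # (image, question) -> feature vector
    text_encode: Callable,         # question + knowledge text -> feature vector
    decode_answer: Callable,       # fused feature vector -> answer string
) -> str:
    mm_feat = multimodal_encode(image, question)
    tx_feat = text_encode(question + " " + " ".join(knowledge))
    fused = list(mm_feat) + list(tx_feat)   # simple concatenation as a placeholder
    return decode_answer(fused)
```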
arXiv Detail & Related papers (2023-05-10T18:32:32Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
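The lookup step can be sketched as below; the toy topic-image table and the simple keyword match are illustrative assumptions, whereas the paper builds its table from existing sentence-image pairs.

```python
# Sketch of the sentence -> images lookup step; table contents are toy data.
from typing import Dict, List


def retrieve_images(sentence: str,
                    topic_table: Dict[str, List[str]],
                    max_images: int = 3) -> List[str]:
    tokens = set(sentence.lower().split())
    hits = []
    for topic, image_paths in topic_table.items():
        if topic in tokens:
            hits.extend(image_paths)
    return hits[:max_images]   # a flexible (possibly empty) number of images


toy_table = {"dog": ["imgs/dog1.jpg"], "paris": ["imgs/eiffel.jpg"]}
print(retrieve_images("a dog runs through paris", toy_table))
```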
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
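The retrieval step can be sketched as a plain kNN lookup over pre-computed image features, as below; the memory contents and the cosine-similarity scoring are assumptions, and the paper's kNN-augmented attention layer is not reproduced here.

```python
# Toy kNN memory lookup by visual similarity, illustrating only the retrieval
# step of a retrieval-augmented captioner.
from typing import List
import numpy as np


def knn_retrieve(query_feat: np.ndarray,
                 memory_feats: np.ndarray,       # (N, D) image features of the corpus
                 memory_captions: List[str],     # N captions aligned with the features
                 k: int = 3) -> List[str]:
    q = query_feat / np.linalg.norm(query_feat)
    m = memory_feats / np.linalg.norm(memory_feats, axis=1, keepdims=True)
    sims = m @ q                                  # cosine similarity to every entry
    top = np.argsort(-sims)[:k]
    return [memory_captions[i] for i in top]
```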
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering [17.51860125438028]
We propose to use a unimodal (text-only) train and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models.
Our results on a visual question answering task which requires external knowledge (OK-VQA) show that our text-only model outperforms pretrained multimodal (image-text) models with a comparable number of parameters.
arXiv Detail & Related papers (2021-09-15T14:11:29Z)
- External Knowledge Augmented Text Visual Question Answering [0.6445605125467573]
We propose a framework to extract, filter, and encode knowledge atop a standard multimodal transformer for vision language understanding tasks.
We generate results comparable to the state-of-the-art on two publicly available datasets.
arXiv Detail & Related papers (2021-08-22T13:21:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.