Multiple-Question Multiple-Answer Text-VQA
- URL: http://arxiv.org/abs/2311.08622v1
- Date: Wed, 15 Nov 2023 01:00:02 GMT
- Title: Multiple-Question Multiple-Answer Text-VQA
- Authors: Peng Tang, Srikar Appalaraju, R. Manmatha, Yusheng Xie, Vijay
Mahadevan
- Abstract summary: Multiple-Question Multiple-Answer (MQMA) is a novel approach to text-VQA in encoder-decoder transformer models.
MQMA takes multiple questions and content as input at the encoder and predicts multiple answers at the decoder in an auto-regressive manner.
We propose a novel MQMA denoising pre-training task which is designed to teach the model to align and delineate multiple questions and content with associated answers.
- Score: 19.228969692887603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Multiple-Question Multiple-Answer (MQMA), a novel approach to
text-VQA in encoder-decoder transformer models. The text-VQA task requires a
model to answer a question by understanding multi-modal content: text
(typically from OCR) and an associated image. To the best of our knowledge,
almost all previous approaches for text-VQA process a single question and its
associated content to predict a single answer. To answer multiple questions
from the same image, these approaches must therefore run the model once per
question, feeding in the same content each time. In contrast, our proposed MQMA
approach takes multiple questions and the content as input at the encoder and
predicts multiple answers at the decoder auto-regressively, in a single pass.
We make several
novel architectural modifications to standard encoder-decoder transformers to
support MQMA. We also propose a novel MQMA denoising pre-training task which is
designed to teach the model to align and delineate multiple questions and
content with associated answers. The MQMA pre-trained model achieves
state-of-the-art results on multiple text-VQA datasets, each with strong
baselines. Specifically, it obtains absolute improvements over the previous
state-of-the-art approaches on OCR-VQA (+2.5%), TextVQA (+1.4%), ST-VQA (+0.6%),
and DocVQA (+1.1%).
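To make the multi-question input/output format concrete, below is a minimal Python sketch of how several questions for one image might be packed into a single encoder input and a single decoder target. The marker tokens (<q1>, <a1>, <context>, <mask>) and the answer-masking step are illustrative assumptions, not the paper's actual tokenization, architectural modifications, or denoising pre-training recipe.

```python
# Minimal sketch of MQMA-style example packing, assuming a T5-like
# encoder-decoder. All special markers (<q1>, <a1>, <context>, <mask>) are
# hypothetical; the paper's real architectural changes and denoising
# pre-training details are not reproduced here.
import random
from typing import List, Optional, Tuple


def build_mqma_example(
    ocr_text: str,
    questions: List[str],
    answers: Optional[List[str]] = None,
    answer_mask_prob: float = 0.0,
) -> Tuple[str, str]:
    """Pack several questions (and optionally answers) for one image.

    Encoder input : all questions plus the shared OCR/content string.
    Decoder target: all answers, delimited per question, so they can be
                    decoded auto-regressively in a single pass.
    """
    question_part = " ".join(f"<q{i + 1}> {q}" for i, q in enumerate(questions))
    encoder_input = f"{question_part} <context> {ocr_text}"

    if answers is None:
        return encoder_input, ""

    # Rough analogue of a denoising objective: randomly mask some answers in
    # the target so the model must align each remaining question with its
    # answer (an assumption, not the paper's exact MQMA denoising task).
    target_parts = []
    for i, answer in enumerate(answers):
        token = "<mask>" if random.random() < answer_mask_prob else answer
        target_parts.append(f"<a{i + 1}> {token}")
    decoder_target = " ".join(target_parts)
    return encoder_input, decoder_target


if __name__ == "__main__":
    enc, tgt = build_mqma_example(
        ocr_text="TOTAL 12.50 DATE 03/14 STORE #42",
        questions=["What is the total?", "What is the store number?"],
        answers=["12.50", "42"],
    )
    print(enc)  # <q1> ... <q2> ... <context> TOTAL 12.50 ...
    print(tgt)  # <a1> 12.50 <a2> 42
```

Packing examples this way, rather than running the model once per question, lets the encoder process the shared content a single time while the decoder emits all answers in one auto-regressive pass.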
Related papers
- SEMQA: Semi-Extractive Multi-Source Question Answering [94.04430035121136]
We introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion.
We create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions.
arXiv Detail & Related papers (2023-11-08T18:46:32Z)
- Diversity Enhanced Narrative Question Generation for Storybooks [4.043005183192124]
We introduce a multi-question generation model (mQG) capable of generating multiple, diverse, and answerable questions.
To validate the answerability of the generated questions, we employ a SQuAD2.0 fine-tuned question answering model.
mQG shows promising results across various evaluation metrics compared with strong baselines.
arXiv Detail & Related papers (2023-10-25T08:10:04Z)
- An Empirical Comparison of LM-based Question and Answer Generation Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context.
In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning.
Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
arXiv Detail & Related papers (2023-05-26T14:59:53Z)
- TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z)
- Question Answering Infused Pre-training of General-Purpose Contextualized Representations [70.62967781515127]
We propose a pre-training objective based on question answering (QA) for learning general-purpose contextual representations.
We accomplish this goal by training a bi-encoder QA model, which independently encodes passages and questions, to match the predictions of a more accurate cross-encoder model.
We show large improvements over both RoBERTa-large and previous state-of-the-art results on zero-shot and few-shot paraphrase detection.
arXiv Detail & Related papers (2021-06-15T14:45:15Z)
- ParaQA: A Question Answering Dataset with Paraphrase Responses for Single-Turn Conversation [5.087932295628364]
ParaQA is a dataset with multiple paraphrased responses for single-turn conversation over knowledge graphs (KGs).
The dataset was created using a semi-automated framework for generating diverse paraphrasing of the answers using techniques such as back-translation.
arXiv Detail & Related papers (2021-03-13T18:53:07Z)
- Structured Multimodal Attentions for TextVQA [57.71060302874151]
We propose an end-to-end structured multimodal attention (SMA) neural network to mainly solve the first two issues above.
SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it.
Our proposed model outperforms the SoTA models on the TextVQA dataset and on two tasks of the ST-VQA dataset, among all models except the pre-training-based TAP.
arXiv Detail & Related papers (2020-06-01T07:07:36Z)
- Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.