Structured Multimodal Attentions for TextVQA
- URL: http://arxiv.org/abs/2006.00753v2
- Date: Fri, 26 Nov 2021 03:00:58 GMT
- Title: Structured Multimodal Attentions for TextVQA
- Authors: Chenyu Gao and Qi Zhu and Peng Wang and Hui Li and Yuliang Liu and
Anton van den Hengel and Qi Wu
- Abstract summary: We propose an end-to-end structured multimodal attention (SMA) neural network for TextVQA.
SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it.
Our proposed model outperforms all SoTA models except the pre-training based TAP on the TextVQA dataset and two tasks of the ST-VQA dataset.
- Score: 57.71060302874151
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose an end-to-end structured multimodal attention (SMA)
neural network, mainly to solve the first two issues above. SMA first uses a
structural graph representation to encode the object-object, object-text and
text-text relationships appearing in the image, and then designs a multimodal
graph attention network to reason over it. Finally, the outputs of these
modules are processed by a global-local attentional answering module that
produces an answer by iteratively splicing together tokens from both the OCR
results and a general vocabulary, following M4C. Our proposed model outperforms
all SoTA models on the TextVQA dataset and two tasks of the ST-VQA dataset,
except the pre-training based TAP. Demonstrating strong reasoning ability, it
also won first place in the TextVQA Challenge 2020. We extensively test
different OCR methods on several reasoning models and investigate the impact of
gradually improved OCR performance on the TextVQA benchmark. With better OCR
results, all models show dramatic improvements in VQA accuracy, but ours
benefits the most, owing to its strong textual-visual reasoning ability. To
establish an upper bound for our method and provide a fair testing base for
further work, we also release human-annotated ground-truth OCR annotations for
the TextVQA dataset, which were not included in the original release. The code
and ground-truth OCR annotations for the TextVQA dataset are available at
https://github.com/ChenyuGAO-CS/SMA
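
To make the architecture description above concrete, the following is a minimal PyTorch sketch of the two ideas the abstract names: question-conditioned attention over a graph whose nodes are detected objects and OCR tokens (connected by object-object, object-text and text-text edges), and an M4C-style answering step that scores both a fixed vocabulary and the in-image OCR tokens so an answer can splice tokens from either source. It is not the authors' released implementation (that is at the GitHub link above); all class, function and tensor names here are illustrative.

```python
# Minimal sketch only; not the released SMA code. Names are illustrative.
import torch
import torch.nn as nn


class MultimodalGraphAttention(nn.Module):
    """One question-conditioned attention pass over object and OCR-token nodes.

    Unlike the full model, this sketch does not distinguish edge types
    (object-object, object-text, text-text); a single adjacency matrix stands
    in for the structural graph.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.node_proj = nn.Linear(dim, dim)

    def forward(self, question, nodes, adjacency):
        # question:  (B, D)    pooled question embedding
        # nodes:     (B, N, D) features of detected objects and OCR tokens
        # adjacency: (B, N, N) 1.0 where an edge exists, 0.0 otherwise
        q = self.query_proj(question).unsqueeze(1)                       # (B, 1, D)
        k = self.node_proj(nodes)                                        # (B, N, D)
        relevance = torch.softmax((q * k).sum(-1, keepdim=True), dim=1)  # (B, N, 1)
        # Each node aggregates its neighbours, weighted by question relevance.
        neighbour_msg = torch.bmm(adjacency, relevance * nodes)          # (B, N, D)
        return nodes + neighbour_msg


class AnswerStep(nn.Module):
    """One M4C-style decoding step: score a fixed vocabulary and the OCR tokens."""

    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.vocab_head = nn.Linear(dim, vocab_size)
        self.ocr_proj = nn.Linear(dim, dim)

    def forward(self, decoder_state, ocr_nodes):
        # decoder_state: (B, D), ocr_nodes: (B, M, D)
        vocab_scores = self.vocab_head(decoder_state)                    # (B, V)
        ocr_scores = torch.bmm(self.ocr_proj(ocr_nodes),
                               decoder_state.unsqueeze(-1)).squeeze(-1)  # (B, M)
        # The argmax over the concatenated scores picks either a vocabulary
        # word or an OCR token copied from the image; running this step
        # repeatedly splices a multi-token answer, as in M4C.
        return torch.cat([vocab_scores, ocr_scores], dim=-1)
```

In the full model, each edge type would presumably carry its own parameters, and the answering step runs iteratively, feeding the chosen token back in as the next decoder input, matching the abstract's description of splicing tokens together by following M4C.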
Related papers
- Separate and Locate: Rethink the Text in Text-based Visual Question
Answering [15.84929733099542]
We propose Separate and Locate (SaL), which explores text contextual cues and designs a spatial position embedding to construct spatial relations between OCR texts.
Our SaL model outperforms the baseline model by 4.44% and 3.96% accuracy on the TextVQA and ST-VQA datasets.
arXiv Detail & Related papers (2023-08-31T01:00:59Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z)
- Towards Escaping from Language Bias and OCR Error: Semantics-Centered Text Visual Question Answering [14.010472385359163]
Texts in scene images convey critical information for scene understanding and reasoning.
Current TextVQA models do not center on the text and suffer from several limitations.
We propose a novel Semantics-Centered Network (SC-Net) that consists of an instance-level contrastive semantic prediction module and a semantics-centered transformer module.
arXiv Detail & Related papers (2022-03-24T08:21:41Z)
- LaTr: Layout-Aware Transformer for Scene-Text VQA [8.390314291424263]
We propose a novel architecture for Scene Text Visual Question Answering (STVQA).
We show that applying this pre-training scheme on scanned documents has certain advantages over using natural images.
Compared to existing approaches, our method performs vocabulary-free decoding and, as shown, generalizes well beyond the training vocabulary.
arXiv Detail & Related papers (2021-12-23T12:41:26Z)
- Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling [12.233796960280944]
Text-VQA (Visual Question Answering) aims at question answering through reading text information in images.
LOGOS is a novel model which attempts to tackle this problem from multiple aspects.
arXiv Detail & Related papers (2021-08-20T01:31:51Z)
- Question Answering Infused Pre-training of General-Purpose Contextualized Representations [70.62967781515127]
We propose a pre-training objective based on question answering (QA) for learning general-purpose contextual representations.
We accomplish this goal by training a bi-encoder QA model, which independently encodes passages and questions, to match the predictions of a more accurate cross-encoder model (see the distillation sketch after this list).
We show large improvements over both RoBERTa-large and previous state-of-the-art results on zero-shot and few-shot paraphrase detection.
arXiv Detail & Related papers (2021-06-15T14:45:15Z)
- TAP: Text-Aware Pre-training for Text-VQA and Text-Caption [75.44716665758415]
We propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks.
TAP explicitly incorporates scene text (generated from OCR engines) in pre-training.
Our approach outperforms the state of the art by large margins on multiple tasks.
arXiv Detail & Related papers (2020-12-08T18:55:21Z)
- Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of the data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
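
The Question Answering Infused Pre-training entry above summarizes its core idea as training a bi-encoder, which encodes passages and questions independently, to match the predictions of a more accurate cross-encoder. Below is a minimal sketch of that kind of bi-encoder-from-cross-encoder distillation; the encoder modules, the candidate-passage setup and the KL-divergence objective are assumptions made for illustration, not details taken from that paper.

```python
# Minimal sketch of distilling a cross-encoder QA scorer into a bi-encoder.
# The encoders, candidate set and KL objective are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def distillation_loss(bi_question_enc: nn.Module,
                      bi_passage_enc: nn.Module,
                      cross_encoder: nn.Module,       # returns one score per pair
                      question_ids: torch.Tensor,     # (B, Lq) token ids
                      passage_ids: torch.Tensor,      # (B, K, Lp) K candidate passages
                      ) -> torch.Tensor:
    B, K, Lp = passage_ids.shape

    # Bi-encoder (student): encode questions and passages independently,
    # then score each pair with a dot product.
    q_vec = bi_question_enc(question_ids)                        # (B, D)
    p_vec = bi_passage_enc(passage_ids.view(B * K, Lp))          # (B*K, D)
    p_vec = p_vec.view(B, K, -1)
    student_logits = torch.einsum("bd,bkd->bk", q_vec, p_vec)    # (B, K)

    # Cross-encoder (teacher): jointly encode each (question, passage) pair;
    # more accurate but too slow for retrieval, so it only supplies targets.
    with torch.no_grad():
        pair_ids = torch.cat(
            [question_ids.unsqueeze(1).expand(-1, K, -1), passage_ids], dim=-1)
        teacher_logits = cross_encoder(pair_ids.view(B * K, -1)).view(B, K)

    # Train the student's distribution over candidates to match the teacher's.
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
```

The appeal of this setup is efficiency: after training, only the bi-encoder is needed, so passages can be encoded once offline and matched to new questions with a simple dot product.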