LaTr: Layout-Aware Transformer for Scene-Text VQA
- URL: http://arxiv.org/abs/2112.12494v2
- Date: Fri, 24 Dec 2021 11:06:59 GMT
- Title: LaTr: Layout-Aware Transformer for Scene-Text VQA
- Authors: Ali Furkan Biten, Ron Litman, Yusheng Xie, Srikar Appalaraju, R.
Manmatha
- Abstract summary: We propose a novel architecture for Scene Text Visual Question Answering (STVQA).
We show that applying this pre-training scheme on scanned documents has certain advantages over using natural images.
Compared to existing approaches, our method performs vocabulary-free decoding and, as shown, generalizes well beyond the training vocabulary.
- Score: 8.390314291424263
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel multimodal architecture for Scene Text Visual Question
Answering (STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA
requires models to reason over different modalities. Thus, we first investigate
the impact of each modality, and reveal the importance of the language module,
especially when enriched with layout information. Accounting for this, we
propose a single objective pre-training scheme that requires only text and
spatial cues. We show that applying this pre-training scheme on scanned
documents has certain advantages over using natural images, despite the domain
gap. Scanned documents are easy to procure, text-dense and have a variety of
layouts, helping the model learn various spatial cues (e.g. left-of, below
etc.) by tying together language and layout information. Compared to existing
approaches, our method performs vocabulary-free decoding and, as shown,
generalizes well beyond the training vocabulary. We further demonstrate that
LaTr improves robustness towards OCR errors, a common reason for failure cases
in STVQA. In addition, by leveraging a vision transformer, we eliminate the
need for an external object detector. LaTr outperforms state-of-the-art STVQA
methods on multiple datasets. In particular, +7.6% on TextVQA, +10.8% on ST-VQA
and +4.0% on OCR-VQA (all absolute accuracy numbers).
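The central ingredient, tying language to layout, can be illustrated with a minimal sketch (an assumed, LayoutLM-style construction rather than the paper's exact implementation): each OCR token embedding is summed with learned embeddings of its quantized bounding-box coordinates, and the resulting layout-aware sequence can then be fed to a T5-style encoder-decoder whose generative decoder yields the kind of vocabulary-free answers mentioned above.
```python
# Hedged sketch (not the authors' released code): a layout-aware token
# embedding in the spirit described in the abstract. Each OCR word embedding
# is summed with learned embeddings of its quantized bounding-box coordinates.
# Vocabulary size, model width, bucket count and the coordinate set
# (x1, y1, x2, y2, width, height) are illustrative assumptions.
import torch
import torch.nn as nn


class LayoutAwareEmbedding(nn.Module):
    def __init__(self, vocab_size=32128, d_model=512, n_buckets=1000):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.x_emb = nn.Embedding(n_buckets, d_model)  # shared for x1 and x2
        self.y_emb = nn.Embedding(n_buckets, d_model)  # shared for y1 and y2
        self.w_emb = nn.Embedding(n_buckets, d_model)  # box width
        self.h_emb = nn.Embedding(n_buckets, d_model)  # box height

    def forward(self, token_ids, boxes):
        # token_ids: (batch, seq); boxes: (batch, seq, 4) holding normalized
        # [x1, y1, x2, y2] in [0, 1), quantized into integer buckets below.
        n = self.x_emb.num_embeddings
        q = (boxes * n).long().clamp(0, n - 1)
        x1, y1, x2, y2 = q.unbind(dim=-1)
        w = (x2 - x1).clamp(0, n - 1)
        h = (y2 - y1).clamp(0, n - 1)
        layout = (self.x_emb(x1) + self.x_emb(x2)
                  + self.y_emb(y1) + self.y_emb(y2)
                  + self.w_emb(w) + self.h_emb(h))
        return self.word_emb(token_ids) + layout


# Example: embed a toy batch of 3 OCR tokens with their boxes.
emb = LayoutAwareEmbedding()
tokens = torch.tensor([[12, 345, 678]])
boxes = torch.tensor([[[0.10, 0.20, 0.30, 0.25],
                       [0.35, 0.20, 0.55, 0.25],
                       [0.10, 0.40, 0.45, 0.50]]])
print(emb(tokens, boxes).shape)  # torch.Size([1, 3, 512])
```
The abstract's single-objective pre-training then operates on such text-plus-box inputs taken from scanned documents; the precise objective is detailed in the paper itself.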
Related papers
- Zero-shot Translation of Attention Patterns in VQA Models to Natural
Language [65.94419474119162]
ZS-A2T is a framework that translates the transformer attention of a given model into natural language without requiring any training.
We consider this in the context of Visual Question Answering (VQA).
The framework also allows the drop-in replacement of different guiding sources.
arXiv Detail & Related papers (2023-11-08T22:18:53Z)
- Separate and Locate: Rethink the Text in Text-based Visual Question Answering [15.84929733099542]
We propose Separate and Locate (SaL), which explores text contextual cues and designs spatial position embeddings to construct spatial relations between OCR texts.
Our SaL model outperforms the baseline model by 4.44% and 3.96% accuracy on the TextVQA and ST-VQA datasets, respectively.
arXiv Detail & Related papers (2023-08-31T01:00:59Z)
- Text Reading Order in Uncontrolled Conditions by Sparse Graph Segmentation [71.40119152422295]
We propose a lightweight, scalable and generalizable approach to identify text reading order.
The model is language-agnostic and runs effectively across multi-language datasets.
It is small enough to be deployed on virtually any platform including mobile devices.
arXiv Detail & Related papers (2023-05-04T06:21:00Z)
- TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z)
- Towards Escaping from Language Bias and OCR Error: Semantics-Centered Text Visual Question Answering [14.010472385359163]
Texts in scene images convey critical information for scene understanding and reasoning.
Current TextVQA models do not center on the text and suffer from several limitations.
We propose a novel Semantics-Centered Network (SC-Net) that consists of an instance-level contrastive semantic prediction module and a semantics-centered transformer module.
arXiv Detail & Related papers (2022-03-24T08:21:41Z)
- Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture [0.0]
TextVQA is a dataset geared towards answering questions about visual objects and text objects in images.
One key challenge in TextVQA is the design of a system that effectively reasons not only about visual and text objects individually, but also about the spatial relationships between these objects.
We propose a Graph Relation Transformer (GRT) which uses edge information in addition to node information for graph attention computation in the Transformer.
arXiv Detail & Related papers (2021-11-11T06:55:28Z)
- Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling [12.233796960280944]
Text-VQA (Visual Question Answering) aims at question answering through reading text information in images.
LOGOS is a novel model which attempts to tackle this problem from multiple aspects.
arXiv Detail & Related papers (2021-08-20T01:31:51Z)
- TAP: Text-Aware Pre-training for Text-VQA and Text-Caption [75.44716665758415]
We propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks.
TAP explicitly incorporates scene text (generated from OCR engines) in pre-training.
Our approach outperforms the state of the art by large margins on multiple tasks.
arXiv Detail & Related papers (2020-12-08T18:55:21Z)
- Spatially Aware Multimodal Transformers for TextVQA [61.01618988620582]
We study the TextVQA task, i.e., reasoning about text in images to answer a question.
Existing approaches are limited in their use of spatial relations.
We propose a novel spatially aware self-attention layer.
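(A generic sketch of how spatial relations can bias self-attention over OCR tokens is given after this list.)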
arXiv Detail & Related papers (2020-07-23T17:20:55Z)
- Structured Multimodal Attentions for TextVQA [57.71060302874151]
We propose an end-to-end structured multimodal attention (SMA) neural network to mainly solve the first two issues above.
SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it.
Our proposed model outperforms the SoTA models on the TextVQA dataset and two tasks of the ST-VQA dataset, among all models except the pre-training-based TAP.
arXiv Detail & Related papers (2020-06-01T07:07:36Z)
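To make the recurring idea of spatially aware attention concrete, below is a generic, assumed sketch (not the exact layer of any paper listed above): a learned bias, indexed by the bucketized relative positions of OCR box centers, is added to the attention logits so that relations such as left-of or below can directly influence the attention weights.
```python
# Generic sketch of spatially biased self-attention over OCR tokens (hedged:
# this illustrates the general idea of injecting spatial relations into
# attention scores; all dimensions and bucket counts are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatiallyBiasedSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_dist_buckets=32):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One learned bias per (relative-x bucket, relative-y bucket, head).
        self.n_dist = n_dist_buckets
        self.rel_bias = nn.Embedding(n_dist_buckets * n_dist_buckets, n_heads)

    def _bucket(self, rel):
        # Map relative coordinates in [-1, 1] to integer buckets.
        return ((rel + 1) / 2 * (self.n_dist - 1)).round().long().clamp(0, self.n_dist - 1)

    def forward(self, x, centers):
        # x: (batch, seq, d_model); centers: (batch, seq, 2) normalized box centers.
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # (b, h, t, t)

        # Pairwise relative positions of OCR box centers, bucketized and
        # turned into a per-head additive bias.
        rel = centers[:, :, None, :] - centers[:, None, :, :]  # (b, t, t, 2)
        bx, by = self._bucket(rel[..., 0]), self._bucket(rel[..., 1])
        bias = self.rel_bias(bx * self.n_dist + by)            # (b, t, t, h)
        scores = scores + bias.permute(0, 3, 1, 2)

        attn = F.softmax(scores, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(y)


# Example usage on a toy sequence of 5 OCR tokens.
layer = SpatiallyBiasedSelfAttention()
x = torch.randn(1, 5, 512)
centers = torch.rand(1, 5, 2)
print(layer(x, centers).shape)  # torch.Size([1, 5, 512])
```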