Spatially Aware Multimodal Transformers for TextVQA
- URL: http://arxiv.org/abs/2007.12146v2
- Date: Wed, 23 Dec 2020 03:10:07 GMT
- Title: Spatially Aware Multimodal Transformers for TextVQA
- Authors: Yash Kant, Dhruv Batra, Peter Anderson, Alex Schwing, Devi Parikh,
Jiasen Lu, Harsh Agrawal
- Abstract summary: We study the TextVQA task, i.e., reasoning about text in images to answer a question.
Existing approaches are limited in their use of spatial relations.
We propose a novel spatially aware self-attention layer.
- Score: 61.01618988620582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Textual cues are essential for everyday tasks like buying groceries and using
public transport. To develop this assistive technology, we study the TextVQA
task, i.e., reasoning about text in images to answer a question. Existing
approaches are limited in their use of spatial relations and rely on
fully-connected transformer-like architectures to implicitly learn the spatial
structure of a scene. In contrast, we propose a novel spatially aware
self-attention layer such that each visual entity only looks at neighboring
entities defined by a spatial graph. Further, each head in our multi-head
self-attention layer focuses on a different subset of relations. Our approach
has two advantages: (1) each head considers local context instead of dispersing
the attention amongst all visual entities; (2) we avoid learning redundant
features. We show that our model improves the absolute accuracy of current
state-of-the-art methods on TextVQA by 2.2% overall over an improved baseline,
and 4.62% on questions that involve spatial reasoning and can be answered
correctly using OCR tokens. Similarly on ST-VQA, we improve the absolute
accuracy by 4.2%. We further show that spatially aware self-attention improves
visual grounding.
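For intuition, the proposed layer can be read as masked multi-head self-attention in which each head's attention map is restricted to a different spatial-relation subgraph. The PyTorch sketch below is an illustrative approximation of that idea, not the authors' released implementation; the class name, the `head_masks` argument, and the per-head mask format are assumptions.

```python
# Minimal sketch (assumed, not the paper's code): each attention head is
# restricted, via a boolean mask, to the neighbours connected by its assigned
# subset of spatial relations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatiallyAwareSelfAttention(nn.Module):
    """Masked multi-head self-attention; one relation subgraph per head."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, head_masks: torch.Tensor) -> torch.Tensor:
        # x:          (batch, num_entities, dim) features of visual objects / OCR tokens
        # head_masks: (num_heads, num_entities, num_entities) boolean adjacency,
        #             one spatial-relation subset per attention head (assumed format)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5

        # Keep self-loops so no row is fully masked, then block non-neighbours:
        # each head only attends to entities linked by its assigned relations.
        eye = torch.eye(n, dtype=torch.bool, device=x.device)
        scores = scores.masked_fill(~(head_masks | eye).unsqueeze(0), float("-inf"))

        attn = F.softmax(scores, dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(ctx)
```

In the paper, the neighbourhoods come from a spatial graph built over detected objects and OCR tokens; in this sketch the per-head masks are simply taken as a boolean input.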
Related papers
- TIPS: Text-Image Pretraining with Spatial Awareness [13.38247732379754]
Self-supervised image-only pretraining is still the go-to method for many vision applications.
We propose a novel general-purpose image-text model, which can be effectively used off-the-shelf for dense and global vision tasks.
arXiv Detail & Related papers (2024-10-21T21:05:04Z)
- SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering [2.8974040580489198]
The paper proposes a Scene Graph based co-Attention Network (SceneGATE) for TextVQA.
It reveals the semantic relations among the objects, Optical Character Recognition (OCR) tokens and the question words.
It is achieved by a TextVQA-based scene graph that discovers the underlying semantics of an image.
arXiv Detail & Related papers (2022-12-16T05:10:09Z)
- Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering [23.083935053799145]
Text-based Visual Question Answering (TextVQA) aims to produce correct answers to questions about images containing multiple scene texts.
We introduce 3D geometric information into a human-like spatial reasoning process to capture key objects' contextual knowledge.
Our method achieves state-of-the-art performance on TextVQA and ST-VQA datasets.
arXiv Detail & Related papers (2022-09-21T12:49:14Z)
- Barlow constrained optimization for Visual Question Answering [105.3372546782068]
We propose a novel regularization for VQA models, Constrained Optimization using Barlow's theory (COB).
Our model also aligns the joint space with the answer embedding space, where we consider the answer and the image+question as two different "views" of what is, in essence, the same semantic information (a rough sketch of this idea follows this entry).
When built on the state-of-the-art GGE model, the resulting model improves VQA accuracy by 1.4% and 4% on the VQA-CP v2 and VQA v2 datasets respectively.
arXiv Detail & Related papers (2022-03-07T21:27:40Z)
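The alignment of two "views" described in the COB entry above can be approximated with a Barlow-Twins-style cross-correlation regularizer. The sketch below is an illustrative analogue under that assumption, not the paper's implementation; the function name, the `off_diag_weight` value, and the standardisation details are assumptions.

```python
# Assumed Barlow-Twins-style analogue of the COB idea: treat the fused
# image+question embedding and the answer embedding as two views and push
# their cross-correlation matrix toward the identity.
import torch


def barlow_alignment_loss(joint_emb: torch.Tensor,
                          answer_emb: torch.Tensor,
                          off_diag_weight: float = 5e-3) -> torch.Tensor:
    # joint_emb, answer_emb: (batch, dim) embeddings of the two "views"
    batch_size = joint_emb.shape[0]
    # Standardise each feature dimension over the batch.
    z1 = (joint_emb - joint_emb.mean(0)) / (joint_emb.std(0) + 1e-6)
    z2 = (answer_emb - answer_emb.mean(0)) / (answer_emb.std(0) + 1e-6)
    c = (z1.T @ z2) / batch_size                      # (dim, dim) cross-correlation
    on_diag = (torch.diagonal(c) - 1.0).pow(2).sum()  # pull matched dimensions to 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate the rest
    return on_diag + off_diag_weight * off_diag
```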
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs are in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
- LaTr: Layout-Aware Transformer for Scene-Text VQA [8.390314291424263]
We propose a novel architecture for Scene Text Visual Question Answering (STVQA).
We show that applying this pre-training scheme on scanned documents has certain advantages over using natural images.
Compared to existing approaches, our method performs vocabulary-free decoding and, as shown, generalizes well beyond the training vocabulary.
arXiv Detail & Related papers (2021-12-23T12:41:26Z)
- Vectorization and Rasterization: Self-Supervised Learning for Sketch and Handwriting [168.91748514706995]
We propose two novel cross-modal translation pre-text tasks for self-supervised feature learning: Vectorization and Rasterization.
Our learned encoder modules benefit both raster-based and vector-based downstream approaches to analysing hand-drawn data.
arXiv Detail & Related papers (2021-03-25T09:47:18Z)
- Robust Person Re-Identification through Contextual Mutual Boosting [77.1976737965566]
We propose the Contextual Mutual Boosting Network (CMBN) for person re-identification.
It localizes pedestrians and recalibrates features by effectively exploiting contextual information and statistical inference.
Experiments on the benchmarks demonstrate the superiority of the architecture compared to the state-of-the-art.
arXiv Detail & Related papers (2020-09-16T06:33:35Z)
- A Novel Attention-based Aggregation Function to Combine Vision and Language [55.7633883960205]
We propose a novel fully-attentive reduction method for vision and language.
Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention.
We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices (a rough pooling sketch follows this entry).
arXiv Detail & Related papers (2020-04-27T18:09:46Z)
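To make the reduction idea in the entry above concrete, the sketch below shows a simple score-and-sum attention pooling over a set of modality features. It is an assumed, simplified stand-in for the paper's cross-attention variant; the class name, shapes, and masking convention are illustrative.

```python
# Assumed simplification of attention-based reduction: score each element of a
# modality's feature set, then take the softmax-weighted sum as the reduced vector.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionPool(nn.Module):
    """Reduces a variable-length set of features to a single vector via learned scores."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one scalar relevance score per element

    def forward(self, feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_elements, dim); mask: (batch, num_elements) bool, True = valid
        logits = self.score(feats).squeeze(-1)             # (batch, num_elements)
        logits = logits.masked_fill(~mask, float("-inf"))  # ignore padded elements
        weights = F.softmax(logits, dim=-1)                # attention weights per element
        return (weights.unsqueeze(-1) * feats).sum(dim=1)  # weighted sum -> (batch, dim)
```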
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.