Separate and Locate: Rethink the Text in Text-based Visual Question
Answering
- URL: http://arxiv.org/abs/2308.16383v1
- Date: Thu, 31 Aug 2023 01:00:59 GMT
- Title: Separate and Locate: Rethink the Text in Text-based Visual Question
Answering
- Authors: Chengyang Fang, Jiangnan Li, Liang Li, Can Ma, Dayong Hu
- Abstract summary: We propose Separate and Locate (SaL), which explores text contextual cues and designs spatial position embeddings to construct spatial relations between OCR texts.
Our SaL model outperforms the baseline model by 4.44% and 3.96% accuracy on TextVQA and ST-VQA datasets.
- Score: 15.84929733099542
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-based Visual Question Answering (TextVQA) aims at answering questions
about the text in images. Most works in this field focus on designing network
structures or pre-training tasks. All these methods list the OCR texts in
reading order (from left to right and top to bottom) to form a sequence, which
is treated as a natural language ``sentence''. However, they ignore the fact
that most OCR words in the TextVQA task do not have a semantic contextual
relationship. In addition, these approaches use 1-D position embedding to
construct the spatial relation between OCR tokens sequentially, which is not
reasonable. The 1-D position embedding can only represent the left-right
sequence relationship between words in a sentence, but not the complex spatial
position relationship. To tackle these problems, we propose a novel method
named Separate and Locate (SaL) that explores text contextual cues and designs
spatial position embedding to construct spatial relations between OCR texts.
Specifically, we propose a Text Semantic Separate (TSS) module that helps the
model recognize whether words have semantic contextual relations. Then, we
introduce a Spatial Circle Position (SCP) module that helps the model better
construct and reason the spatial position relationships between OCR texts. Our
SaL model outperforms the baseline model by 4.44% and 3.96% accuracy on TextVQA
and ST-VQA datasets. Compared with the pre-training state-of-the-art method
pre-trained on 64 million pre-training samples, our method, without any
pre-training tasks, still achieves 2.68% and 2.52% accuracy improvement on
TextVQA and ST-VQA. Our code and models will be released at
https://github.com/fangbufang/SaL.
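
As a rough illustration of the contrast drawn in the abstract between a 1-D reading-order position embedding and a spatial position embedding built from OCR bounding boxes, the PyTorch sketch below encodes each OCR token either by its reading-order index or by the bucketed (x, y) center of its box. All class names, bucket sizes, and dimensions here are illustrative assumptions; this is not the paper's SCP module (see the authors' repository above for the real implementation).

```python
# Illustrative sketch only (not the official SaL code): contrasts a 1-D
# reading-order position embedding with a simple 2-D spatial embedding
# derived from OCR bounding boxes.
import torch
import torch.nn as nn


class SequentialPositionEmbedding(nn.Module):
    """1-D embedding indexed by reading order (left-to-right, top-to-bottom)."""

    def __init__(self, max_tokens: int = 128, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(max_tokens, dim)

    def forward(self, num_tokens: int) -> torch.Tensor:
        positions = torch.arange(num_tokens)      # 0, 1, 2, ...
        return self.embed(positions)              # (num_tokens, dim)


class SpatialPositionEmbedding(nn.Module):
    """2-D embedding indexed by the quantized center of each OCR bounding box."""

    def __init__(self, num_buckets: int = 32, dim: int = 256):
        super().__init__()
        self.num_buckets = num_buckets
        self.x_embed = nn.Embedding(num_buckets, dim // 2)
        self.y_embed = nn.Embedding(num_buckets, dim // 2)

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (num_tokens, 4) as (x1, y1, x2, y2), normalized to [0, 1].
        cx = (boxes[:, 0] + boxes[:, 2]) / 2
        cy = (boxes[:, 1] + boxes[:, 3]) / 2
        x_idx = (cx * (self.num_buckets - 1)).long().clamp(0, self.num_buckets - 1)
        y_idx = (cy * (self.num_buckets - 1)).long().clamp(0, self.num_buckets - 1)
        # Concatenating x/y embeddings lets vertically or diagonally adjacent
        # tokens share spatial signal that a single reading-order index cannot express.
        return torch.cat([self.x_embed(x_idx), self.y_embed(y_idx)], dim=-1)


if __name__ == "__main__":
    boxes = torch.tensor([[0.10, 0.05, 0.30, 0.10],   # word at top-left
                          [0.70, 0.05, 0.90, 0.10],   # word at top-right
                          [0.10, 0.80, 0.30, 0.85]])  # word at bottom-left
    seq = SequentialPositionEmbedding()(boxes.size(0))
    spa = SpatialPositionEmbedding()(boxes)
    print(seq.shape, spa.shape)  # torch.Size([3, 256]) torch.Size([3, 256])
```

The point of the sketch is only that a bucketed (x, y) index preserves two-dimensional layout information that the left-to-right, top-to-bottom ordering discards, which is the gap the SCP module is designed to close.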
Related papers
- Locate Then Generate: Bridging Vision and Language with Bounding Box for
Scene-Text VQA [15.74007067413724]
We propose a novel framework for Scene Text Visual Question Answering (STVQA), which requires models to read scene text in images to answer questions.
arXiv Detail & Related papers (2023-04-04T07:46:40Z)
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
- LaTr: Layout-Aware Transformer for Scene-Text VQA [8.390314291424263]
We propose a novel architecture for Scene Text Visual Question Answering (STVQA).
We show that applying this pre-training scheme on scanned documents has certain advantages over using natural images.
Compared to existing approaches, our method performs vocabulary-free decoding and, as shown, generalizes well beyond the training vocabulary.
arXiv Detail & Related papers (2021-12-23T12:41:26Z)
- Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language
Representation Learning with Pre-trained Sequence-to-Sequence Model [18.848107244522666]
TextVQA requires models to read and reason about text in images to answer questions about them.
In this challenge, we use the generative model T5 for the TextVQA task.
arXiv Detail & Related papers (2021-06-24T06:39:37Z)
- TAP: Text-Aware Pre-training for Text-VQA and Text-Caption [75.44716665758415]
We propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks.
TAP explicitly incorporates scene text (generated from OCR engines) in pre-training.
Our approach outperforms the state of the art by large margins on multiple tasks.
arXiv Detail & Related papers (2020-12-08T18:55:21Z)
- Spatially Aware Multimodal Transformers for TextVQA [61.01618988620582]
We study the TextVQA task, i.e., reasoning about text in images to answer a question.
Existing approaches are limited in their use of spatial relations.
We propose a novel spatially aware self-attention layer.
arXiv Detail & Related papers (2020-07-23T17:20:55Z)
- Rethinking Positional Encoding in Language Pre-training [111.2320727291926]
We show that in absolute positional encoding, the addition operation applied to positional embeddings and word embeddings brings mixed correlations.
We propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE).
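As a hedged illustration of the "mixed correlations" point (notation mine; details such as the relative-position bias and [CLS] handling in the paper are omitted), adding absolute positions to word embeddings makes the pre-softmax attention logit between tokens $i$ and $j$ expand as
$$(x_i + p_i) W_Q W_K^\top (x_j + p_j)^\top = x_i W_Q W_K^\top x_j^\top + x_i W_Q W_K^\top p_j^\top + p_i W_Q W_K^\top x_j^\top + p_i W_Q W_K^\top p_j^\top,$$
where the two middle word-position terms are the mixed correlations. The untied encoding instead scores content and position separately with their own projections, roughly $\alpha_{ij} \propto \big((x_i W_Q)(x_j W_K)^\top + (p_i U_Q)(p_j U_K)^\top\big)/\sqrt{2d}$.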
arXiv Detail & Related papers (2020-06-28T13:11:02Z)
- Structured Multimodal Attentions for TextVQA [57.71060302874151]
We propose an end-to-end structured multimodal attention (SMA) neural network for the TextVQA task.
SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it.
Our proposed model outperforms the SoTA models on the TextVQA dataset and two tasks of the ST-VQA dataset, among all models except the pre-training-based TAP.
arXiv Detail & Related papers (2020-06-01T07:07:36Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content
Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.