Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language
Representation Learning with Pre-trained Sequence-to-Sequence Model
- URL: http://arxiv.org/abs/2106.15332v1
- Date: Thu, 24 Jun 2021 06:39:37 GMT
- Title: Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language
Representation Learning with Pre-trained Sequence-to-Sequence Model
- Authors: Yixuan Qiao, Hao Chen, Jun Wang, Yihao Chen, Xianbin Ye, Ziliang Li,
Xianbiao Qi, Peng Gao, Guotong Xie
- Abstract summary: TextVQA requires models to read and reason about text in images to answer questions about them.
In this challenge, we use the generative model T5 for the TextVQA task.
- Score: 18.848107244522666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: TextVQA requires models to read and reason about text in images to answer
questions about them. Specifically, models need to incorporate a new modality
of text present in the images and reason over it to answer TextVQA questions.
questions about them. Specifically, models need to incorporate a new modality
of text present in the images and reason over it to answer TextVQA questions.
In this challenge, we use the generative model T5 for the TextVQA task. Starting
from the pre-trained T5-3B checkpoint from the HuggingFace repository, two
additional pre-training tasks, masked language modeling (MLM) and relative
position prediction (RPP), are designed to better align object features and scene
text. During pre-training, the encoder is dedicated to handling the fusion of
multiple modalities: question text, object text labels, scene text labels, object
visual features, and scene visual features. The decoder then generates the text
sequence step by step, trained with the default cross-entropy loss. We use a
large-scale scene text dataset for pre-training and then fine-tune T5-3B on the
TextVQA dataset only.
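The overall pipeline can be sketched with the HuggingFace T5 API. The following is a minimal illustration, not the authors' released code: it assumes t5-base in place of T5-3B, 2048-dimensional detector/OCR region features, and a hypothetical prompt format and visual projection layer; only the general fusion scheme (text tokens plus projected visual features fed to the encoder) and the default cross-entropy objective follow the abstract.

```python
import torch
from torch import nn
from transformers import T5ForConditionalGeneration, T5Tokenizer

# t5-base stands in for the T5-3B checkpoint used in the paper.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
d_model = model.config.d_model

# Hypothetical projection from a 2048-d detector/OCR feature space into T5's
# embedding space; the actual feature extractor and sizes are not specified here.
visual_proj = nn.Linear(2048, d_model)

def build_encoder_inputs(question, object_labels, scene_text, obj_feats, scene_feats):
    """Fuse the five modalities into a single encoder input sequence.

    question, object_labels, scene_text: strings (the prompt format is an assumption)
    obj_feats, scene_feats: float tensors of shape (num_regions, 2048)
    """
    text = f"question: {question} objects: {object_labels} ocr: {scene_text}"
    token_ids = tokenizer(text, return_tensors="pt").input_ids
    text_embeds = model.get_input_embeddings()(token_ids)                  # (1, T, d)
    vis_embeds = visual_proj(torch.cat([obj_feats, scene_feats], 0)).unsqueeze(0)
    inputs_embeds = torch.cat([text_embeds, vis_embeds], dim=1)
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
    return inputs_embeds, attention_mask

# One training step: the decoder generates the answer token by token and the
# model returns the standard cross-entropy loss over the answer tokens.
obj_feats = torch.randn(36, 2048)    # placeholder object-region features
scene_feats = torch.randn(20, 2048)  # placeholder scene-text-region features
inputs_embeds, attention_mask = build_encoder_inputs(
    "what is written on the sign?", "sign pole street", "stop main st",
    obj_feats, scene_feats)
labels = tokenizer("stop", return_tensors="pt").input_ids
loss = model(inputs_embeds=inputs_embeds,
             attention_mask=attention_mask,
             labels=labels).loss
loss.backward()
```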
Related papers
- Zero-shot Translation of Attention Patterns in VQA Models to Natural
Language [65.94419474119162]
ZS-A2T is a framework that translates the transformer attention of a given model into natural language without requiring any training.
We consider this in the context of Visual Question Answering (VQA).
Our framework does not require any training and allows the drop-in replacement of different guiding sources.
arXiv Detail & Related papers (2023-11-08T22:18:53Z)
- Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims at answering questions by reading the text present in the images.
Recent studies have shown that the question-answer pairs in the dataset are more focused on the text present in the image.
The models trained on this dataset predict biased answers due to the lack of understanding of visual context.
arXiv Detail & Related papers (2023-08-01T05:28:13Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z)
- LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking [83.09001231165985]
We propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking.
The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks.
arXiv Detail & Related papers (2022-04-18T16:19:52Z)
- Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling [12.233796960280944]
Text-VQA (Visual Question Answering) aims at question answering through reading text information in images.
LOGOS is a novel model which attempts to tackle this problem from multiple aspects.
arXiv Detail & Related papers (2021-08-20T01:31:51Z)
- Question-controlled Text-aware Image Captioning [41.53906032024941]
Question-controlled Text-aware Image Captioning (Qc-TextCap) is a new challenging task.
GQAM generates a personalized text-aware caption with a Multimodal Decoder.
With questions as control signals, our model generates more informative and diverse captions than the state-of-the-art text-aware captioning model.
arXiv Detail & Related papers (2021-08-04T13:34:54Z)
- TAP: Text-Aware Pre-training for Text-VQA and Text-Caption [75.44716665758415]
We propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks.
TAP explicitly incorporates scene text (generated from OCR engines) in pre-training.
Our approach outperforms the state of the art by large margins on multiple tasks.
arXiv Detail & Related papers (2020-12-08T18:55:21Z)
- Structured Multimodal Attentions for TextVQA [57.71060302874151]
We propose an end-to-end structured multimodal attention (SMA) neural network to mainly solve the first two issues above.
SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it (see the illustrative sketch after this entry).
Our proposed model outperforms the SoTA models on the TextVQA dataset and on two tasks of the ST-VQA dataset among all models except the pre-training-based TAP.
arXiv Detail & Related papers (2020-06-01T07:07:36Z)
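The graph-attention idea summarized in the SMA entry can be illustrated with a short, hypothetical sketch (not the paper's actual architecture): object and scene-text nodes carry feature vectors, an adjacency matrix encodes object-object, object-text and text-text edges, and a masked attention step lets each node aggregate information from its graph neighbours. The dimensions, single attention head, and toy adjacency below are assumptions for illustration only.

```python
import torch
from torch import nn

class GraphAttentionStep(nn.Module):
    """One masked attention step over a graph of object and scene-text nodes."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (N, dim) features of object + scene-text nodes
        # adj:   (N, N) binary adjacency; 1 where an edge (or self-loop) exists
        scores = self.q(nodes) @ self.k(nodes).T / nodes.shape[-1] ** 0.5
        scores = scores.masked_fill(adj == 0, float("-inf"))  # attend only to neighbours
        return torch.softmax(scores, dim=-1) @ self.v(nodes)

# Toy usage: 4 object nodes and 3 scene-text nodes; objects connect to all text
# nodes and to themselves, text nodes are fully connected (purely illustrative).
feats = torch.randn(7, 256)
adj = torch.ones(7, 7)
adj[:4, :4] = torch.eye(4)
updated = GraphAttentionStep(256)(feats, adj)  # (7, 256) neighbour-aggregated features
```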