Towards Escaping from Language Bias and OCR Error: Semantics-Centered
Text Visual Question Answering
- URL: http://arxiv.org/abs/2203.12929v1
- Date: Thu, 24 Mar 2022 08:21:41 GMT
- Title: Towards Escaping from Language Bias and OCR Error: Semantics-Centered
Text Visual Question Answering
- Authors: Chengyang Fang, Gangyan Zeng, Yu Zhou, Daiqing Wu, Can Ma, Dayong Hu,
Weiping Wang
- Abstract summary: Texts in scene images convey critical information for scene understanding and reasoning.
Current TextVQA models do not center on the text and suffer from several limitations.
We propose a novel Semantics-Centered Network (SC-Net) that consists of an instance-level contrastive semantic prediction module and a semantics-centered transformer module.
- Score: 14.010472385359163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Texts in scene images convey critical information for scene understanding and
reasoning. The abilities of reading and reasoning matter for the model in the
text-based visual question answering (TextVQA) process. However, current
TextVQA models do not center on the text and suffer from several limitations.
The model is easily dominated by language biases and optical character
recognition (OCR) errors due to the absence of semantic guidance in the answer
prediction process. In this paper, we propose a novel Semantics-Centered
Network (SC-Net) that consists of an instance-level contrastive semantic
prediction module (ICSP) and a semantics-centered transformer module (SCT).
Equipped with the two modules, the semantics-centered model can resist the
language biases and the accumulated errors from OCR. Extensive experiments on
TextVQA and ST-VQA datasets show the effectiveness of our model. SC-Net
surpasses previous works with a noticeable margin and is more reasonable for
the TextVQA task.
Related papers
- SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition [77.28814034644287]
We propose SVTRv2, a CTC model that beats leading EDTRs in both accuracy and inference speed.
SVTRv2 introduces novel upgrades to handle text irregularity and utilize linguistic context.
We evaluate SVTRv2 in both standard and recent challenging benchmarks.
arXiv Detail & Related papers (2024-11-24T14:21:35Z)
- Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering [8.382903851560595]
Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content.
Most existing methods heavily rely on the accuracy of Optical Character Recognition (OCR) systems.
We propose a multimodal adversarial training architecture with spatial awareness capabilities.
arXiv Detail & Related papers (2024-03-14T11:22:06Z)
- VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models [58.21452697997078]
We propose a novel VQAttack model, which can generate both image and text perturbations with the designed modules.
Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQAttack.
arXiv Detail & Related papers (2024-02-16T21:17:42Z)
- Zero-shot Translation of Attention Patterns in VQA Models to Natural Language [65.94419474119162]
ZS-A2T is a framework that translates the transformer attention of a given model into natural language without requiring any training.
We consider this in the context of Visual Question Answering (VQA).
Our framework does not require any training and allows the drop-in replacement of different guiding sources.
arXiv Detail & Related papers (2023-11-08T22:18:53Z)
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
arXiv Detail & Related papers (2023-06-15T12:29:42Z)
- RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models [36.19590638188108]
We create new variants of texts and images in the MS-COCO test set and re-evaluate the state-of-the-art (SOTA) models with the new data.
Specifically, we alter the meaning of text by replacing a word, and generate visually altered images that maintain some visual context.
Our evaluations on the proposed benchmark reveal substantial performance degradation in many SOTA models.
arXiv Detail & Related papers (2023-04-21T03:45:59Z)
- LaTr: Layout-Aware Transformer for Scene-Text VQA [8.390314291424263]
We propose a novel architecture for Scene Text Visual Question Answering (STVQA).
We show that applying this pre-training scheme on scanned documents has certain advantages over using natural images.
Compared to existing approaches, our method performs vocabulary-free decoding and, as shown, generalizes well beyond the training vocabulary.
arXiv Detail & Related papers (2021-12-23T12:41:26Z)
- TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder to decode the features to text, leading to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG).
arXiv Detail & Related papers (2021-11-16T09:10:39Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
- Structured Multimodal Attentions for TextVQA [57.71060302874151]
We propose an end-to-end structured multimodal attention (SMA) neural network to mainly solve the first two issues above.
SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it.
Our proposed model outperforms the SoTA models on the TextVQA dataset and two tasks of the ST-VQA dataset among all models except the pre-training-based TAP.
arXiv Detail & Related papers (2020-06-01T07:07:36Z)