Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering
- URL: http://arxiv.org/abs/2403.09288v1
- Date: Thu, 14 Mar 2024 11:22:06 GMT
- Title: Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering
- Authors: Zhixuan Shen, Haonan Luo, Sijia Li, Tianrui Li
- Abstract summary: Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content.
Most existing methods heavily rely on the accuracy of Optical Character Recognition (OCR) systems.
We propose a multimodal adversarial training architecture with spatial awareness capabilities.
- Score: 8.382903851560595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content. Most existing methods rely heavily on the accuracy of Optical Character Recognition (OCR) systems, and aggressive fine-tuning on limited spatial location information and erroneous OCR text often leads to inevitable overfitting. In this paper, we propose a multimodal adversarial training architecture with spatial awareness capabilities. Specifically, we introduce an Adversarial OCR Enhancement (AOE) module, which leverages adversarial training in the embedding space of the OCR modality to learn fault-tolerant representations of OCR texts, thereby reducing the noise caused by OCR errors. Simultaneously, we add a Spatial-Aware Self-Attention (SASA) mechanism to help the model better capture the spatial relationships among OCR tokens. Extensive experiments demonstrate that our method achieves significant performance improvements on both the ST-VQA and TextVQA datasets and provides a novel paradigm for multimodal adversarial training.
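To make the two components concrete, here is a minimal PyTorch sketch of the general recipe the abstract describes: a single-step, FGSM-style perturbation applied in the OCR embedding space, plus an attention-score bias derived from OCR-box geometry. The function names, step size, single-step attack, and bias parameterization are illustrative assumptions, not the paper's exact AOE/SASA formulation.

```python
import torch
import torch.nn.functional as F

def aoe_style_loss(model, ocr_embeds, inputs, labels, epsilon=1e-3):
    # Sketch of embedding-space adversarial training (assumes
    # ocr_embeds is the embedding-layer output with requires_grad).
    clean_loss = F.cross_entropy(model(ocr_embeds, **inputs), labels)
    # Gradient of the loss w.r.t. the OCR embeddings only.
    grad, = torch.autograd.grad(clean_loss, ocr_embeds, retain_graph=True)
    # Perturb toward higher loss to mimic noisy/erroneous OCR tokens.
    ocr_adv = ocr_embeds + epsilon * grad.detach().sign()
    adv_loss = F.cross_entropy(model(ocr_adv, **inputs), labels)
    # Training on clean + adversarial views encourages fault-tolerant
    # OCR representations.
    return clean_loss + adv_loss

def spatial_aware_scores(q, k, spatial_bias):
    # SASA-flavored attention: scaled dot-product scores plus a bias
    # computed from relative OCR-box positions (placeholder here).
    return q @ k.transpose(-2, -1) / q.size(-1) ** 0.5 + spatial_bias
```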
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
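A minimal sketch of the DPTR idea above, assuming the Hugging Face transformers CLIP checkpoint openai/clip-vit-base-patch32 and a generic Transformer decoder; the decoder depth and character-query scheme are illustrative, not the paper's architecture.

```python
import torch
from torch import nn
from transformers import CLIPTokenizer, CLIPTextModel

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

def pseudo_visual_embeddings(texts):
    # CLIP text features stand in for image features during decoder
    # pre-training, so no images are needed at this stage.
    batch = tok(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        return clip_text(**batch).last_hidden_state  # (B, T, 512)

layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=3)
memory = pseudo_visual_embeddings(["coffee", "exit"])
queries = torch.zeros(2, 25, 512)  # placeholder character queries
out = decoder(tgt=queries, memory=memory)  # feed to a character head
```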
- DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multiple fonts, mixed scenes, and complex layouts severely degrade the recognition accuracy of traditional OCR models.
We propose a parameter-efficient mixed text recognition method based on a pre-trained OCR Transformer, namely DLoRA-TrOCR.
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
- ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting [8.397246652127793]
We propose a new pre-training method called OCR-Text Destylization Modeling (ODM).
ODM transfers diverse styles of text found in images to a uniform style based on the text prompt.
Our method significantly improves performance and outperforms current pre-training methods in scene text detection and spotting tasks.
arXiv Detail & Related papers (2024-03-01T06:13:53Z)
- UPOCR: Towards Unified Pixel-Level OCR Interface [36.966005829678124]
We propose UPOCR, a simple-yet-effective generalist model for Unified Pixel-level OCR interface.
Specifically, UPOCR unifies the paradigm of diverse OCR tasks as image-to-image transformation and the architecture as a vision Transformer (ViT)-based encoder-decoder.
Experiments are conducted on three pixel-level OCR tasks including text removal, text segmentation, and tampered text detection.
arXiv Detail & Related papers (2023-12-05T11:53:17Z)
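A toy rendering of the UPOCR recipe above: every task is image-in/image-out, and a learnable task prompt switches the behavior. Dimensions, depth, and the prompt-injection point are assumptions for illustration.

```python
import torch
from torch import nn

class TinyUnifiedOCR(nn.Module):
    def __init__(self, dim=256, num_tasks=3, patch=16):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)  # patchify
        self.task_prompt = nn.Embedding(num_tasks, dim)      # one per task
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_pixels = nn.ConvTranspose2d(dim, 3, patch, stride=patch)

    def forward(self, img, task_id):
        x = self.embed(img)                    # (B, C, H/16, W/16)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, HW, C)
        tokens = tokens + self.task_prompt(task_id)[:, None, :]
        x = self.encoder(tokens).transpose(1, 2).reshape(b, c, h, w)
        return self.to_pixels(x)               # same-size output image

model = TinyUnifiedOCR()
removed = model(torch.randn(1, 3, 224, 224), torch.tensor([0]))  # e.g. text removal
```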
- Cross-modal Active Complementary Learning with Self-refining Correspondence [54.61307946222386]
We propose a Cross-modal Robust Complementary Learning framework (CRCL) to improve the robustness of existing methods.
Its Active Complementary Loss (ACL) exploits active and complementary learning to reduce the risk of providing erroneous supervision.
Its Self-refining Correspondence Correction (SCC) utilizes multiple self-refining processes with momentum correction to enlarge the receptive field for correcting correspondences.
arXiv Detail & Related papers (2023-10-26T15:15:11Z)
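As a hedged illustration of the complementary-learning ingredient in CRCL above: rather than asserting that a pair matches, a complementary loss penalizes confidence in pairs known not to correspond. This generic form is a stand-in, not the paper's exact ACL.

```python
import torch
import torch.nn.functional as F

def complementary_matching_loss(sim_logits, comp_labels):
    # sim_logits: (B, N) similarities of each query to N candidates;
    # comp_labels: (B,) indices of candidates known NOT to match.
    probs = F.softmax(sim_logits, dim=-1)
    p_wrong = probs.gather(1, comp_labels.unsqueeze(1)).squeeze(1)
    # Maximize the probability mass placed anywhere except the
    # complementary label; gentler than trusting noisy positives.
    return -torch.log((1.0 - p_wrong).clamp_min(1e-8)).mean()
```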
- PreSTU: Pre-Training for Scene-Text Understanding [49.288302725486226]
We propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU).
PreSTU introduces OCR-aware pre-training objectives that encourage the model to recognize text from an image and connect it to the rest of the image content.
We empirically demonstrate the effectiveness of this pre-training approach on eight visual question answering and four image captioning benchmarks.
arXiv Detail & Related papers (2022-09-12T18:29:55Z)
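A minimal sketch of an OCR-aware objective in the PreSTU spirit: the decoder is trained with next-token cross-entropy over a sequence that places the scene text before the downstream target, so reading the text is rewarded first. The token layout and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def ocr_aware_loss(decoder_logits, combined_targets):
    # decoder_logits: (B, T, V); combined_targets: (B, T) laid out as
    # [OCR tokens] + [separator] + [task tokens] (assumed layout).
    return F.cross_entropy(decoder_logits.transpose(1, 2), combined_targets)

# Toy usage with an assumed 1000-token vocabulary:
logits = torch.randn(2, 12, 1000)
targets = torch.randint(0, 1000, (2, 12))
loss = ocr_aware_loss(logits, targets)
```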
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that the pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
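One plausible shape for the weak supervision above is a symmetric contrastive loss aligning image features with features from the character-aware text encoder; this InfoNCE form is a generic stand-in, not the paper's exact objectives.

```python
import torch
import torch.nn.functional as F

def alignment_loss(img_feats, txt_feats, temperature=0.07):
    # img_feats/txt_feats: (B, D) from the image encoder and the
    # character-aware text encoder; matched pairs share an index.
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```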
- TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text [23.04601165885908]
We propose TextOCR, an arbitrary-shaped scene-text detection and recognition dataset with 900k annotated words collected on real images.
We show that current state-of-the-art text-recognition (OCR) models fail to perform well on TextOCR.
We use a TextOCR-trained OCR model to create the PixelM4C model, which can perform scene-text-based reasoning on an image in an end-to-end fashion.
arXiv Detail & Related papers (2021-05-12T07:50:42Z)
- Contextualized Attention-based Knowledge Transfer for Spoken Conversational Question Answering [63.72278693825945]
Spoken conversational question answering (SCQA) requires machines to model complex dialogue flow.
We propose CADNet, a novel contextualized attention-based distillation approach.
We conduct extensive experiments on the Spoken-CoQA dataset and demonstrate that our approach achieves remarkable performance.
arXiv Detail & Related papers (2020-10-21T15:17:18Z)
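A hedged sketch of attention-based distillation as described for CADNet above: the student matches both the teacher's attention maps and its softened output distribution. The pairing of the two KL terms and the temperature are assumptions.

```python
import torch.nn.functional as F

def distillation_loss(s_attn, t_attn, s_logits, t_logits, T=2.0):
    # Attention transfer: KL between student and teacher attention
    # distributions (rows assumed to sum to 1).
    attn_kl = F.kl_div(s_attn.clamp_min(1e-8).log(), t_attn,
                       reduction="batchmean")
    # Classic soft-label distillation on the output logits.
    logit_kl = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                        F.softmax(t_logits / T, dim=-1),
                        reduction="batchmean") * T * T
    return attn_kl + logit_kl
```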
- Scene Text Image Super-Resolution in the Wild [112.90416737357141]
Low-resolution text images are often seen in natural scenes such as documents captured by mobile phones.
Previous single image super-resolution (SISR) methods are trained on synthetic low-resolution images.
We propose a real scene text SR dataset, termed TextZoom.
It contains paired real low-resolution and high-resolution images captured by cameras with different focal lengths in the wild.
arXiv Detail & Related papers (2020-05-07T09:18:59Z)