Augmented Transformers with Adaptive n-grams Embedding for Multilingual
Scene Text Recognition
- URL: http://arxiv.org/abs/2302.14261v1
- Date: Tue, 28 Feb 2023 02:37:30 GMT
- Title: Augmented Transformers with Adaptive n-grams Embedding for Multilingual
Scene Text Recognition
- Authors: Xueming Yan, Zhihang Fang, Yaochu Jin
- Abstract summary: This paper proposes an augmented transformer architecture with n-grams embedding and cross-language rectification (TANGER).
TANGER consists of a primary transformer with single patch embeddings of visual images, and a supplementary transformer with adaptive n-grams embeddings.
Cross-language rectification is achieved with a loss function that takes into account both language identification and contextual coherence scoring.
- Score: 10.130342722193204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While vision transformers have been highly successful in improving the
performance in image-based tasks, not much work has been reported on applying
transformers to multilingual scene text recognition due to the complexities in
the visual appearance of multilingual texts. To fill the gap, this paper
proposes an augmented transformer architecture with n-grams embedding and
cross-language rectification (TANGER). TANGER consists of a primary transformer
with single patch embeddings of visual images, and a supplementary transformer
with adaptive n-grams embeddings that aims to flexibly explore the potential
correlations between neighbouring visual patches, which is essential for
feature extraction from multilingual scene texts. Cross-language rectification
is achieved with a loss function that takes into account both language
identification and contextual coherence scoring. Extensive comparative studies
are conducted on four widely used benchmark datasets as well as a new
multilingual scene text dataset containing Indonesian, English, and Chinese
collected from tourism scenes in Indonesia. Our experimental results
demonstrate that TANGER performs considerably better than the state of the
art, especially in handling complex multilingual scene texts.
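As a rough illustration of the architecture described above, the sketch below wires up a primary transformer over single-patch embeddings and a supplementary transformer over n-gram groupings of neighbouring patches, with a learned gate over n-gram sizes standing in for the adaptive weighting. The additive fusion, the loss weights, and all class and parameter names are assumptions made for illustration, not details taken from the paper.

```python
# Illustrative sketch of a TANGER-style two-branch encoder; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NGramEmbedding(nn.Module):
    """Group n neighbouring patch tokens with 1-D convolutions and mix the
    n-gram scales with a learned (adaptive) gate."""
    def __init__(self, dim, ngram_sizes=(2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=n, padding=n // 2) for n in ngram_sizes
        )
        self.gate = nn.Linear(dim, len(ngram_sizes))

    def forward(self, x):                                  # x: (B, N, D) patch tokens
        h = x.transpose(1, 2)                              # (B, D, N) for Conv1d
        grams = [c(h)[..., : x.size(1)].transpose(1, 2) for c in self.convs]
        w = F.softmax(self.gate(x.mean(dim=1)), dim=-1)    # (B, K) weights over n-gram sizes
        return sum(w[:, k, None, None] * g for k, g in enumerate(grams))

class TangerSketch(nn.Module):
    """Primary transformer over single-patch embeddings plus a supplementary
    transformer over adaptive n-gram embeddings, fused additively (assumption)."""
    def __init__(self, dim=256, vocab=8000, num_langs=3, depth=6, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.primary = nn.TransformerEncoder(layer, depth)
        self.supplementary = nn.TransformerEncoder(layer, depth)
        self.ngram_embed = NGramEmbedding(dim)
        self.char_head = nn.Linear(dim, vocab)              # character recognition
        self.lang_head = nn.Linear(dim, num_langs)          # language identification

    def forward(self, patches):                             # patches: (B, N, D) projected image patches
        f_single = self.primary(patches)
        f_ngram = self.supplementary(self.ngram_embed(patches))
        fused = f_single + f_ngram
        char_logits = self.char_head(fused)                 # per-position character logits
        lang_logits = self.lang_head(fused.mean(dim=1))     # sequence-level language logits
        return char_logits, lang_logits

# Cross-language rectification (sketch): the total loss would combine recognition,
# language identification, and a contextual-coherence term, e.g.
#   loss = ce(char_logits, chars) + a * ce(lang_logits, lang_id) + b * coherence_penalty
# where a, b and the coherence scoring are not specified by the abstract.
```

The cross-language rectification term is only indicated in comments, since the abstract does not spell out how contextual coherence is scored.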
Related papers
- Orientation-Independent Chinese Text Recognition in Scene Images [61.34060587461462]
We make the first attempt to extract orientation-independent visual features by disentangling the content and orientation information of text images.
Specifically, we introduce a Character Image Reconstruction Network (CIRN) to recover corresponding printed character images with disentangled content and orientation information.
arXiv Detail & Related papers (2023-09-03T05:30:21Z)
- Story Visualization by Online Text Augmentation with Context Memory [64.86944645907771]
We propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation.
The proposed method significantly outperforms the state of the art on various metrics including FID, character F1, frame accuracy, BLEU-2/3, and R-precision.
arXiv Detail & Related papers (2023-08-15T05:08:12Z)
- Show Me the World in My Language: Establishing the First Baseline for Scene-Text to Scene-Text Translation [1.9085074258303771]
We study the task of "visually" translating scene text from a source language to a target language.
Visual translation involves not just the recognition and translation of scene text but also the generation of the translated image.
We present a cascaded framework for visual translation that combines state-of-the-art modules for scene text recognition, machine translation, and scene text synthesis.
arXiv Detail & Related papers (2023-08-06T05:23:25Z)
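The cascaded framework in the entry above can be read as three stages chained together. The sketch below only illustrates that data flow; recognize_text, translate, and render_text are hypothetical stand-ins for the recognition, machine-translation, and scene-text-synthesis modules the paper composes.

```python
# Illustrative data flow for a cascaded scene-text translation pipeline (sketch).
from dataclasses import dataclass

@dataclass
class TextRegion:
    bbox: tuple   # (x1, y1, x2, y2) region of the detected text in image coordinates
    text: str     # text string associated with the region

def visual_translate(image, recognize_text, translate, render_text, tgt_lang="en"):
    # 1) scene text recognition: locate and read the source-language text
    regions = recognize_text(image)
    # 2) machine translation of each recognised string
    translated = [TextRegion(r.bbox, translate(r.text, tgt_lang)) for r in regions]
    # 3) scene-text synthesis: render the translated strings back into the image
    return render_text(image, translated)
```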
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding [73.24847320536813]
This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders.
Our framework is inspired by the success of cross-modal encoders in vision-language tasks, while the learning objective is altered to suit the language-heavy characteristics of NLU.
arXiv Detail & Related papers (2022-04-15T03:44:00Z)
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate information from local neighbors.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
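A minimal sketch of one gated graph-convolution step of the kind mentioned above: each detected region aggregates messages from its graph neighbours and absorbs them through a learned gate. The exact message and gating functions are assumptions.

```python
import torch
import torch.nn as nn

class GatedGCNLayer(nn.Module):
    """One gated graph-convolution step over object regions (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.message = nn.Linear(dim, dim)        # transforms neighbour features into messages
        self.gate = nn.Linear(2 * dim, dim)       # decides how much of each message to absorb

    def forward(self, x, adj):
        # x: (B, N, D) region features; adj: (B, N, N) semantic-graph adjacency (row-normalised)
        msgs = torch.bmm(adj, self.message(x))                        # aggregate neighbour messages
        g = torch.sigmoid(self.gate(torch.cat([x, msgs], dim=-1)))    # per-feature gate in [0, 1]
        return x + g * msgs                                           # selectively add neighbour info
```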
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
This effectively avoids the degenerate case of predicting masked words conditioned only on the context of their own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
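A minimal sketch of an encoder layer with an extra cross-attention step over the other language's hidden states, in the spirit of the module described above; the layer layout, normalisation placement, and dimensions are assumptions.

```python
import torch.nn as nn

class CrossLingualEncoderLayer(nn.Module):
    """Self-attention layer extended with cross-attention over the other language (sketch)."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, other_lang):
        # Usual self-attention within the current language.
        x = self.norm1(x + self.self_attn(x, x, x, need_weights=False)[0])
        # Cross-attention: queries from this language, keys/values from the other language,
        # so masked words are not predicted from their own-language context alone.
        x = self.norm2(x + self.cross_attn(x, other_lang, other_lang, need_weights=False)[0])
        return self.norm3(x + self.ffn(x))
```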
- Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation [59.38247587308604]
We introduce a novel transformer-based architecture that jointly learns Continuous Sign Language Recognition and Translation.
We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T dataset.
Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models.
arXiv Detail & Related papers (2020-03-30T21:35:09Z)
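Joint recognition and translation of the kind described in the entry above is commonly set up as a shared video encoder feeding both a CTC-style gloss-recognition head and an autoregressive translation decoder, trained under one objective. The sketch below follows that pattern; the decoder interface and the loss weighting are assumptions.

```python
import torch.nn as nn

class JointSignModel(nn.Module):
    """Shared video encoder with a recognition head and a translation decoder (sketch)."""
    def __init__(self, encoder, decoder, feat_dim, gloss_vocab):
        super().__init__()
        self.encoder = encoder                                   # spatio-temporal video encoder
        self.gloss_head = nn.Linear(feat_dim, gloss_vocab + 1)   # +1 for the CTC blank symbol
        self.decoder = decoder                                   # autoregressive text decoder

    def forward(self, frames, tgt_tokens):
        memory = self.encoder(frames)                    # (B, T, D) frame-level features
        gloss_logits = self.gloss_head(memory)           # recognition branch (scored with CTC)
        text_logits = self.decoder(tgt_tokens, memory)   # translation branch attends over memory
        return gloss_logits, text_logits

# Joint objective (weights illustrative):
#   loss = ctc_loss(gloss_logits, glosses) + lambda_t * cross_entropy(text_logits, tgt_tokens)
```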
This list is automatically generated from the titles and abstracts of the papers on this site.