Harnessing the Power of Multi-Lingual Datasets for Pre-training: Towards
Enhancing Text Spotting Performance
- URL: http://arxiv.org/abs/2310.00917v4
- Date: Wed, 1 Nov 2023 09:29:13 GMT
- Title: Harnessing the Power of Multi-Lingual Datasets for Pre-training: Towards
Enhancing Text Spotting Performance
- Authors: Alloy Das, Sanket Biswas, Ayan Banerjee, Josep Lladós, Umapada Pal,
and Saumik Bhattacharya
- Abstract summary: The adaptation capability to a wide range of domains is crucial for scene text spotting models when deployed to real-world conditions.
Here, we investigate the problem of domain-adaptive scene text spotting, i.e., training a model on multi-domain source data so that it can directly adapt to target domains.
The results clearly demonstrate the potential of intermediate representations to achieve significant performance on text spotting benchmarks across multiple domains.
- Score: 15.513912470752041
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The adaptation capability to a wide range of domains is crucial for scene
text spotting models when deployed to real-world conditions. However, existing
state-of-the-art (SOTA) approaches usually incorporate scene text detection and
recognition simply by pretraining on natural scene text datasets, which do not
directly exploit the intermediate feature representations between multiple
domains. Here, we investigate the problem of domain-adaptive scene text
spotting, i.e., training a model on multi-domain source data such that it can
directly adapt to target domains rather than being specialized for a specific
domain or scenario. Further, we investigate a transformer baseline called
Swin-TESTR to focus on solving scene-text spotting for both regular and
arbitrary-shaped scene text along with an exhaustive evaluation. The results
clearly demonstrate the potential of intermediate representations to achieve
significant performance on text spotting benchmarks across multiple domains
(e.g., language, synth-to-real, and documents), both in terms of accuracy and efficiency.
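To make the multi-domain pre-training setup concrete, the following PyTorch-style sketch mixes several source-domain loaders in one training loop. The model interface (a callable returning a dict of per-task losses), the loader names, and all hyperparameters are hypothetical placeholders, not the authors' released Swin-TESTR code.

```python
# Minimal sketch of multi-domain source pre-training for a text spotter.
# Assumptions: a generic model that returns a dict of per-task losses, and
# one DataLoader per source domain; none of these names come from the paper.
import torch


def endless(loader):
    """Re-iterate a DataLoader forever so small domains keep contributing."""
    while True:
        for batch in loader:
            yield batch


def multi_domain_pretrain(model, domain_loaders, steps=10_000,
                          lr=1e-4, device="cuda"):
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    iters = {name: endless(dl) for name, dl in domain_loaders.items()}
    domains = list(iters)

    for step in range(steps):
        # Round-robin over source domains: every domain contributes equally,
        # which encourages domain-agnostic intermediate representations.
        name = domains[step % len(domains)]
        images, targets = next(iters[name])
        losses = model(images.to(device), targets)  # e.g. {"det": ..., "rec": ...}
        loss = sum(losses.values())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```

In practice one would also weight the detection and recognition losses and validate on held-out target domains; the sketch only shows the domain-mixing loop itself.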
Related papers
- WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization [63.98650220772378]
We present WIDIn, Wording Images for Domain-Invariant representation, to disentangle discriminative visual representation.
We first estimate the language embedding with fine-grained alignment, which can be used to adaptively identify and then remove the domain-specific counterpart.
We show that WIDIn can be applied to both pretrained vision-language models like CLIP, and separately trained uni-modal models like MoCo and BERT.
arXiv Detail & Related papers (2024-05-28T17:46:27Z)
- Unified Language-driven Zero-shot Domain Adaptation [55.64088594551629]
Unified Language-driven Zero-shot Domain Adaptation (ULDA) is a novel task setting.
It enables a single model to adapt to diverse target domains without explicit domain-ID knowledge.
arXiv Detail & Related papers (2024-04-10T16:44:11Z)
- Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings.
We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features.
Our approach achieves favorable performance against existing methods in the literature.
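The summary above only names a contrastive mechanism guided by text-encoder features; one common way to realise that idea is a CLIP-style InfoNCE loss between image features and (typically frozen) text embeddings, sketched below with assumed tensor shapes and temperature.

```python
# Generic image-text contrastive (InfoNCE) loss, as one possible reading of
# "contrastive learning guided by the text encoder features".
import torch
import torch.nn.functional as F


def text_guided_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """image_feats, text_feats: (B, D) features for B paired samples."""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric cross-entropy: each image matches its own text and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```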
arXiv Detail & Related papers (2024-04-01T17:48:15Z)
- Diving into the Depths of Spotting Text in Multi-Domain Noisy Scenes [11.478236584340255]
We present a text spotting validation benchmark called Under-Water Text (UWT) for noisy underwater scenes.
We also design an efficient super-resolution based end-to-end transformer baseline called DA-TextSpotter.
The dataset, code and pre-trained models will be released upon acceptance.
arXiv Detail & Related papers (2023-10-01T03:27:41Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
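The joint optimization of classification, segmentation, and recognition branches over shared features can be pictured with a generic multi-head module; the layer sizes and head designs below are illustrative assumptions, not TextFormer's actual architecture.

```python
# Illustrative multi-branch head over shared features, loosely mirroring
# "mutual training of classification, segmentation, and recognition branches".
# All dimensions are made up for the sketch.
import torch.nn as nn


class MultiTaskSpottingHead(nn.Module):
    def __init__(self, feat_dim=256, num_classes=2, vocab_size=97, max_len=25):
        super().__init__()
        self.max_len = max_len
        self.cls_head = nn.Linear(feat_dim, num_classes)          # text / no-text
        self.seg_head = nn.Conv2d(feat_dim, 1, kernel_size=1)     # text mask
        self.rec_head = nn.Linear(feat_dim, max_len * vocab_size) # characters

    def forward(self, query_feats, fpn_feats):
        # query_feats: (B, Q, D) decoder queries; fpn_feats: (B, D, H, W).
        cls_logits = self.cls_head(query_feats)                   # (B, Q, C)
        seg_logits = self.seg_head(fpn_feats)                     # (B, 1, H, W)
        rec_logits = self.rec_head(query_feats).view(
            *query_feats.shape[:2], self.max_len, -1)             # (B, Q, T, V)
        # Training would sum per-branch losses so every head shapes the
        # shared encoder/decoder features.
        return cls_logits, seg_logits, rec_logits
```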
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Domain Adaptive Scene Text Detection via Subcategorization [45.580559833129165]
We study domain adaptive scene text detection, a largely neglected yet very meaningful task.
We design SCAST, a subcategory-aware self-training technique that mitigates the network overfitting and noisy pseudo labels.
SCAST achieves superior detection performance consistently across multiple public benchmarks.
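The subcategory-aware details are not given in the summary above, but the generic self-training loop that SCAST builds on (predict on the target domain, keep confident pseudo labels, retrain on the mix) is sketched here; the `predict`/`detection_loss` interface and the confidence threshold are assumptions.

```python
# Bare-bones self-training loop for domain-adaptive detection.
# SCAST additionally groups pseudo boxes into subcategories to reduce
# label noise; that refinement is omitted here.
import torch


@torch.no_grad()
def make_pseudo_labels(model, target_images, conf_thresh=0.8):
    model.eval()
    pseudo = []
    for img in target_images:
        boxes, scores = model.predict(img)       # hypothetical API
        keep = scores >= conf_thresh             # keep only confident boxes
        pseudo.append((img, boxes[keep]))
    return pseudo


def self_training_round(model, source_data, target_images, optimizer):
    pseudo = make_pseudo_labels(model, target_images)
    model.train()
    # Mix labelled source data with pseudo-labelled target data.
    for img, boxes in source_data + pseudo:
        loss = model.detection_loss(img, boxes)  # hypothetical API
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```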
arXiv Detail & Related papers (2022-12-01T09:15:43Z)
- FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine Translation [53.87731008029645]
We present a real-world fine-grained domain adaptation task in machine translation (FDMT).
The FDMT dataset consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smartphones.
We conduct quantitative experiments and in-depth analyses in this new setting, which benchmarks the fine-grained domain adaptation task.
arXiv Detail & Related papers (2020-12-31T17:15:09Z)
- Contextual-Relation Consistent Domain Adaptation for Semantic Segmentation [44.19436340246248]
This paper presents an innovative local contextual-relation consistent domain adaptation technique.
It aims to achieve local-level consistencies during the global-level alignment.
Experiments demonstrate its superior segmentation performance as compared with state-of-the-art methods.
arXiv Detail & Related papers (2020-07-05T19:00:46Z)
- Text Recognition in Real Scenarios with a Few Labeled Samples [55.07859517380136]
Scene text recognition (STR) remains an active research topic in computer vision.
This paper proposes a few-shot adversarial sequence domain adaptation (FASDA) approach to achieve sequence-level domain adaptation.
Our approach can maximize the character-level confusion between the source domain and the target domain.
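Character-level confusion between source and target is usually obtained with an adversarial objective; a standard gradient-reversal layer (as in DANN-style adaptation) applied to per-character features is sketched below. Whether FASDA uses exactly this operator is not stated in the summary, so treat the sketch as an assumption.

```python
# Gradient reversal layer: forward is the identity, backward flips the
# gradient, so a domain classifier trained on top pushes the feature
# extractor toward domain-confusable (here, character-level) features.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class DomainDiscriminator(nn.Module):
    """Classifies each character feature as source (0) or target (1)."""
    def __init__(self, feat_dim=256, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 2))

    def forward(self, char_feats):                 # (N_chars, D)
        reversed_feats = GradReverse.apply(char_feats, self.lambd)
        return self.net(reversed_feats)            # domain logits per character
```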
arXiv Detail & Related papers (2020-06-22T13:03:01Z)