SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting
- URL: http://arxiv.org/abs/2504.09966v1
- Date: Mon, 14 Apr 2025 08:09:17 GMT
- Title: SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting
- Authors: Dongliang Luo, Hanshen Zhu, Ziyang Zhang, Dingkang Liang, Xudong Xie, Yuliang Liu, Xiang Bai
- Abstract summary: We propose a new Semi-supervised framework for End-to-end Text Spotting, namely SemiETS. Specifically, it gradually generates reliable hierarchical pseudo labels for each task, thereby reducing noisy labels. It extracts important information in locations and transcriptions from bidirectional flows to improve consistency.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most previous scene text spotting methods rely on high-quality manual annotations to achieve promising performance. To reduce these expensive costs, we study semi-supervised text spotting (SSTS) to exploit useful information from unlabeled images. However, directly applying existing semi-supervised methods for general scenes to SSTS faces new challenges: 1) inconsistent pseudo labels between the detection and recognition tasks, and 2) sub-optimal supervision caused by inconsistency between teacher and student. Thus, we propose a new Semi-supervised framework for End-to-end Text Spotting, namely SemiETS, that leverages the complementarity of text detection and recognition. Specifically, it gradually generates reliable hierarchical pseudo labels for each task, thereby reducing label noise. Meanwhile, it extracts important location and transcription information from bidirectional flows to improve consistency. Extensive experiments on three datasets under various settings demonstrate the effectiveness of SemiETS on arbitrary-shaped text. For example, it outperforms previous state-of-the-art SSL methods by a large margin on end-to-end spotting (+8.7%, +5.6%, and +2.6% H-mean under the 0.5%, 1%, and 2% labeled-data settings on Total-Text, respectively). More importantly, it still improves upon a strongly supervised text spotter trained with plenty of labeled data by 2.0%. Its compelling domain adaptation ability shows practical potential. Moreover, our method demonstrates consistent improvements across different text spotters.
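The hierarchical pseudo-labeling idea in the abstract lends itself to a short illustration. The sketch below is a hypothetical rendition, not the authors' released code: a teacher spotter emits detection polygons with confidences and transcriptions with per-character confidences, and pseudo labels are admitted per task, so a confident box can supervise detection even when its transcription is too uncertain to supervise recognition. The `TextInstance` structure and both thresholds are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TextInstance:
    polygon: List[Tuple[float, float]]  # teacher-predicted text-region polygon
    det_score: float                    # detection confidence
    text: str                           # teacher-predicted transcription
    rec_scores: List[float]             # per-character recognition confidences

def hierarchical_pseudo_labels(preds: List[TextInstance],
                               det_thresh: float = 0.7,
                               rec_thresh: float = 0.9):
    """Assign pseudo labels per task: a box may supervise detection
    even when its transcription is too uncertain to supervise recognition."""
    det_labels, rec_labels = [], []
    for p in preds:
        if p.det_score < det_thresh:
            continue                      # too unreliable for any task
        det_labels.append(p.polygon)      # good enough for detection
        # transcription is admitted only if every character is confident
        if p.rec_scores and min(p.rec_scores) >= rec_thresh:
            rec_labels.append((p.polygon, p.text))
    return det_labels, rec_labels
```

Keeping the two label sets separate reflects the paper's first challenge: forcing detection and recognition to share a single confidence gate produces inconsistent pseudo labels.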
Related papers
- 3DResT: A Strong Baseline for Semi-Supervised 3D Referring Expression Segmentation
3D Referring Expression (3D-RES) typically requires extensive instance-level annotations, which are time-consuming and costly.
Semi-supervised learning (SSL) mitigates this by using limited labeled data alongside abundant unlabeled data, improving performance while reducing annotation costs.
In this paper, we introduce the first semi-supervised learning framework for 3D-RES, presenting a robust baseline method named 3DResT.
arXiv Detail & Related papers (2025-04-17T02:50:52Z)
- Exploring Scene Affinity for Semi-Supervised LiDAR Semantic Segmentation
This paper explores scene affinity, namely intra-scene consistency and inter-scene correlation, for semi-supervised LiDAR semantic segmentation in driving scenes. Adopting teacher-student training, AIScene employs a teacher network to generate pseudo-labeled scenes from unlabeled data, which then supervise the student network's learning. Experiments show that AIScene outperforms previous methods on two popular benchmarks across four settings, achieving notable improvements of 1.9% and 2.1% under the most challenging 1% labeled-data setting.
arXiv Detail & Related papers (2024-08-21T02:03:03Z)
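The teacher-student training that AIScene (above) adopts typically follows the mean-teacher pattern, in which the teacher is an exponential moving average (EMA) of the student and therefore yields more stable pseudo labels. Below is a generic PyTorch-style sketch of that update, an assumption about the general recipe rather than the paper's code; the momentum value is illustrative.

```python
import torch

@torch.no_grad()
def update_teacher(teacher: torch.nn.Module,
                   student: torch.nn.Module,
                   momentum: float = 0.999):
    """EMA update: the teacher slowly tracks the student's weights,
    yielding more stable pseudo labels than the raw student would."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
```

In this scheme, `update_teacher(teacher, student)` would be called once per optimization step, immediately after the student's gradient update.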
- Sequential Visual and Semantic Consistency for Semi-supervised Text Recognition
Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training.
Most existing STR methods resort to synthetic data, which may introduce domain discrepancy and degrade the performance of STR models.
This paper proposes a novel semi-supervised learning method for STR that incorporates word-level consistency regularization from both visual and semantic aspects.
arXiv Detail & Related papers (2024-02-24T13:00:54Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision
We propose TextFormer, a query-based end-to-end text spotter with a Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
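TextFormer's mutual training of classification, segmentation, and recognition branches (above) boils down to back-propagating a weighted sum of task losses through a shared encoder. The skeleton below sketches that structure under assumed module names, loss choices, and weights; the real model's heads and criteria differ in detail.

```python
import torch
import torch.nn as nn

class MultiTaskSpotter(nn.Module):
    """Shared image encoder feeding three task branches, so gradients
    from every task shape the common representation."""
    def __init__(self, encoder: nn.Module, cls_head: nn.Module,
                 seg_head: nn.Module, rec_head: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.cls_head, self.seg_head, self.rec_head = cls_head, seg_head, rec_head

    def forward(self, images: torch.Tensor):
        feats = self.encoder(images)
        return self.cls_head(feats), self.seg_head(feats), self.rec_head(feats)

def joint_loss(outputs, targets, weights=(1.0, 1.0, 1.0)):
    # Hypothetical weighting and criteria; a real recognizer would
    # typically use a sequence loss (e.g., CTC) instead.
    cls_out, seg_out, rec_out = outputs
    cls_t, seg_t, rec_t = targets
    return (weights[0] * nn.functional.cross_entropy(cls_out, cls_t)
            + weights[1] * nn.functional.binary_cross_entropy_with_logits(seg_out, seg_t)
            + weights[2] * nn.functional.cross_entropy(rec_out, rec_t))
```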
- SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition
We propose a new end-to-end scene text spotting framework termed SwinTextSpotter.
Using a Transformer with a dynamic head as the detector, we unify the two tasks with a novel Recognition Conversion mechanism.
The design results in a concise framework that requires neither an additional rectification module nor character-level annotations.
arXiv Detail & Related papers (2022-03-19T01:14:42Z)
- Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer
We introduce TextTranSpotter (TTS), a transformer-based approach for text spotting.
TTS can be trained in both fully- and weakly-supervised settings.
When trained in a fully-supervised manner, TextTranSpotter shows state-of-the-art results on multiple benchmarks.
arXiv Detail & Related papers (2022-02-11T08:50:09Z)
- WSSOD: A New Pipeline for Weakly- and Semi-Supervised Object Detection
We propose a weakly- and semi-supervised object detection framework (WSSOD).
An agent detector is first trained on a joint dataset and then used to predict pseudo bounding boxes on weakly-annotated images.
The proposed framework demonstrates remarkable performance on the PASCAL-VOC and MSCOCO benchmarks, comparable to that obtained in fully-supervised settings.
arXiv Detail & Related papers (2021-05-21T11:58:50Z)
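The middle step of the WSSOD pipeline above, where an agent detector predicts pseudo bounding boxes on weakly annotated images, can be pictured as a filter that keeps only confident boxes whose class also appears in the image-level tags. The data layout and threshold below are assumptions for illustration, not the paper's implementation.

```python
from typing import Dict, List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def pseudo_boxes_for_image(detections: List[Dict],
                           image_tags: Set[str],
                           score_thresh: float = 0.8) -> List[Dict]:
    """Keep agent-detector boxes that are confident AND consistent with
    the image-level (weak) labels; everything else is discarded."""
    kept = []
    for det in detections:  # det: {"box": Box, "label": str, "score": float}
        if det["score"] >= score_thresh and det["label"] in image_tags:
            kept.append({"box": det["box"], "label": det["label"]})
    return kept
```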
- Weakly-Supervised Arbitrary-Shaped Text Detection with Expectation-Maximization Algorithm
We study weakly-supervised arbitrary-shaped text detection for combining various weak supervision forms.
We propose an Expectation-Maximization (EM) based weakly-supervised learning framework to train an accurate arbitrary-shaped text detector.
Our method yields comparable performance to state-of-the-art methods on three benchmarks.
arXiv Detail & Related papers (2020-12-01T11:45:39Z)
- Text Recognition -- Real World Data and Where to Find Them
We present a method for exploiting weakly annotated images to improve text extraction pipelines.
The approach uses an arbitrary end-to-end text recognition system to obtain text region proposals and their, possibly erroneous, transcriptions.
It produces nearly error-free, localised instances of scene text, which we treat as "pseudo ground truth" (PGT).
arXiv Detail & Related papers (2020-07-06T22:23:27Z)
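The "pseudo ground truth" selection described above hinges on keeping only proposals whose possibly erroneous transcription agrees closely with the weak annotations. Here is a minimal sketch, assuming the weak labels are a bag of words known to occur in the image and using edit distance as the agreement test (both are assumptions, not the paper's exact criterion):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def select_pgt(proposals, weak_words, max_dist: int = 1):
    """Keep (region, transcription) pairs whose text is within max_dist
    edits of some weakly annotated word; these become pseudo ground truth."""
    pgt = []
    for region, text in proposals:
        if any(edit_distance(text.lower(), w.lower()) <= max_dist
               for w in weak_words):
            pgt.append((region, text))
    return pgt
```

Tightening max_dist toward 0 trades recall for the near-error-free precision that makes PGT usable as training data.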