Pushing the Performance Limit of Scene Text Recognizer without Human
Annotation
- URL: http://arxiv.org/abs/2204.07714v1
- Date: Sat, 16 Apr 2022 04:42:02 GMT
- Title: Pushing the Performance Limit of Scene Text Recognizer without Human
Annotation
- Authors: Caiyuan Zheng, Hui Li, Seon-Min Rhee, Seungju Han, Jae-Joon Han, Peng
Wang
- Abstract summary: We aim to boost STR models by leveraging both synthetic data and the numerous real unlabeled images.
A character-level consistency regularization is designed to mitigate the misalignment between characters in sequence recognition.
- Score: 17.092815629040388
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scene text recognition (STR) has attracted much attention over the
years because of its wide applications. Most methods train STR models in a fully
supervised manner, which requires large amounts of labeled data. Although
synthetic data contributes a lot to STR, it suffers from the real-to-synthetic
domain gap that restricts model performance. In this work, we aim to boost STR
models by leveraging both synthetic data and the numerous real unlabeled images,
entirely exempting the cost of human annotation. A robust consistency regularization
based semi-supervised framework is proposed for STR, which can effectively
solve the instability issue due to domain inconsistency between synthetic and
real images. A character-level consistency regularization is designed to
mitigate the misalignment between characters in sequence recognition. Extensive
experiments on standard text recognition benchmarks demonstrate the
effectiveness of the proposed method. It can steadily improve existing STR
models and boost an STR model to achieve new state-of-the-art results. To the
best of our knowledge, this is the first consistency-regularization-based
framework that applies successfully to STR.
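The character-level consistency regularization described in the abstract can be illustrated as a masked per-character agreement loss between two views of the same unlabeled image. The following is a minimal sketch, not the paper's exact formulation: the function name, confidence threshold, and the use of argmax pseudo-labels from a teacher view are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def char_consistency_loss(student_logits, teacher_logits, conf_thresh=0.5):
    """Character-level consistency regularization (minimal sketch).

    Both inputs have shape (batch, seq_len, num_classes): per-character
    logits from a weakly augmented view (teacher) and a strongly augmented
    view (student) of the same unlabeled image. Each character position is
    aligned independently, and low-confidence teacher characters are masked
    out to reduce noise from the synthetic-to-real domain gap.
    """
    t = softmax(teacher_logits)        # teacher is a fixed target (no gradient)
    conf = t.max(axis=-1)              # per-character confidence
    pseudo = t.argmax(axis=-1)         # per-character pseudo-label
    mask = conf > conf_thresh          # keep only confident characters
    s = softmax(student_logits)
    b, l, _ = s.shape
    # cross-entropy of student predictions against teacher pseudo-labels
    ce = -np.log(s[np.arange(b)[:, None], np.arange(l)[None, :], pseudo] + 1e-12)
    return float((ce * mask).sum() / max(mask.sum(), 1))
```

In practice such a loss would be added to the supervised loss on synthetic data, with the teacher typically an EMA copy of the student; here both are plain logit arrays to keep the sketch self-contained.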
Related papers
- Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing [71.29488677105127]
Existing scene text recognition (STR) methods struggle to recognize challenging texts, especially for artistic and severely distorted characters.
We propose a contrastive learning-based STR framework by leveraging synthetic and real unlabeled data without any human cost.
Our method achieves SOTA performance (94.7% and 70.9% average accuracy on common benchmarks and Union14M-Benchmark, respectively).
arXiv Detail & Related papers (2024-11-23T15:24:47Z)
- Sequential Visual and Semantic Consistency for Semi-supervised Text Recognition [56.968108142307976]
Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training.
Most existing STR methods resort to synthetic data, which may introduce domain discrepancy and degrade the performance of STR models.
This paper proposes a novel semi-supervised learning method for STR that incorporates word-level consistency regularization from both visual and semantic aspects.
arXiv Detail & Related papers (2024-02-24T13:00:54Z)
- SVIPTR: Fast and Efficient Scene Text Recognition with Vision Permutable Extractor [32.29602765394547]
Scene Text Recognition is an important and challenging upstream task for building structured information databases.
Current state-of-the-art (SOTA) models for STR exhibit high performance, but suffer from low inference efficiency.
We propose a VIsion Permutable extractor for fast and efficient Scene Text Recognition.
arXiv Detail & Related papers (2024-01-18T16:27:09Z)
- Multi-Granularity Prediction with Learnable Fusion for Scene Text Recognition [20.48454415635795]
Scene text recognition (STR) has been an active research topic in computer vision for years.
To tackle this tough problem, numerous innovative methods have been proposed, and incorporating linguistic knowledge into STR models has recently become a prominent trend.
In this work, we first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet functionally powerful vision STR model.
It already outperforms most previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods.
arXiv Detail & Related papers (2023-07-25T04:12:50Z)
- Text is Text, No Matter What: Unifying Text Recognition using Knowledge Distillation [41.43280922432707]
We argue for their unification -- we aim for a single model that can compete favourably with two separate state-of-the-art STR and HTR models.
We first show that cross-utilisation of STR and HTR models triggers significant performance drops due to differences in their inherent challenges.
We then tackle their union by introducing a knowledge distillation (KD) based framework.
arXiv Detail & Related papers (2021-07-26T10:10:34Z)
- Enhancing the Generalization for Intent Classification and Out-of-Domain Detection in SLU [70.44344060176952]
Intent classification is a major task in spoken language understanding (SLU).
Recent works have shown that using extra data and labels can improve the OOD detection performance.
This paper proposes to train a model with only IND data while supporting both IND intent classification and OOD detection.
arXiv Detail & Related papers (2021-06-28T08:27:38Z)
- What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels [53.51264148594141]
Scene text recognition (STR) task has a common practice: All state-of-the-art STR models are trained on large synthetic data.
Training STR models on real data is nearly impossible because real data is insufficient.
We show that we can train STR models satisfactorily only with real labeled data.
arXiv Detail & Related papers (2021-03-07T17:05:54Z)
- Text Recognition in Real Scenarios with a Few Labeled Samples [55.07859517380136]
Scene text recognition (STR) is still a hot research topic in the computer vision field.
This paper proposes a few-shot adversarial sequence domain adaptation (FASDA) approach to build sequence adaptation.
Our approach can maximize the character-level confusion between the source domain and the target domain.
arXiv Detail & Related papers (2020-06-22T13:03:01Z)
- AutoSTR: Efficient Backbone Search for Scene Text Recognition [80.7290173000068]
Scene text recognition (STR) is very challenging due to the diversity of text instances and the complexity of scenes.
We propose automated STR (AutoSTR) to search data-dependent backbones to boost text recognition performance.
Experiments demonstrate that, by searching data-dependent backbones, AutoSTR can outperform the state-of-the-art approaches on standard benchmarks.
arXiv Detail & Related papers (2020-03-14T06:51:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.