Efficiently Leveraging Linguistic Priors for Scene Text Spotting
- URL: http://arxiv.org/abs/2402.17134v1
- Date: Tue, 27 Feb 2024 01:57:09 GMT
- Title: Efficiently Leveraging Linguistic Priors for Scene Text Spotting
- Authors: Nguyen Nguyen, Yapeng Tian, Chenliang Xu
- Abstract summary: This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
- Score: 63.22351047545888
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Incorporating linguistic knowledge can improve scene text recognition, but it
is questionable whether the same holds for scene text spotting, which typically
involves text detection and recognition. This paper proposes a method that
leverages linguistic knowledge from a large text corpus to replace the
traditional one-hot encoding used in auto-regressive scene text spotting and
recognition models. This allows the model to capture the relationship between
characters in the same word. Additionally, we introduce a technique to generate
text distributions that align well with scene text datasets, removing the need
for in-domain fine-tuning. As a result, the newly created text distributions
are more informative than pure one-hot encoding, leading to improved spotting
and recognition performance. Our method is simple and efficient, and it can
easily be integrated into existing auto-regressive-based approaches.
Experimental results show that our method not only improves recognition
accuracy but also enables more accurate localization of words. It significantly
improves both state-of-the-art scene text spotting and recognition pipelines,
achieving state-of-the-art results on several benchmarks.
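The core idea is easy to sketch. Below is a minimal illustration (not the authors' code): soft character targets are built by mixing the one-hot gold label with a bigram distribution estimated from a large text corpus, and training uses a soft cross-entropy. The bigram table, the mixing weight `alpha`, and all function names are assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

def soft_targets(prev_ids, gold_ids, bigram_counts, alpha=0.8):
    """Mix one-hot gold labels with corpus bigram distributions.

    prev_ids, gold_ids: (B,) previous / current character indices
    bigram_counts:      (V, V) corpus counts of `cur` following `prev`
    returns:            (B, V) soft target distributions
    """
    vocab = bigram_counts.size(0)
    one_hot = F.one_hot(gold_ids, vocab).float()
    bigram = bigram_counts[prev_ids].float()
    bigram = bigram / bigram.sum(dim=-1, keepdim=True).clamp(min=1.0)
    return alpha * one_hot + (1.0 - alpha) * bigram

def soft_cross_entropy(logits, targets):
    # Reduces to standard cross-entropy when `targets` is one-hot.
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```

Per the abstract, the paper additionally aligns these distributions with scene text statistics so that no in-domain fine-tuning is needed; the fixed bigram table above merely stands in for that step.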
Related papers
- LOGO: Video Text Spotting with Language Collaboration and Glyph Perception Model [20.007650672107566]
Video text spotting (VTS) aims to simultaneously localize, recognize and track text instances in videos.
Recent methods directly track the zero-shot results of state-of-the-art image text spotters.
Fine-tuning transformer-based text spotters on specific datasets can further improve performance.
arXiv Detail & Related papers (2024-05-29T15:35:09Z)
- JSTR: Judgment Improves Scene Text Recognition [0.0]
We present a method for enhancing the accuracy of scene text recognition tasks by judging whether the image and text match each other.
This method boosts text recognition accuracy by providing explicit feedback on the data that the model is likely to misrecognize.
arXiv Detail & Related papers (2024-04-09T02:55:12Z)
- SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting [126.01629300244001]
We propose a new end-to-end scene text spotting framework termed SwinTextSpotter v2.
We enhance the relationship between the two tasks using novel Recognition Conversion and Recognition Alignment modules.
SwinTextSpotter v2 achieves state-of-the-art performance on various multilingual (English, Chinese, and Vietnamese) benchmarks.
arXiv Detail & Related papers (2024-01-15T12:33:00Z)
- IPAD: Iterative, Parallel, and Diffusion-based Network for Scene Text Recognition [5.525052547053668]
Scene text recognition has attracted increasing attention due to its diverse applications.
Most state-of-the-art methods adopt an encoder-decoder framework with the attention mechanism, autoregressively generating text from left to right.
We propose an alternative solution, using a parallel and iterative decoder that adopts an easy-first decoding strategy.
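Easy-first decoding itself fits in a few lines. The sketch below is a hypothetical stand-in, not IPAD's actual interface: each round predicts every still-masked position in parallel and commits only the most confident ones.

```python
import math
import torch

@torch.no_grad()
def easy_first_decode(model, image_feats, seq_len, mask_id, iters=4):
    # Start fully masked; `model(image_feats, tokens)` is assumed to return
    # (1, seq_len, vocab) logits for all positions at once.
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    per_round = math.ceil(seq_len / iters)
    for _ in range(iters):
        masked = tokens[0] == mask_id
        if not masked.any():
            break
        probs = model(image_feats, tokens).softmax(dim=-1)
        conf, pred = probs[0].max(dim=-1)          # per-position confidence
        conf[~masked] = -1.0                       # only fill masked slots
        commit = conf.topk(min(per_round, int(masked.sum()))).indices
        tokens[0, commit] = pred[commit]           # commit the "easy" positions
    return tokens
```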
arXiv Detail & Related papers (2023-12-19T08:03:19Z)
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
Even with fewer text instances, our synthesized text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition [101.60244147302197]
We introduce contrastive learning and masked image modeling to learn discrimination and generation of text images.
Our method outperforms previous self-supervised text recognition methods by 10.2%-20.2% on irregular scene text recognition datasets.
Our proposed text recognizer exceeds previous state-of-the-art text recognition methods by an average of 5.3% on 11 benchmarks, with similar model size.
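A rough sketch of combining the two signals (module names and shapes are assumptions, not the paper's code): a contrastive "reading" loss over two augmented views plus a masked-reconstruction "writing" loss.

```python
import torch
import torch.nn.functional as F

def ssl_losses(encoder, decoder, view1, view2, masked_view, target, tau=0.07):
    # "Reading": InfoNCE between two views; encoder is assumed to return (B, D).
    z1 = F.normalize(encoder(view1), dim=-1)
    z2 = F.normalize(encoder(view2), dim=-1)
    logits = z1 @ z2.t() / tau                     # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=logits.device)
    contrastive = F.cross_entropy(logits, labels)  # match the i-th pair
    # "Writing": reconstruct the original image from a masked view.
    generative = F.mse_loss(decoder(encoder(masked_view)), target)
    return contrastive + generative
```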
arXiv Detail & Related papers (2022-07-01T03:50:26Z)
- SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition [73.61592015908353]
We propose a new end-to-end scene text spotting framework termed SwinTextSpotter.
Using a Transformer with a dynamic head as the detector, we unify the two tasks with a novel Recognition Conversion mechanism.
The design results in a concise framework that requires neither an additional rectification module nor character-level annotations.
arXiv Detail & Related papers (2022-03-19T01:14:42Z)
- TextAdaIN: Fine-Grained AdaIN for Robust Text Recognition [3.3946853660795884]
In text recognition, we reveal that networks overly depend on local image statistics.
We suggest an approach to regulate the reliance on local statistics that improves overall text recognition performance.
Our method, termed TextAdaIN, creates local distortions in the feature map which prevent the network from overfitting to the local statistics.
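In that spirit, here is a hedged sketch of an AdaIN-style local distortion (window count and placement are assumptions, not the official TextAdaIN code): each vertical strip of the feature map is re-normalized with the statistics of the matching strip from another sample in the batch.

```python
import torch

def local_adain_distort(feats, windows=5, eps=1e-5):
    """feats: (B, C, H, W) feature map from a recognition backbone."""
    b = feats.size(0)
    donors = torch.randperm(b, device=feats.device)   # donor sample per item
    out = feats.clone()
    for strip in torch.arange(feats.size(3), device=feats.device).chunk(windows):
        x = feats[:, :, :, strip]
        mu = x.mean(dim=(2, 3), keepdim=True)
        sd = x.std(dim=(2, 3), keepdim=True) + eps
        # Re-normalize each strip with another sample's first-order statistics.
        out[:, :, :, strip] = (x - mu) / sd * sd[donors] + mu[donors]
    return out
```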
arXiv Detail & Related papers (2021-05-09T10:47:48Z)
- Scene Text Retrieval via Joint Text Detection and Similarity Learning [68.24531728554892]
Scene text retrieval aims to localize and search for all text instances from an image gallery that are the same as or similar to a given query text.
We address this problem by directly learning a cross-modal similarity between a query text and each text instance from natural images.
In this way, scene text retrieval can be simply performed by ranking the detected text instances with the learned similarity.
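The ranking step reduces to a cosine-similarity sort; in this hypothetical sketch, `text_encoder` and the precomputed instance embeddings stand in for whatever the paper actually learns.

```python
import torch.nn.functional as F

def rank_instances(query, instance_feats, text_encoder):
    """instance_feats: (N, D) embeddings of detected text instances."""
    q = F.normalize(text_encoder(query), dim=-1)    # (D,) query embedding
    inst = F.normalize(instance_feats, dim=-1)
    scores = inst @ q                               # (N,) cosine similarities
    return scores.argsort(descending=True), scores  # ranking + scores
```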
arXiv Detail & Related papers (2021-04-04T07:18:38Z)
- Text Recognition -- Real World Data and Where to Find Them [36.10220484561196]
We present a method for exploiting weakly annotated images to improve text extraction pipelines.
The approach uses an arbitrary end-to-end text recognition system to obtain text region proposals and their, possibly erroneous, transcriptions.
It produces nearly error-free, localised instances of scene text, which we treat as "pseudo ground truth" (PGT).
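A minimal sketch of such PGT filtering, under the assumption that weak annotations are image-level word lists: a proposal is kept only when its transcription falls within a small normalized edit distance of some weakly annotated word.

```python
def edit_distance(a, b):
    # Standard single-row Levenshtein dynamic program.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def select_pgt(proposals, weak_words, max_rel_dist=0.2):
    """proposals: [(box, transcription)]; weak_words: image-level annotations."""
    pgt = []
    for box, text in proposals:
        dist, word = min((edit_distance(text.lower(), w.lower()), w)
                         for w in weak_words)
        if dist <= max_rel_dist * max(1, len(word)):
            pgt.append((box, word))  # keep the box; trust the weak word's spelling
    return pgt
```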
arXiv Detail & Related papers (2020-07-06T22:23:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.