On Vocabulary Reliance in Scene Text Recognition
- URL: http://arxiv.org/abs/2005.03959v1
- Date: Fri, 8 May 2020 11:16:58 GMT
- Title: On Vocabulary Reliance in Scene Text Recognition
- Authors: Zhaoyi Wan, Jielei Zhang, Liang Zhang, Jiebo Luo, Cong Yao
- Abstract summary: State-of-the-art methods perform well on images with words within the vocabulary but generalize poorly to images with words outside it.
We call this phenomenon "vocabulary reliance".
We propose a simple yet effective mutual learning strategy to allow models of two families to learn collaboratively.
- Score: 79.21737876442253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The pursuit of high performance on public benchmarks has been the driving
force for research in scene text recognition, and notable progress has been
achieved. However, a close investigation reveals a startling fact that the
state-of-the-art methods perform well on images with words within vocabulary
but generalize poorly to images with words outside vocabulary. We call this
phenomenon "vocabulary reliance". In this paper, we establish an analytical
framework to conduct an in-depth study on the problem of vocabulary reliance in
scene text recognition. Key findings include: (1) Vocabulary reliance is
ubiquitous, i.e., all existing algorithms exhibit this characteristic to some
extent; (2) Attention-based decoders prove weak in generalizing to
words outside vocabulary and segmentation-based decoders perform well in
utilizing visual features; (3) Context modeling is highly coupled with the
prediction layers. These findings provide new insights and can benefit future
research in scene text recognition. Furthermore, we propose a simple yet
effective mutual learning strategy to allow models of two families
(attention-based and segmentation-based) to learn collaboratively. This remedy
alleviates the problem of vocabulary reliance and improves the overall scene
text recognition performance.
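The abstract does not spell out the training objective behind this mutual learning strategy. As a rough illustration only, the sketch below follows the common deep-mutual-learning recipe: each recognizer is trained on the ground-truth labels while also matching the other model's (detached) per-character distribution through a KL term. The model names, tensor shapes, and loss weighting are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of one mutual-learning step between an attention-based and a
# segmentation-based recognizer. This is NOT the paper's exact objective; it
# follows the generic deep-mutual-learning recipe (supervised cross-entropy
# plus a symmetric KL term between the two models' predictions).
# `attn_model`, `seg_model`, and the aligned per-character logits are assumptions.
import torch
import torch.nn.functional as F

def mutual_learning_step(attn_model, seg_model, images, targets, alpha=1.0):
    # Both models are assumed to output logits of shape (B, T, num_classes)
    # over the same character alphabet and the same maximum text length T.
    attn_logits = attn_model(images)
    seg_logits = seg_model(images)

    # Standard supervised losses against the ground-truth character labels (B, T).
    ce_attn = F.cross_entropy(attn_logits.flatten(0, 1), targets.flatten())
    ce_seg = F.cross_entropy(seg_logits.flatten(0, 1), targets.flatten())

    # Symmetric KL term: each model mimics the other's detached prediction, so
    # the vocabulary-free segmentation branch can temper the vocabulary
    # reliance of the attention branch, and vice versa.
    kl_attn = F.kl_div(F.log_softmax(attn_logits, dim=-1),
                       F.softmax(seg_logits.detach(), dim=-1),
                       reduction="batchmean")
    kl_seg = F.kl_div(F.log_softmax(seg_logits, dim=-1),
                      F.softmax(attn_logits.detach(), dim=-1),
                      reduction="batchmean")

    return ce_attn + ce_seg + alpha * (kl_attn + kl_seg)
```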
Related papers
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
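The summary above describes replacing one-hot targets with corpus-derived text distributions. As a loose, hypothetical sketch of that general idea (not the paper's actual construction), one can supervise the recognizer with a soft target that mixes the one-hot label with a distribution estimated from a large text corpus:

```python
# Illustrative sketch only: training with a soft target derived from a
# language prior instead of a pure one-hot label. The prior construction
# (e.g. character statistics from a corpus) is an assumption, not the
# paper's method.
import torch
import torch.nn.functional as F

def soft_target_loss(logits, one_hot_targets, prior_dist, smoothing=0.1):
    # logits: (N, num_classes); one_hot_targets: (N, num_classes)
    # prior_dist: (num_classes,) distribution estimated from a large text corpus.
    soft_targets = (1.0 - smoothing) * one_hot_targets + smoothing * prior_dist
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```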
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval [87.69394953339238]
We propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics in video-text retrieval.
We show our framework largely outperforms previous video-text retrieval methods, with 4.8% and 8.2% Recall@1 improvements on MSR-VTT and DiDeMo, respectively.
arXiv Detail & Related papers (2024-02-26T17:36:50Z)
- Towards Open Vocabulary Learning: A Survey [146.90188069113213]
Deep neural networks have made impressive advancements in various core tasks like segmentation, tracking, and detection.
Recently, open vocabulary settings were proposed due to the rapid progress of vision language pre-training.
This paper provides a thorough review of open vocabulary learning, summarizing and analyzing recent developments in the field.
arXiv Detail & Related papers (2023-06-28T02:33:06Z)
- CLIPTER: Looking at the Bigger Picture in Scene Text Recognition [10.561377899703238]
We harness the capabilities of modern vision-language models, such as CLIP, to provide scene-level information to the crop-based recognizer.
We achieve this by fusing a rich representation of the entire image, obtained from the vision-language model, with the recognizer's word-level features via a gated cross-attention mechanism.
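A minimal sketch of what such a gated cross-attention fusion could look like, assuming the recognizer's word-level features and the vision-language model's scene tokens have already been projected to a common dimension; the module names and gating form are illustrative assumptions, not CLIPTER's exact architecture.

```python
# Rough sketch of gated cross-attention fusion in the spirit of CLIPTER:
# word-level recognizer features attend to scene-level tokens from a
# vision-language model, and a learned gate controls how much of that
# context is mixed back in. Names and dimensions are assumptions.
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, word_feats, scene_feats):
        # word_feats: (B, T, dim) per-character/word features from the recognizer
        # scene_feats: (B, S, dim) tokens from the full-image vision-language model
        context, _ = self.cross_attn(query=word_feats, key=scene_feats, value=scene_feats)
        gate = self.gate(word_feats)           # per-position gate in [0, 1]
        return word_feats + gate * context     # gated residual fusion
```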
arXiv Detail & Related papers (2023-01-18T12:16:19Z)
- Language with Vision: a Study on Grounded Word and Sentence Embeddings [6.231247903840833]
Grounding language in vision is an active field of research seeking to construct cognitively plausible word and sentence representations.
The present study proposes a computational grounding model for pre-trained word embeddings.
Our model effectively balances the interplay between language and vision by aligning textual embeddings with visual information.
arXiv Detail & Related papers (2022-06-17T15:04:05Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has not yet reached a conclusive answer.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z)
- Deep learning models for representing out-of-vocabulary words [1.4502611532302039]
We present a performance evaluation of deep learning models for representing out-of-vocabulary (OOV) words.
Although the best technique for handling OOV words is different for each task, Comick, a deep learning method that infers the embedding based on the context and the morphological structure of the OOV word, obtained promising results.
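As a toy illustration of the general idea of combining context and morphology for out-of-vocabulary words (this is not Comick's actual architecture, which is a trained neural model), one could mix a character n-gram signal with the average embedding of the surrounding words:

```python
# Toy illustration of context + morphology based OOV handling (NOT Comick):
# combine an embedding built from character n-grams with the average
# embedding of surrounding words. `word_vecs` and `ngram_vecs` are assumed
# lookup tables (dicts mapping strings to same-dimension NumPy vectors).
import numpy as np

def oov_embedding(word, context_words, word_vecs, ngram_vecs, n=3):
    dim = len(next(iter(word_vecs.values())))

    # Morphological signal: average of character n-gram vectors of the word.
    padded = f"<{word}>"
    ngrams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    morph_vecs = [ngram_vecs[g] for g in ngrams if g in ngram_vecs]
    morph = np.mean(morph_vecs, axis=0) if morph_vecs else np.zeros(dim)

    # Context signal: average of the embeddings of the surrounding words.
    ctx_vecs = [word_vecs[w] for w in context_words if w in word_vecs]
    context = np.mean(ctx_vecs, axis=0) if ctx_vecs else np.zeros(dim)

    # Simple fixed-weight combination of both signals.
    return 0.5 * morph + 0.5 * context
```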
arXiv Detail & Related papers (2020-07-14T19:31:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.