Relational Contrastive Learning for Scene Text Recognition
- URL: http://arxiv.org/abs/2308.00508v1
- Date: Tue, 1 Aug 2023 12:46:58 GMT
- Title: Relational Contrastive Learning for Scene Text Recognition
- Authors: Jinglei Zhang, Tiancheng Lin, Yi Xu, Kai Chen, Rui Zhang
- Abstract summary: We argue that prior contextual information can be interpreted as relations of textual primitives due to the heterogeneous text and background.
We propose to enrich the textual relations via rearrangement, hierarchy and interaction, and design a unified framework called RCLSTR: Relational Contrastive Learning for Scene Text Recognition.
- Score: 22.131554868199782
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Context-aware methods achieved great success in supervised scene text
recognition via incorporating semantic priors from words. We argue that such
prior contextual information can be interpreted as the relations of textual
primitives due to the heterogeneous text and background, which can provide
effective self-supervised labels for representation learning. However, textual
relations are restricted to the finite size of the dataset due to lexical
dependencies, which causes the problem of over-fitting and compromises
representation robustness. To this end, we propose to enrich the textual
relations via rearrangement, hierarchy and interaction, and design a unified
framework called RCLSTR: Relational Contrastive Learning for Scene Text
Recognition. Based on causality, we theoretically explain that the three modules
suppress the bias caused by the contextual prior and thus guarantee
representation robustness. Experiments on representation quality show that our
method outperforms state-of-the-art self-supervised STR methods. Code is
available at https://github.com/ThunderVVV/RCLSTR.
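The rearrangement, hierarchy, and interaction modules are specific to the paper, but the backbone objective is standard contrastive learning. Below is a minimal, hedged sketch of an InfoNCE loss over two views of the same text image; all names, shapes, and the temperature value are illustrative, not the authors' code.

```python
# Minimal InfoNCE sketch of the contrastive objective that frameworks like
# RCLSTR build on; the rearrangement/hierarchy/interaction modules are
# simplified here to two augmented views of the same images.
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (batch, dim) embeddings of two views of the same images."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0))      # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: random embeddings stand in for encoder outputs.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(info_nce_loss(z1, z2).item())
```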
Related papers
- Relational Contrastive Learning and Masked Image Modeling for Scene Text Recognition [36.59116507158687]
We introduce a unified framework of Relational Contrastive Learning and Masked Image Modeling for STR (RCMSTR).
The proposed RCMSTR demonstrates superior performance in various STR-related downstream tasks, outperforming the existing state-of-the-art self-supervised STR techniques.
arXiv Detail & Related papers (2024-11-18T01:11:47Z)
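RCMSTR pairs relational contrastive learning with masked image modeling. As a rough illustration of the MIM half only, the sketch below masks random patches of a text image and regresses the masked pixels; the patch size, mask ratio, and one-layer "decoder" are stand-in assumptions, not the paper's architecture.

```python
# Hedged sketch of a masked-image-modeling objective: hide random patches
# and penalize reconstruction error only at the masked positions.
import torch
import torch.nn as nn

def mim_loss(images: torch.Tensor, decoder: nn.Module,
             mask_ratio: float = 0.6, patch: int = 4) -> torch.Tensor:
    b, c, h, w = images.shape
    # Per-patch binary mask, upsampled to pixel resolution.
    mask = (torch.rand(b, 1, h // patch, w // patch) < mask_ratio).float()
    mask = mask.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    mask = mask.expand(-1, c, -1, -1)
    pred = decoder(images * (1 - mask))  # reconstruct from visible pixels only
    return ((pred - images) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)

decoder = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # stand-in for a real decoder
print(mim_loss(torch.rand(2, 3, 32, 128), decoder).item())
```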
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
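The key mechanism above is swapping one-hot character targets for a distribution informed by a language prior. A minimal sketch, assuming a uniform stand-in prior (the paper derives its distributions from a large text corpus):

```python
# Blend a one-hot ground-truth character label with a prior distribution,
# then train with soft-label cross-entropy. The uniform prior and the
# blending weight are illustrative assumptions.
import torch
import torch.nn.functional as F

VOCAB = list("abcdefghijklmnopqrstuvwxyz")

def soft_targets(gt_idx: int, prior: torch.Tensor, weight: float = 0.9) -> torch.Tensor:
    one_hot = F.one_hot(torch.tensor(gt_idx), num_classes=len(VOCAB)).float()
    return weight * one_hot + (1 - weight) * prior

prior = torch.full((len(VOCAB),), 1.0 / len(VOCAB))  # uniform stand-in prior
logits = torch.randn(len(VOCAB))
target = soft_targets(VOCAB.index("e"), prior)
loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))  # soft-label CE
print(loss.item())
```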
- Sequential Visual and Semantic Consistency for Semi-supervised Text Recognition [56.968108142307976]
Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training.
Most existing STR methods resort to synthetic data, which may introduce domain discrepancy and degrade the performance of STR models.
This paper proposes a novel semi-supervised learning method for STR that incorporates word-level consistency regularization from both visual and semantic aspects.
arXiv Detail & Related papers (2024-02-24T13:00:54Z)
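A common way to realize such consistency regularization is to pull predictions on a strongly augmented view toward detached predictions on a weak view; the sketch below shows this at the character level. The KL formulation and shapes are illustrative assumptions, not the paper's exact losses.

```python
# Word-level consistency sketch for semi-supervised STR: the weak view
# produces a pseudo-target (no gradient); the strong view is trained
# to match it.
import torch
import torch.nn.functional as F

def consistency_loss(logits_weak: torch.Tensor, logits_strong: torch.Tensor) -> torch.Tensor:
    """logits_*: (seq_len, num_classes) per-character predictions of one word."""
    p_weak = F.softmax(logits_weak, dim=-1).detach()   # pseudo-target, no gradient
    log_p_strong = F.log_softmax(logits_strong, dim=-1)
    return F.kl_div(log_p_strong, p_weak, reduction="batchmean")

weak, strong = torch.randn(7, 37), torch.randn(7, 37)  # 7 chars, 37-way charset
print(consistency_loss(weak, strong).item())
```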
- Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions.
We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
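The "mutual training and optimization" of branches typically reduces to a weighted sum of per-branch losses over a shared encoder. A hedged sketch with assumed shapes and uniform weights (not values from the paper):

```python
# Illustrative joint objective for a multi-task spotter: classification,
# segmentation, and recognition losses are summed so all branches share
# the encoder's gradients.
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits, cls_gt, seg_logits, seg_gt, rec_logits, rec_gt,
                   w=(1.0, 1.0, 1.0)):
    l_cls = F.cross_entropy(cls_logits, cls_gt)                     # text / no-text
    l_seg = F.binary_cross_entropy_with_logits(seg_logits, seg_gt)  # pixel mask
    l_rec = F.cross_entropy(rec_logits, rec_gt)                     # per-char labels
    return w[0] * l_cls + w[1] * l_seg + w[2] * l_rec

loss = multitask_loss(torch.randn(4, 2), torch.randint(0, 2, (4,)),
                      torch.randn(4, 1, 16, 16), torch.rand(4, 1, 16, 16),
                      torch.randn(4 * 7, 37), torch.randint(0, 37, (4 * 7,)))
print(loss.item())
```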
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
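Text-to-pixel alignment can be illustrated as scoring every pixel embedding against the sentence embedding and supervising the scores with the referred-region mask. The BCE formulation below is a simplification of CRIS's contrastive objective, not the authors' implementation.

```python
# Sketch of text-to-pixel alignment: per-pixel similarity logits against a
# single text embedding, supervised by the ground-truth region mask.
import torch
import torch.nn.functional as F

def text_pixel_alignment(pix: torch.Tensor, txt: torch.Tensor,
                         mask: torch.Tensor, t: float = 0.07) -> torch.Tensor:
    """pix: (hw, dim) pixel features; txt: (dim,) text feature; mask: (hw,) in {0,1}."""
    pix = F.normalize(pix, dim=-1)
    txt = F.normalize(txt, dim=-1)
    sim = pix @ txt / t  # per-pixel similarity logits
    return F.binary_cross_entropy_with_logits(sim, mask)

pix, txt = torch.randn(16 * 16, 512), torch.randn(512)
mask = (torch.rand(16 * 16) > 0.5).float()
print(text_pixel_alignment(pix, txt, mask).item())
```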
- Imposing Relation Structure in Language-Model Embeddings Using Contrastive Learning [30.00047118880045]
We propose a novel contrastive learning framework that trains sentence embeddings to encode the relations in a graph structure.
The resulting relation-aware sentence embeddings achieve state-of-the-art results on the relation extraction task.
arXiv Detail & Related papers (2021-09-02T10:58:27Z)
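One way to encode graph relations contrastively is to treat embeddings that share a relation type as positives. The toy sketch below uses integer relation labels as that grouping; the labeling scheme and supervised-contrastive form are illustrative assumptions.

```python
# Relation-aware contrastive sketch: anchors are attracted to all other
# embeddings carrying the same relation label.
import torch
import torch.nn.functional as F

def relation_contrastive(emb: torch.Tensor, rel: torch.Tensor, t: float = 0.1) -> torch.Tensor:
    """emb: (n, dim) embeddings; rel: (n,) integer relation labels."""
    emb = F.normalize(emb, dim=-1)
    sim = emb @ emb.t() / t
    sim.fill_diagonal_(float("-inf"))              # exclude self-similarity
    log_p = F.log_softmax(sim, dim=1)
    pos = (rel[:, None] == rel[None, :]).float()
    pos.fill_diagonal_(0)
    # Average log-likelihood of all same-relation positives per anchor.
    per_anchor = log_p.masked_fill(pos == 0, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return -per_anchor.mean()

emb, rel = torch.randn(8, 64), torch.randint(0, 3, (8,))
print(relation_contrastive(emb, rel).item())
```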
- Text is Text, No Matter What: Unifying Text Recognition using Knowledge Distillation [41.43280922432707]
We argue for unifying scene text recognition (STR) and handwritten text recognition (HTR): we aim for a single model that can compete favourably with two separate state-of-the-art STR and HTR models.
We first show that cross-utilisation of STR and HTR models triggers significant performance drops due to differences in their inherent challenges.
We then tackle their union by introducing a knowledge distillation (KD) based framework.
arXiv Detail & Related papers (2021-07-26T10:10:34Z)
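The distillation objective itself is standard: a student matches the temperature-softened predictions of a frozen teacher alongside the usual hard-label loss. A minimal sketch with illustrative temperature and mixing weight, not the paper's settings:

```python
# Classic knowledge-distillation loss: soft KL term against the teacher
# plus hard cross-entropy against ground-truth labels.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T: float = 4.0, alpha: float = 0.5):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s, t = torch.randn(8, 37), torch.randn(8, 37)  # student / teacher logits
labels = torch.randint(0, 37, (8,))
print(kd_loss(s, t, labels).item())
```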
- Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z)
- Towards Accurate Scene Text Recognition with Semantic Reasoning Networks [52.86058031919856]
We propose a novel end-to-end trainable framework named semantic reasoning network (SRN) for accurate scene text recognition.
A global semantic reasoning module (GSRM) is introduced to capture global semantic context through multi-way parallel transmission.
Results on 7 public benchmarks, including regular text, irregular text and non-Latin long text, verify the effectiveness and robustness of the proposed method.
arXiv Detail & Related papers (2020-03-27T09:19:25Z)
- SCATTER: Selective Context Attentional Scene Text Recognizer [16.311256552979835]
Scene Text Recognition (STR) is the task of recognizing text against complex image backgrounds.
Current state-of-the-art (SOTA) methods still struggle to recognize text written in arbitrary shapes.
We introduce a novel architecture for STR, named Selective Context ATtentional Text Recognizer (SCATTER).
arXiv Detail & Related papers (2020-03-25T09:20:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.