Sequence-to-Sequence Contrastive Learning for Text Recognition
- URL: http://arxiv.org/abs/2012.10873v1
- Date: Sun, 20 Dec 2020 09:07:41 GMT
- Title: Sequence-to-Sequence Contrastive Learning for Text Recognition
- Authors: Aviad Aberdam, Ron Litman, Shahar Tsiper, Oron Anschel, Ron Slossberg,
Shai Mazor, R. Manmatha, Pietro Perona
- Abstract summary: We propose a framework for sequence-to-sequence contrastive learning (SeqCLR) of visual representations.
Experiments on handwritten text and on scene text show that when a text decoder is trained on the learned representations, our method outperforms non-sequential contrastive methods.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a framework for sequence-to-sequence contrastive learning (SeqCLR)
of visual representations, which we apply to text recognition. To account for
the sequence-to-sequence structure, each feature map is divided into different
instances over which the contrastive loss is computed. This operation enables
us to contrast at a sub-word level, where from each image we extract several
positive pairs and multiple negative examples. To yield effective visual
representations for text recognition, we further suggest novel augmentation
heuristics, different encoder architectures and custom projection heads.
Experiments on handwritten text and on scene text show that when a text decoder
is trained on the learned representations, our method outperforms
non-sequential contrastive methods. In addition, when the amount of supervision
is reduced, SeqCLR significantly improves performance compared with supervised
training, and when fine-tuned with 100% of the labels, our method achieves
state-of-the-art results on standard handwritten text recognition benchmarks.
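The central mechanism in the abstract, dividing each feature map into sub-word instances and computing a contrastive loss over them, can be sketched roughly as follows. The function names, the average-pooling instance mapping, and the NT-Xent-style loss are illustrative assumptions for this sketch, not the paper's exact formulation:

```python
import numpy as np

def frames_to_instances(feature_map, num_instances):
    """Pool the T frames of a (T, C) feature map into a fixed number of
    instances by averaging contiguous windows (one possible instance-mapping)."""
    T, C = feature_map.shape
    windows = np.array_split(np.arange(T), num_instances)
    return np.stack([feature_map[idx].mean(axis=0) for idx in windows])

def seq_contrastive_loss(za, zb, temperature=0.1):
    """NT-Xent-style loss over the 2*K instances from two augmented views:
    instance i in view A is positive with instance i in view B; all other
    instances in the batch serve as negatives."""
    z = np.concatenate([za, zb])                       # (2K, C)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    K = len(za)
    targets = np.concatenate([np.arange(K, 2 * K), np.arange(K)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * K), targets].mean()

# Usage: two augmented views of the same text image, each encoded into a
# 12-frame sequence, pooled into 4 sub-word instances and contrasted.
rng = np.random.default_rng(0)
za = frames_to_instances(rng.standard_normal((12, 8)), num_instances=4)
zb = frames_to_instances(rng.standard_normal((12, 8)), num_instances=4)
loss = seq_contrastive_loss(za, zb)
```

This is what distinguishes the sequence-to-sequence setting from image-level contrastive methods: each image contributes several positive pairs (one per instance) and many negatives, rather than a single global pair.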
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- Out of Length Text Recognition with Sub-String Matching [54.63761108308825]
In this paper, we term this task Out of Length (OOL) text recognition.
We propose a novel method called OOL Text Recognition with sub-String Matching (SMTR).
SMTR comprises two cross-attention-based modules: one encodes a sub-string containing multiple characters into next and previous queries, and the other employs the queries to attend to the image features.
arXiv Detail & Related papers (2024-07-17T05:02:17Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- Self-supervised Character-to-Character Distillation for Text Recognition [54.12490492265583]
We propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate text representation learning.
CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution.
arXiv Detail & Related papers (2022-11-01T05:48:18Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization [15.367455931848252]
We present a sequence-to-sequence (seq2seq) autoencoder via contrastive learning for abstractive text summarization.
Our model adopts a standard Transformer-based architecture with a multi-layer bi-directional encoder and an auto-regressive decoder.
We conduct experiments on two datasets and demonstrate that our model outperforms many existing benchmarks.
arXiv Detail & Related papers (2021-08-26T18:45:13Z)
- Primitive Representation Learning for Scene Text Recognition [7.818765015637802]
We propose a primitive representation learning method that aims to exploit intrinsic representations of scene text images.
A Primitive REpresentation learning Network (PREN) is constructed to use the visual text representations for parallel decoding.
We also propose a framework called PREN2D to alleviate the misalignment problem in attention-based methods.
arXiv Detail & Related papers (2021-05-10T11:54:49Z)
- SCATTER: Selective Context Attentional Scene Text Recognizer [16.311256552979835]
Scene Text Recognition (STR) is the task of recognizing text against complex image backgrounds.
Current state-of-the-art (SOTA) methods still struggle to recognize text written in arbitrary shapes.
We introduce a novel architecture for STR, named Selective Context ATtentional Text Recognizer (SCATTER).
arXiv Detail & Related papers (2020-03-25T09:20:28Z)
- ScrabbleGAN: Semi-Supervised Varying Length Handwritten Text Generation [0.9542023122304099]
We present ScrabbleGAN, a semi-supervised approach to synthesize handwritten text images.
ScrabbleGAN relies on a novel generative model which can generate images of words with an arbitrary length.
arXiv Detail & Related papers (2020-03-23T21:41:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.