SimAN: Exploring Self-Supervised Representation Learning of Scene Text via Similarity-Aware Normalization
- URL: http://arxiv.org/abs/2203.10492v2
- Date: Tue, 22 Mar 2022 12:06:41 GMT
- Title: SimAN: Exploring Self-Supervised Representation Learning of Scene Text via Similarity-Aware Normalization
- Authors: Canjie Luo, Lianwen Jin, Jingdong Chen
- Abstract summary: Self-supervised representation learning has drawn considerable attention from the scene text recognition community.
We tackle the issue by formulating the representation learning scheme in a generative manner.
We propose a Similarity-Aware Normalization (SimAN) module to identify the different patterns and align the corresponding styles from the guiding patch.
- Score: 66.35116147275568
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently self-supervised representation learning has drawn considerable
attention from the scene text recognition community. Different from previous
studies using contrastive learning, we tackle the issue from an alternative
perspective, i.e., by formulating the representation learning scheme in a
generative manner. Typically, the neighboring image patches among one text line
tend to have similar styles, including the strokes, textures, colors, etc.
Motivated by this observation, we augment one image patch and use its
neighboring patch as guidance to recover it. Specifically, we propose a
Similarity-Aware Normalization (SimAN) module to identify the different
patterns and align the corresponding styles from the guiding patch. In this
way, the network gains representation capability for distinguishing complex
patterns such as messy strokes and cluttered backgrounds. Experiments show that
the proposed SimAN significantly improves the representation quality and
achieves promising performance. Moreover, we surprisingly find that our
self-supervised generative network has impressive potential for data synthesis,
text image editing, and font interpolation, which suggests that the proposed
SimAN has a wide range of practical applications.
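The abstract describes the mechanism only at a high level. The PyTorch snippet below is a minimal sketch of one plausible reading of a similarity-aware normalization step: features of the augmented patch are instance-normalized to strip their (corrupted) style, then each position attends to the guiding patch's features to pull back the neighbor's style. The class name, projection layers, feature shapes, and L1 reconstruction loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimANSketch(nn.Module):
    """Hypothetical similarity-aware normalization block (not the official SimAN code).

    Instance normalization removes the augmented patch's style; attention over the
    guiding patch's features transfers the neighbor's style back onto the content.
    """
    def __init__(self, channels: int, key_dim: int = 64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.query_proj = nn.Conv2d(channels, key_dim, kernel_size=1)
        self.key_proj = nn.Conv2d(channels, key_dim, kernel_size=1)
        self.value_proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, content_feat: torch.Tensor, guide_feat: torch.Tensor) -> torch.Tensor:
        # content_feat: features of the augmented patch, (B, C, H, W)
        # guide_feat:   features of the neighboring (guiding) patch, (B, C, H, W)
        b, c, h, w = content_feat.shape
        q = self.query_proj(self.norm(content_feat)).flatten(2)   # (B, D, HW)
        k = self.key_proj(self.norm(guide_feat)).flatten(2)       # (B, D, HW)
        v = self.value_proj(guide_feat).flatten(2)                 # (B, C, HW)

        # Similarity between every content position and every guide position.
        attn = torch.softmax(q.transpose(1, 2) @ k / (k.shape[1] ** 0.5), dim=-1)  # (B, HW, HW)

        # Aggregate style-bearing values from the guide for each content position.
        styled = (attn @ v.transpose(1, 2)).transpose(1, 2).reshape(b, c, h, w)
        return styled


if __name__ == "__main__":
    # Toy self-supervised step: recover an augmented patch under its neighbor's guidance.
    feat_aug = torch.randn(2, 128, 8, 32)    # encoder features of the augmented patch
    feat_guide = torch.randn(2, 128, 8, 32)  # encoder features of the neighboring patch
    block = SimANSketch(channels=128)
    recovered = block(feat_aug, feat_guide)
    target = torch.randn(2, 128, 8, 32)      # stand-in for the original (unaugmented) patch features
    loss = F.l1_loss(recovered, target)      # reconstruction objective, e.g. L1
    loss.backward()
    print(recovered.shape, float(loss))
```

Under this reading, the attention weights play the role of the "similarity" in SimAN: positions whose normalized content matches the guide inherit that region's style statistics, which is what forces the encoder to separate strokes from background clutter.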
Related papers
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973] (arXiv, 2022-11-14)
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architecture and the diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in a GAN-based generative model.
- Compositional Mixture Representations for Vision and Text [43.2292923754127] (arXiv, 2022-06-13)
A common representation space between vision and language allows deep networks to relate objects in the image to the corresponding semantic meaning.
We present a model that learns a shared Gaussian mixture representation, imposing the compositionality of the text onto the visual domain without explicit location supervision.
- Self-Supervised Image-to-Text and Text-to-Image Synthesis [23.587581181330123] (arXiv, 2021-12-09)
We propose a novel self-supervised deep learning approach to learning cross-modal embedding spaces.
In our approach, we first obtain dense vector representations of images using a StackGAN-based autoencoder and dense sentence-level representations using an LSTM-based text autoencoder.
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086] (arXiv, 2021-11-30)
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment (a minimal sketch of such an alignment loss follows this list).
Our framework significantly outperforms state-of-the-art methods without any post-processing.
- ViCE: Self-Supervised Visual Concept Embeddings as Contextual and Pixel Appearance Invariant Semantic Representations [77.3590853897664] (arXiv, 2021-11-24)
This work presents a self-supervised method to learn dense, semantically rich visual embeddings for images, inspired by methods for learning word embeddings in NLP.
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193] (arXiv, 2021-05-20)
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
- Primitive Representation Learning for Scene Text Recognition [7.818765015637802] (arXiv, 2021-05-10)
We propose a primitive representation learning method that aims to exploit intrinsic representations of scene text images.
A Primitive REpresentation learning Network (PREN) is constructed to use the visual text representations for parallel decoding.
We also propose a framework called PREN2D to alleviate the misalignment problem in attention-based methods.
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772] (arXiv, 2021-05-06)
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
- Improving Image Captioning with Better Use of Captions [65.39641077768488] (arXiv, 2020-06-21)
We present a novel image captioning architecture that better exploits the semantics available in captions and leverages them to enhance both image representation and caption generation.
Our model first constructs caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning to jointly predict word and object/predicate tag sequences.
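Several of the related papers above, CRIS in particular, hinge on aligning a sentence embedding with per-pixel visual features. The snippet below is a minimal, hypothetical illustration of such a text-to-pixel alignment objective (cosine similarity between the sentence embedding and every pixel feature, supervised with a binary cross-entropy on the referred mask). The function name, feature shapes, temperature, and loss choice are illustrative assumptions, not CRIS's actual implementation.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_alignment_loss(pixel_feat: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 mask: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical text-to-pixel alignment objective (illustrative only).

    pixel_feat: (B, C, H, W) per-pixel visual features
    text_emb:   (B, C) sentence-level text embedding
    mask:       (B, H, W) binary ground-truth mask of the referred region
    """
    # Normalize both modalities so the dot product is a cosine similarity.
    pixel_feat = F.normalize(pixel_feat, dim=1)
    text_emb = F.normalize(text_emb, dim=1)

    # Similarity of every pixel to the sentence embedding: (B, H, W).
    logits = torch.einsum("bchw,bc->bhw", pixel_feat, text_emb) / temperature

    # Pixels inside the referred region are pulled toward the text,
    # pixels outside are pushed away.
    return F.binary_cross_entropy_with_logits(logits, mask.float())


if __name__ == "__main__":
    pixel_feat = torch.randn(2, 256, 20, 20, requires_grad=True)
    text_emb = torch.randn(2, 256, requires_grad=True)
    mask = torch.rand(2, 20, 20) > 0.5
    loss = text_to_pixel_alignment_loss(pixel_feat, text_emb, mask)
    loss.backward()
    print(float(loss))
```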