Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing
- URL: http://arxiv.org/abs/2405.04377v1
- Date: Tue, 7 May 2024 15:00:11 GMT
- Title: Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing
- Authors: Boqiang Zhang, Hongtao Xie, Zuan Gao, Yuxin Wang,
- Abstract summary: Scene text images contain not only style information (font, background) but also content information (character, texture).
Previous representation learning methods use tightly coupled features for all tasks, resulting in sub-optimal performance.
We propose a Disentangled Representation Learning framework (DARLING) aimed at disentangling these two types of features for improved adaptability.
- Score: 47.421888361871254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text images contain not only style information (font, background) but also content information (character, texture). Different scene text tasks need different information, but previous representation learning methods use tightly coupled features for all tasks, resulting in sub-optimal performance. We propose a Disentangled Representation Learning framework (DARLING) that disentangles these two types of features so that downstream tasks can choose what they really need. Specifically, we synthesize a dataset of image pairs with identical style but different content, and use it to decouple the two feature types through the supervision design: we split the visual representation into style and content features, supervise the content features with a text recognition loss, and align the style features of each image pair with an alignment loss. The style features are then employed to reconstruct the counterpart image via an image decoder, given a prompt that indicates the counterpart's content. This operation effectively decouples the features according to their distinctive properties. To the best of our knowledge, this is the first work in the field of scene text to disentangle the inherent properties of text images. Our method achieves state-of-the-art performance in Scene Text Recognition, Removal, and Editing.
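Read as pseudocode, the supervision design above might look like the following minimal PyTorch sketch. Everything here is an illustrative simplification, not the paper's implementation: the backbone, the 50/50 feature split, the single-label recognition loss, and the flat image decoder are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DarlingSketch(nn.Module):
    """Toy stand-in for DARLING: one shared encoder whose output is
    split into a style half and a content half (hypothetical shapes)."""
    def __init__(self, dim=256, num_chars=97):
        super().__init__()
        self.encoder = nn.Sequential(            # stand-in visual backbone
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.char_head = nn.Linear(dim // 2, num_chars)  # recognition head
        # decoder maps (style, content prompt) back to a flat 3x32x128 image
        self.decoder = nn.Linear(dim, 3 * 32 * 128)

    def split(self, img):
        style, content = self.encoder(img).chunk(2, dim=-1)
        return style, content

def darling_loss(model, img_a, img_b, labels_a, prompt_b):
    """img_a / img_b share style but differ in content (the synthetic
    pairs); prompt_b is a [B, dim//2] embedding of b's content."""
    style_a, content_a = model.split(img_a)
    style_b, _ = model.split(img_b)
    # 1) content features supervised by a (here: one-label) recognition loss
    rec = F.cross_entropy(model.char_head(content_a), labels_a)
    # 2) alignment loss pulls the pair's style features together
    align = F.mse_loss(style_a, style_b)
    # 3) style_a plus a prompt for b's content must reconstruct img_b
    recon = model.decoder(torch.cat([style_a, prompt_b], dim=-1))
    return rec + align + F.mse_loss(recon, img_b.flatten(1))
```

Because the pair shares style but not content, the alignment and reconstruction terms can only be satisfied by routing style information into the style half of the representation, which is exactly the disentangling pressure the abstract describes.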
Related papers
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models (VLLMs) to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion.
In the second stage, the descriptions serve as contextual information and, together with the image input, are used to train a transformer-based architecture.
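A rough sketch of this two-stage recipe, where `caption_emotion` and the classifier below are hypothetical stand-ins rather than the paper's code:

```python
import torch
import torch.nn as nn

def caption_emotion(image) -> str:
    """Stage 1 (hypothetical): prompt a VLLM for a natural-language
    description of the subject's apparent emotion."""
    raise NotImplementedError  # e.g., any captioning-capable VLLM

class ContextualEmotionClassifier(nn.Module):
    """Stage 2: fuse description tokens with image tokens and classify
    with a small transformer (dimensions are illustrative)."""
    def __init__(self, dim=512, num_emotions=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_emotions)

    def forward(self, image_tokens, text_tokens):
        # concatenate image and description tokens into one sequence,
        # mean-pool the fused sequence, then classify
        seq = torch.cat([image_tokens, text_tokens], dim=1)
        return self.head(self.fuse(seq).mean(dim=1))
```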
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions.
We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- Toward Understanding WordArt: Corner-Guided Transformer for Scene Text Recognition [63.6608759501803]
We propose to recognize artistic text at three levels.
Firstly, corner points are applied to guide the extraction of local features inside characters, given the robustness of corner structures to variations in appearance and shape.
Secondly, we design a character contrastive loss to model character-level features, improving the feature representation for character classification (a generic sketch of such a loss follows this summary).
Thirdly, we utilize Transformer to learn the global feature on image-level and model the global relationship of the corner points.
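A character contrastive loss of this kind is commonly instantiated as a supervised InfoNCE over per-character embeddings; the following is one such instantiation, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def char_contrastive_loss(feats, char_ids, temperature=0.1):
    """Supervised contrastive loss over character-level features:
    embeddings of the same character class are pulled together and
    others pushed apart. feats: [N, D]; char_ids: [N] class labels."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / temperature            # pairwise similarities
    eye = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    pos = (char_ids.unsqueeze(0) == char_ids.unsqueeze(1)) & ~eye
    logits = sim.masked_fill(eye, float('-inf'))     # drop self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # average log-likelihood of positives; anchors without positives are skipped
    pos_log_prob = log_prob.masked_fill(~pos, 0.0).sum(dim=1)
    has_pos = pos.any(dim=1)
    return (-pos_log_prob[has_pos] / pos.sum(dim=1)[has_pos]).mean()
```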
arXiv Detail & Related papers (2022-07-31T14:11:05Z)
- Knowledge Mining with Scene Text for Fine-Grained Recognition [53.74297368412834]
We propose an end-to-end trainable network that mines the implicit contextual knowledge behind the scene text in an image.
We employ KnowBert to retrieve relevant knowledge for semantic representation and combine it with image features for fine-grained classification.
Our method outperforms the state of the art by 3.72% mAP and 5.39% mAP on its two benchmark datasets, respectively.
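The fusion step can be pictured as below, with KnowBert's output standing in as a generic knowledge embedding; the class name, dimensions, and two-layer classifier are illustrative, not the paper's design:

```python
import torch
import torch.nn as nn

class KnowledgeFusionHead(nn.Module):
    """Illustrative fusion of retrieved-knowledge embeddings with image
    features for fine-grained classification."""
    def __init__(self, img_dim=2048, know_dim=768, num_classes=28):
        super().__init__()
        self.proj = nn.Linear(know_dim, img_dim)  # knowledge -> image space
        self.classifier = nn.Sequential(
            nn.Linear(img_dim * 2, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, img_feat, know_feat):
        # img_feat: e.g. CNN backbone output; know_feat: e.g. KnowBert
        # embedding of knowledge retrieved for the scene text in the image
        fused = torch.cat([img_feat, self.proj(know_feat)], dim=-1)
        return self.classifier(fused)
```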
arXiv Detail & Related papers (2022-03-27T05:54:00Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
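One plausible shape for this encoder pair, with illustrative dimensions (the published architecture differs in detail):

```python
import torch
import torch.nn as nn

class PretrainEncodersSketch(nn.Module):
    """Illustrative image encoder plus character-aware text encoder.
    The text branch embeds each character of an annotated word, so the
    textual feature is aware of character identity and order."""
    def __init__(self, dim=256, charset_size=97, max_len=25):
        super().__init__()
        self.image_encoder = nn.Sequential(      # stand-in visual backbone
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.char_embed = nn.Embedding(charset_size, dim)
        self.pos_embed = nn.Parameter(torch.zeros(max_len, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, images, char_ids):
        img_feat = self.image_encoder(images)               # [B, D]
        tokens = self.char_embed(char_ids) + self.pos_embed[: char_ids.size(1)]
        txt_feat = self.text_encoder(tokens).mean(dim=1)    # [B, D]
        return img_feat, txt_feat  # e.g., paired via a contrastive objective
```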
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Learning Semantic-Aligned Feature Representation for Text-based Person Search [8.56017285139081]
We propose a semantic-aligned embedding method for text-based person search.
Cross-modal feature alignment is achieved by automatically learning semantically aligned visual and textual features.
Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-13T14:54:38Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS leverages vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
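Text-to-pixel alignment can be sketched as scoring every pixel embedding against a sentence embedding; the real CRIS decoder is more elaborate, so treat this as a schematic only:

```python
import torch
import torch.nn.functional as F

def text_to_pixel_logits(pixel_feats, text_feat, temperature=0.07):
    """pixel_feats: [B, D, H, W] visual features; text_feat: [B, D]
    sentence embedding. Returns [B, H, W] segmentation logits."""
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_feat = F.normalize(text_feat, dim=1)
    # cosine similarity between the text vector and every pixel vector
    logits = torch.einsum('bdhw,bd->bhw', pixel_feats, text_feat) / temperature
    return logits  # train with BCE-with-logits against the reference mask
```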
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- RewriteNet: Realistic Scene Text Image Generation via Editing Text in Real-world Image [17.715320405808935]
Scene text editing (STE) is a challenging task due to the complex interaction between text and style.
We propose a novel representational learning-based STE model, referred to as RewriteNet.
Our experiments demonstrate that RewriteNet achieves better quantitative and qualitative performance than competing methods.
arXiv Detail & Related papers (2021-07-23T06:32:58Z)
- STEFANN: Scene Text Editor using Font Adaptive Neural Network [18.79337509555511]
We propose a method to modify text in an image at the character level.
We propose two neural network architectures: (a) FANnet, to achieve structural consistency with the source font, and (b) Colornet, to preserve the source color.
Our method works as a unified platform for modifying text in images.
arXiv Detail & Related papers (2019-03-04T11:56:53Z)