Don't Forget Me: Accurate Background Recovery for Text Removal via
Modeling Local-Global Context
- URL: http://arxiv.org/abs/2207.10273v1
- Date: Thu, 21 Jul 2022 02:52:42 GMT
- Title: Don't Forget Me: Accurate Background Recovery for Text Removal via
Modeling Local-Global Context
- Authors: Chongyu Liu, Lianwen Jin, Yuliang Liu, Canjie Luo, Bangdong Chen,
Fengjun Guo, and Kai Ding
- Abstract summary: We propose a Contextual-guided Text Removal Network, termed CTRNet.
CTRNet explores both low-level structure and high-level discriminative context features as prior knowledge to guide the process of background restoration.
Experiments on the benchmark datasets SCUT-EnsText and SCUT-Syn show that CTRNet significantly outperforms existing state-of-the-art methods.
- Score: 36.405779156685966
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text removal has attracted increasing attention due to its various
applications in privacy protection, document restoration, and text editing. It
has shown significant progress with deep neural networks. However, most
existing methods often generate inconsistent results for complex backgrounds.
To address this issue, we propose a Contextual-guided Text Removal Network,
termed CTRNet. CTRNet explores both low-level structure and high-level
discriminative context features as prior knowledge to guide the process of
background restoration. We further propose a Local-global Content Modeling
(LGCM) block with CNNs and a Transformer-Encoder to capture local features and
establish long-range relationships among pixels globally. Finally, we
incorporate LGCM with the context guidance for feature modeling and decoding.
Experiments on the benchmark datasets SCUT-EnsText and SCUT-Syn show that
CTRNet significantly outperforms existing state-of-the-art methods.
Furthermore, a qualitative experiment on examination papers also demonstrates
the generalization ability of our method. The code and supplementary materials
are available at https://github.com/lcy0604/CTRNet.
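The abstract's core mechanism, pairing CNNs for local texture with a Transformer encoder for global pixel relationships inside the LGCM block, can be illustrated with a short sketch. The code below is a minimal, hypothetical rendering of that local-plus-global pattern in PyTorch; the channel count, encoder depth, and concatenation-based fusion are assumptions for illustration, not CTRNet's actual configuration (see the linked repository for the real implementation).

```python
# Minimal sketch of an LGCM-style local-plus-global block.
# Assumptions (not from the paper): 256 channels, 2 encoder layers,
# 1x1-conv fusion of the two branches.
import torch
import torch.nn as nn


class LGCMBlockSketch(nn.Module):
    def __init__(self, channels: int = 256, num_heads: int = 8, depth: int = 2):
        super().__init__()
        # Local branch: plain convolutions capture neighborhood texture.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # Global branch: a Transformer encoder over flattened spatial tokens
        # models long-range relationships among all positions.
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True
        )
        self.global_enc = nn.TransformerEncoder(layer, num_layers=depth)
        # Fuse the two branches with a 1x1 convolution (an assumption).
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local_feat = self.local(x)
        tokens = x.flatten(2).transpose(1, 2)      # (B, H*W, C)
        global_feat = self.global_enc(tokens)      # (B, H*W, C)
        global_feat = global_feat.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([local_feat, global_feat], dim=1))


if __name__ == "__main__":
    block = LGCMBlockSketch()
    feats = torch.randn(1, 256, 32, 32)  # e.g. downsampled image features
    print(block(feats).shape)            # torch.Size([1, 256, 32, 32])
```

A common variant of this pattern swaps the 1x1-conv fusion for gated or attention-based fusion; the local/global split itself is the essential idea.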
Related papers
- TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering [76.53315206999231]
TextPecker is a plug-and-play structural anomaly perceptive RL strategy.
It mitigates noisy reward signals and works with any text-to-image generator.
It yields average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering.
arXiv Detail & Related papers (2026-02-24T13:40:23Z)
- HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition [4.5311655360445515]
Existing approaches, though they partially address these issues, often struggle to generalize without massive synthetic data.
We propose HTR-ConvText, a model designed to capture fine-grained, stroke-level local features while preserving global contextual dependencies.
We then introduce the ConvText encoder, a hybrid architecture combining global context and local features within a hierarchical structure.
arXiv Detail & Related papers (2025-12-04T17:35:05Z)
- Text-Aware Real-World Image Super-Resolution via Diffusion Model with Joint Segmentation Decoders [14.655107789528673]
We introduce a novel diffusion-based SR framework, namely TADiSR, which integrates text-aware attention and joint segmentation decoders.
We propose a complete pipeline for synthesizing high-quality images with fine-grained full-image text masks.
Our approach substantially enhances text legibility in super-resolved images, achieving state-of-the-art performance across multiple evaluation metrics.
arXiv Detail & Related papers (2025-06-05T05:23:10Z)
- DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation [63.781450025764904]
We propose DynamiCtrl, a novel framework for human animation in the video DiT architecture.
We use a shared VAE encoder for human images and driving poses, unifying them into a common latent space.
We also introduce the "Joint-text" paradigm, which preserves the role of text embeddings to provide global semantic context.
arXiv Detail & Related papers (2025-03-27T08:07:45Z)
- Explicit Relational Reasoning Network for Scene Text Detection [20.310201743941196]
We introduce an explicit relational reasoning network (ERRNet) to elegantly model the component relationships without post-processing.
ERRNet consistently achieves state-of-the-art accuracy while maintaining highly competitive inference speed.
arXiv Detail & Related papers (2024-12-19T09:51:45Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval [73.82017200889906]
Text-video retrieval is a critical multi-modal task to find the most relevant video for a text query.
We propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention.
In contrast to previous prompt tuning methods, we employ the shared latent space to generate local-level text and frame prompts.
arXiv Detail & Related papers (2024-01-19T09:58:06Z)
- CoSeR: Bridging Image and Language for Cognitive Super-Resolution [74.24752388179992]
We introduce the Cognitive Super-Resolution (CoSeR) framework, empowering SR models with the capacity to comprehend low-resolution images.
We achieve this by marrying image appearance and language understanding to generate a cognitive embedding.
To further improve image fidelity, we propose a novel condition injection scheme called "All-in-Attention".
arXiv Detail & Related papers (2023-11-27T16:33:29Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with a Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale [59.01246141215051]
We analyze the factor that leads to degradation from the perspective of language supervision.
We propose a tunable-free pre-training strategy to retain the generalization ability of the text encoder.
We produce a series of models, dubbed TVTSv2, with up to one billion parameters.
arXiv Detail & Related papers (2023-05-23T15:44:56Z)
- A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution [13.934846626570286]
Scene text image super-resolution aims to increase the resolution and readability of the text in low-resolution images.
It remains difficult to reconstruct high-resolution images for spatially deformed texts, especially rotated and curve-shaped ones.
We propose a CNN-based Text ATTention network (TATT) to address this problem.
arXiv Detail & Related papers (2022-03-17T15:28:29Z)
- Contextual Attention Network: Transformer Meets U-Net [0.0]
Convolutional neural networks (CNNs) have become the de facto standard and attained immense success in medical image segmentation.
However, CNN-based methods fail to build long-range dependencies and global context connections.
Recent articles have exploited Transformer variants for medical image segmentation tasks; a minimal sketch of this hybrid CNN-Transformer pattern appears after this list.
arXiv Detail & Related papers (2022-03-02T21:10:24Z)
- TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder to decode the features to text, leading to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG).
arXiv Detail & Related papers (2021-11-16T09:10:39Z)
- Global Context Aware RCNN for Object Detection [1.1939762265857436]
We propose a novel end-to-end trainable framework, called Global Context Aware (GCA) RCNN.
The core component of the GCA framework is a context-aware mechanism, in which both global feature pyramid and attention strategies are used for feature extraction and feature refinement.
In the end, we also present a lightweight version of our method, which only slightly increases model complexity and computational burden.
arXiv Detail & Related papers (2020-12-04T14:56:46Z)
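As referenced in the "Contextual Attention Network: Transformer Meets U-Net" entry above, the hybrid pattern those papers share can be sketched as a tiny U-Net whose bottleneck is a Transformer encoder: convolutions capture local detail while self-attention supplies the long-range dependencies plain CNNs lack. Every size and the overall layout below are illustrative assumptions, not that paper's actual architecture.

```python
# Generic sketch of a U-Net with a Transformer-encoder bottleneck.
# All dimensions are hypothetical; this is not the paper's model.
import torch
import torch.nn as nn


def conv_block(cin: int, cout: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
    )


class TransformerUNetSketch(nn.Module):
    def __init__(self, in_ch: int = 3, base: int = 32, num_classes: int = 2):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        # Transformer bottleneck over flattened spatial tokens adds
        # global context to the convolutional features.
        layer = nn.TransformerEncoderLayer(
            d_model=base * 2, nhead=4, batch_first=True
        )
        self.bottleneck = nn.TransformerEncoder(layer, num_layers=2)
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)  # skip connection doubles channels
        self.head = nn.Conv2d(base, num_classes, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s1 = self.enc1(x)                       # local features, full resolution
        s2 = self.enc2(self.pool(s1))           # downsampled features
        b, c, h, w = s2.shape
        tokens = s2.flatten(2).transpose(1, 2)  # (B, H*W, C)
        g = self.bottleneck(tokens).transpose(1, 2).reshape(b, c, h, w)
        d = self.up(g)                          # upsample back to full resolution
        d = self.dec1(torch.cat([d, s1], dim=1))
        return self.head(d)                     # per-pixel class logits


if __name__ == "__main__":
    net = TransformerUNetSketch()
    print(net(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 2, 64, 64])
```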
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.