TextMaster: Universal Controllable Text Edit
- URL: http://arxiv.org/abs/2410.09879v1
- Date: Sun, 13 Oct 2024 15:39:39 GMT
- Title: TextMaster: Universal Controllable Text Edit
- Authors: Aoqiang Wang, Jian Wang, Zhenyu Yan, Wenxiang Shang, Ran Lin, Zhao Zhang
- Abstract summary: We propose TextMaster, a solution capable of accurately editing text with high realism and proper layout in any scenario and image area.
Our approach employs adaptive standard letter spacing as guidance during training and uses adaptive mask boosting to prevent the leakage of text position and size information.
By injecting high-resolution standard font information and applying perceptual loss in the text editing area, we further enhance text rendering accuracy and fidelity.
- Score: 5.7173370525015095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In image editing tasks, high-quality text editing capabilities can significantly reduce human and material resource costs. Current methods rely heavily on training data based on OCR text segment detection, where the text is tightly aligned with the mask area. This reliance creates a strong dependency on the mask area and lacks modules for adjusting text spacing and size in various scenarios. When the amount of text to be edited does not match the modification area or when the mask area is too large, significant issues may arise. Furthermore, no existing methods have explored controllable style transfer for text editing. To address these challenges, we propose TextMaster, a solution capable of accurately editing text with high realism and proper layout in any scenario and image area. Our approach employs adaptive standard letter spacing as guidance during training and uses adaptive mask boosting to prevent the leakage of text position and size information. We also utilize an attention mechanism to calculate the bounding box regression loss for each character, making text layout methods learnable across different scenarios. By injecting high-resolution standard font information and applying perceptual loss in the text editing area, we further enhance text rendering accuracy and fidelity. Additionally, we achieve style consistency between the modified and target text through a novel style injection method. Extensive qualitative and quantitative evaluations demonstrate that our method outperforms all existing approaches.
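The abstract does not include implementation details, so as a rough illustration only, here is a minimal PyTorch-style sketch of how a per-character bounding-box regression loss could be derived from cross-attention maps; the tensor layout, box parameterization, and loss choice are assumptions, not the authors' code.
```python
# Illustrative sketch only: the tensor layout, box parameterization, and loss
# choice below are assumptions, not the authors' released code.
import torch
import torch.nn.functional as F

def char_bbox_loss(attn_maps, target_boxes):
    """attn_maps: (N, H, W) cross-attention map for each of N characters.
    target_boxes: (N, 4) ground-truth (cx, cy, w, h), normalized to [0, 1]."""
    n, h, w = attn_maps.shape
    probs = attn_maps.flatten(1).softmax(dim=-1).view(n, h, w)
    ys = torch.linspace(0, 1, h, device=attn_maps.device)
    xs = torch.linspace(0, 1, w, device=attn_maps.device)
    # Expected center of each character's attention mass.
    cy = (probs.sum(dim=2) * ys).sum(dim=1)
    cx = (probs.sum(dim=1) * xs).sum(dim=1)
    # Spread of the attention mass as a soft width/height estimate.
    var_y = (probs.sum(dim=2) * (ys - cy[:, None]) ** 2).sum(dim=1)
    var_x = (probs.sum(dim=1) * (xs - cx[:, None]) ** 2).sum(dim=1)
    pred = torch.stack([cx, cy, 2 * var_x.sqrt(), 2 * var_y.sqrt()], dim=-1)
    return F.smooth_l1_loss(pred, target_boxes)
```
In the full method this term would presumably be combined with the diffusion and perceptual losses on the edited region; the abstract does not specify the weighting.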
Related papers
- TextMastero: Mastering High-Quality Scene Text Editing in Diverse Languages and Styles [12.182588762414058]
Scene text editing aims to modify texts on images while maintaining the style of newly generated text similar to the original.
Recent works leverage diffusion models, showing improved results, yet still face challenges.
We present TextMastero - a carefully designed multilingual scene text editing architecture based on latent diffusion models (LDMs)
arXiv Detail & Related papers (2024-08-20T08:06:09Z)
- DragText: Rethinking Text Embedding in Point-based Image Editing [3.1923251959845214]
We show that during the progressive editing of an input image in a diffusion model, the text embedding remains constant.
We propose DragText, which optimizes the text embedding in conjunction with the dragging process to pair it with the modified image embedding.
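As a rough illustration of jointly updating the text embedding and the image latent during a drag step, here is a minimal sketch; the feature extractor, motion-supervision loss, and step size are assumptions, not the paper's formulation.
```python
# Illustrative sketch only: the feature extractor, motion-supervision loss,
# and step size are assumptions, not DragText's actual formulation.
import torch

def drag_step(latent, text_emb, feature_fn, handle, target, lr=0.01):
    """One joint update of the image latent and the text embedding.
    feature_fn(latent, text_emb) -> (B, C, H, W) intermediate features;
    handle / target are (row, col) points to move from / to."""
    latent = latent.detach().requires_grad_(True)
    text_emb = text_emb.detach().requires_grad_(True)
    feats = feature_fn(latent, text_emb)
    # Motion supervision: pull the feature at the handle toward the target point.
    loss = (feats[..., handle[0], handle[1]]
            - feats[..., target[0], target[1]].detach()).pow(2).mean()
    loss.backward()
    with torch.no_grad():
        latent -= lr * latent.grad
        text_emb -= lr * text_emb.grad  # the text embedding moves with the image
    return latent.detach(), text_emb.detach()
```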
arXiv Detail & Related papers (2024-07-25T07:57:55Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
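As a rough illustration of the general idea of replacing one-hot targets with corpus-derived label distributions, here is a minimal sketch; the bigram prior, smoothing, and mixing weight are assumptions, not the paper's construction.
```python
# Illustrative sketch only: the bigram prior, add-one smoothing, and mixing
# weight are assumptions, not the paper's exact construction.
import torch
import torch.nn.functional as F

def bigram_prior(corpus, charset):
    """Next-character distribution P(c_t | c_{t-1}) estimated from a text corpus."""
    idx = {c: i for i, c in enumerate(charset)}
    counts = torch.ones(len(charset), len(charset))  # add-one smoothing
    for word in corpus:
        for prev, nxt in zip(word, word[1:]):
            if prev in idx and nxt in idx:
                counts[idx[prev], idx[nxt]] += 1
    return counts / counts.sum(dim=1, keepdim=True)

def soft_target_loss(logits, prev_chars, gt_chars, prior, alpha=0.9):
    """Soft cross-entropy against a mix of the one-hot label and the corpus prior.
    logits: (B, V); prev_chars, gt_chars: (B,) long tensors; prior: (V, V)."""
    one_hot = F.one_hot(gt_chars, num_classes=logits.size(-1)).float()
    target = alpha * one_hot + (1 - alpha) * prior[prev_chars]
    return -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```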
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition [56.968108142307976]
We propose a novel approach called Class-Aware Mask-guided feature refinement (CAM)
Our approach introduces canonical class-aware glyph masks to suppress background and text style noise.
By enhancing the alignment between the canonical mask feature and the text feature, the module ensures more effective fusion.
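A minimal sketch of the general mask-guided refinement pattern this describes, with a canonical glyph mask gating the visual features before fusion; the 1x1-convolution gating and module interface are assumptions, not the authors' architecture.
```python
# Illustrative sketch only: the 1x1-convolution gating and fusion below are
# assumptions; the paper's refinement module is more elaborate.
import torch
import torch.nn as nn

class MaskGuidedRefinement(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.align = nn.Conv2d(1, channels, kernel_size=1)     # lift mask into feature space
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat, glyph_mask):
        """feat: (B, C, H, W) visual features; glyph_mask: (B, 1, H, W) in [0, 1]."""
        gated = feat * torch.sigmoid(self.align(glyph_mask))   # suppress background/style noise
        return self.fuse(torch.cat([gated, feat], dim=1))      # fuse refined and original features
```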
arXiv Detail & Related papers (2024-02-21T09:22:45Z)
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- FASTER: A Font-Agnostic Scene Text Editing and Rendering Framework [19.564048493848272]
Scene Text Editing (STE) is a challenging research problem that primarily aims at modifying existing text in an image.
Existing style-transfer-based approaches have shown sub-par editing performance due to complex image backgrounds, diverse font attributes, and varying word lengths within the text.
We propose a novel font-agnostic scene text editing and rendering framework, named FASTER, for simultaneously generating text in arbitrary styles and locations.
arXiv Detail & Related papers (2023-08-05T15:54:06Z)
- Exploring Stroke-Level Modifications for Scene Text Editing [86.33216648792964]
Scene text editing (STE) aims to replace text with the desired one while preserving the background and style of the original text.
Previous methods that edit the whole image have to learn different translation rules for background and text regions simultaneously.
We propose a novel network by MOdifying Scene Text image at strokE Level (MOSTEL)
arXiv Detail & Related papers (2022-12-05T02:10:59Z)
- Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy, yet several pivotal gaps remain.
We propose a novel text-to-image method that addresses these gaps by enabling a simple control mechanism complementary to text in the form of a scene.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z)
- A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution [13.934846626570286]
Scene text image super-resolution aims to increase the resolution and readability of the text in low-resolution images.
It remains difficult to reconstruct high-resolution images for spatially deformed texts, especially rotated and curve-shaped ones.
We propose a CNN based Text ATTention network (TATT) to address this problem.
arXiv Detail & Related papers (2022-03-17T15:28:29Z)
- MOST: A Multi-Oriented Scene Text Detector with Localization Refinement [67.35280008722255]
We propose a new algorithm for scene text detection, which puts forward a set of strategies to significantly improve the quality of text localization.
Specifically, a Text Feature Alignment Module (TFAM) is proposed to dynamically adjust the receptive fields of features.
A Position-Aware Non-Maximum Suppression (PA-NMS) module is devised to exclude unreliable detections.
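For reference, a minimal sketch of the plain greedy NMS that PA-NMS builds on; the position-aware reliability weighting itself is not specified in the abstract and is omitted here.
```python
# Illustrative sketch only: plain greedy NMS over axis-aligned boxes. PA-NMS
# additionally weights raw detections by positional reliability, which the
# abstract does not detail and is therefore omitted.
import torch
from torchvision.ops import box_iou

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,). Returns indices kept."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())
        if order.numel() == 1:
            break
        ious = box_iou(boxes[best].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= iou_thresh]
    return torch.tensor(keep, dtype=torch.long)
```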
arXiv Detail & Related papers (2021-04-02T14:34:41Z)
- STEFANN: Scene Text Editor using Font Adaptive Neural Network [18.79337509555511]
We propose a method to modify text in an image at the character level.
We propose two different neural network architectures: (a) FANnet to achieve structural consistency with the source font and (b) Colornet to preserve the source color.
Our method works as a unified platform for modifying text in images.
arXiv Detail & Related papers (2019-03-04T11:56:53Z)