TextMaster: A Unified Framework for Realistic Text Editing via Glyph-Style Dual-Control
- URL: http://arxiv.org/abs/2410.09879v2
- Date: Tue, 05 Aug 2025 16:34:56 GMT
- Title: TextMaster: A Unified Framework for Realistic Text Editing via Glyph-Style Dual-Control
- Authors: Zhenyu Yan, Jian Wang, Aoqiang Wang, Yuhan Li, Wenxiang Shang, Ran Lin,
- Abstract summary: We propose TextMaster, a solution capable of accurately editing text across various scenarios and image regions. Our method enhances the accuracy and fidelity of text rendering by incorporating high-resolution standard glyph information. We also propose a novel style injection technique that enables controllable style transfer for the injected text.
- Score: 5.645654441900668
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In image editing tasks, high-quality text editing capabilities can significantly reduce both human and material resource costs. Existing methods, however, face significant limitations in terms of stroke accuracy for complex text and controllability of generated text styles. To address these challenges, we propose TextMaster, a solution capable of accurately editing text across various scenarios and image regions, while ensuring proper layout and controllable text style. Our method enhances the accuracy and fidelity of text rendering by incorporating high-resolution standard glyph information and applying perceptual loss within the text editing region. Additionally, we leverage an attention mechanism to compute intermediate layer bounding box regression loss for each character, enabling the model to learn text layout across varying contexts. Furthermore, we propose a novel style injection technique that enables controllable style transfer for the injected text. Through comprehensive experiments, we demonstrate the state-of-the-art performance of our method.
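The abstract mentions applying a perceptual loss only within the text editing region. The sketch below illustrates that general idea under stated assumptions and is not the authors' implementation: `toy_features` is a hypothetical stand-in for a pretrained feature extractor, and the `(top, left, bottom, right)` box format and the function names are illustrative choices.

```python
import numpy as np

def toy_features(img):
    """Hypothetical stand-in for a pretrained feature extractor:
    stacks horizontal and vertical finite-difference gradients."""
    gx = np.diff(img, axis=1)[:-1, :]  # horizontal gradients, shape (H-1, W-1)
    gy = np.diff(img, axis=0)[:, :-1]  # vertical gradients, shape (H-1, W-1)
    return np.stack([gx, gy])

def region_perceptual_loss(pred, target, box):
    """Mean squared feature difference, computed only on the crop given by
    box = (top, left, bottom, right). The box format is an assumption."""
    top, left, bottom, right = box
    pred_crop = pred[top:bottom, left:right]
    target_crop = target[top:bottom, left:right]
    diff = toy_features(pred_crop) - toy_features(target_crop)
    return float(np.mean(diff ** 2))
```

Because both images are cropped before feature extraction, pixels outside the box cannot influence the loss. A real perceptual loss would replace `toy_features` with activations from a pretrained network.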
Related papers
- WordCon: Word-level Typography Control in Scene Text Rendering [12.00205643907721]
We construct a word-level controlled scene text dataset and introduce the Text-Image Alignment framework. We also propose WordCon, a hybrid parameter-efficient fine-tuning (PEFT) method. The datasets and source code will be available for academic use.
arXiv Detail & Related papers (2025-06-26T14:00:38Z)
- DCEdit: Dual-Level Controlled Image Editing via Precisely Localized Semantics [71.78350994830885]
We present a novel approach to improving text-guided image editing using diffusion-based models. Our method uses visual and textual self-attention to enhance the cross-attention map, which can serve as regional cues to improve editing performance. To fully compare our methods with other DiT-based approaches, we construct the RW-800 benchmark, featuring high-resolution images, long descriptive texts, real-world images, and a new text editing task.
arXiv Detail & Related papers (2025-03-21T02:14:03Z)
- SceneVTG++: Controllable Multilingual Visual Text Generation in the Wild [55.619708995575785]
The text in natural scene images needs to meet the following four key criteria. The generated text can facilitate the training of natural scene OCR (Optical Character Recognition) tasks. The generated images have superior utility in OCR tasks like text detection and text recognition.
arXiv Detail & Related papers (2025-01-06T12:09:08Z)
- TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control [5.3798706094384725]
We propose TextCtrl, a diffusion-based method that edits text with prior guidance control.
Our method consists of two key components: (i) By constructing fine-grained text style disentanglement and robust text structure representation, TextCtrl explicitly incorporates Style-Structure guidance into model design and network training, significantly improving text style consistency and rendering accuracy.
To fill the vacancy of the real-world STE evaluation benchmark, we create the first real-world image-pair dataset termed ScenePair for fair comparisons.
arXiv Detail & Related papers (2024-10-14T03:50:39Z)
- TextMastero: Mastering High-Quality Scene Text Editing in Diverse Languages and Styles [12.182588762414058]
Scene text editing aims to modify texts on images while maintaining the style of newly generated text similar to the original.
Recent works leverage diffusion models, showing improved results, yet still face challenges.
We present TextMastero, a carefully designed multilingual scene text editing architecture based on latent diffusion models (LDMs).
arXiv Detail & Related papers (2024-08-20T08:06:09Z)
- DragText: Rethinking Text Embedding in Point-based Image Editing [3.1923251959845214]
We show that during the progressive editing of an input image in a diffusion model, the text embedding remains constant.
We propose DragText, which optimizes the text embedding in conjunction with the dragging process to pair with the modified image embedding.
arXiv Detail & Related papers (2024-07-25T07:57:55Z)
- ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting [8.397246652127793]
We propose a new pre-training method called OCR-Text Destylization Modeling (ODM).
ODM transfers diverse styles of text found in images to a uniform style based on the text prompt.
Our method significantly improves performance and outperforms current pre-training methods in scene text detection and spotting tasks.
arXiv Detail & Related papers (2024-03-01T06:13:53Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition [56.968108142307976]
We propose a novel approach called Class-Aware Mask-guided feature refinement (CAM).
Our approach introduces canonical class-aware glyph masks to suppress background and text style noise.
By enhancing the alignment between the canonical mask feature and the text feature, the module ensures more effective fusion.
arXiv Detail & Related papers (2024-02-21T09:22:45Z)
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- Text-Driven Image Editing via Learnable Regions [74.45313434129005]
We introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches.
We show that this simple approach enables flexible editing that is compatible with current image generation models.
Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions.
arXiv Detail & Related papers (2023-11-28T02:27:31Z)
- FASTER: A Font-Agnostic Scene Text Editing and Rendering Framework [19.564048493848272]
Scene Text Editing (STE) is a challenging research problem that primarily aims to modify existing text in an image.
Existing style-transfer-based approaches have shown sub-par editing performance due to complex image backgrounds, diverse font attributes, and varying word lengths within the text.
We propose a novel font-agnostic scene text editing and rendering framework, named FASTER, for simultaneously generating text in arbitrary styles and locations.
arXiv Detail & Related papers (2023-08-05T15:54:06Z)
- Exploring Stroke-Level Modifications for Scene Text Editing [86.33216648792964]
Scene text editing (STE) aims to replace text with the desired one while preserving background and styles of the original text.
Previous methods of editing the whole image have to learn different translation rules of background and text regions simultaneously.
We propose a novel network by MOdifying Scene Text image at strokE Level (MOSTEL).
arXiv Detail & Related papers (2022-12-05T02:10:59Z)
- Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy.
We propose a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z)
- A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution [13.934846626570286]
Scene text image super-resolution aims to increase the resolution and readability of the text in low-resolution images.
It remains difficult to reconstruct high-resolution images for spatially deformed texts, especially rotated and curve-shaped ones.
We propose a CNN-based Text ATTention network (TATT) to address this problem.
arXiv Detail & Related papers (2022-03-17T15:28:29Z)
- MOST: A Multi-Oriented Scene Text Detector with Localization Refinement [67.35280008722255]
We propose a new algorithm for scene text detection, which puts forward a set of strategies to significantly improve the quality of text localization.
Specifically, a Text Feature Alignment Module (TFAM) is proposed to dynamically adjust the receptive fields of features.
A Position-Aware Non-Maximum Suppression (PA-NMS) module is devised to exclude unreliable ones.
arXiv Detail & Related papers (2021-04-02T14:34:41Z)
- STEFANN: Scene Text Editor using Font Adaptive Neural Network [18.79337509555511]
We propose a method to modify text in an image at the character level.
We propose two different neural network architectures - (a) FANnet to achieve structural consistency with source font and (b) Colornet to preserve source color.
Our method works as a unified platform for modifying text in images.
arXiv Detail & Related papers (2019-03-04T11:56:53Z)