DanceText: A Training-Free Layered Framework for Controllable Multilingual Text Transformation in Images
- URL: http://arxiv.org/abs/2504.14108v2
- Date: Fri, 26 Sep 2025 02:03:49 GMT
- Title: DanceText: A Training-Free Layered Framework for Controllable Multilingual Text Transformation in Images
- Authors: Zhenyu Yu, Mohd Yamani Idna Idris, Hua Wang, Pei Wang, Rizwan Qureshi, Shaina Raza, Aman Chadha, Yong Xiang, Zhixiang Chen
- Abstract summary: DanceText is a training-free framework for multilingual text editing in images. It supports complex geometric transformations and achieves seamless foreground-background integration.
- Score: 28.48453375674059
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present DanceText, a training-free framework for multilingual text editing in images, designed to support complex geometric transformations and achieve seamless foreground-background integration. While diffusion-based generative models have shown promise in text-guided image synthesis, they often lack controllability and fail to preserve layout consistency under non-trivial manipulations such as rotation, translation, scaling, and warping. To address these limitations, DanceText introduces a layered editing strategy that separates text from the background, allowing geometric transformations to be performed in a modular and controllable manner. A depth-aware module is further proposed to align appearance and perspective between the transformed text and the reconstructed background, enhancing photorealism and spatial consistency. Importantly, DanceText adopts a fully training-free design by integrating pretrained modules, allowing flexible deployment without task-specific fine-tuning. Extensive experiments on the AnyWord-3M benchmark demonstrate that our method achieves superior performance in visual quality, especially under large-scale and complex transformation scenarios. Code is available at https://github.com/YuZhenyuLindy/DanceText.git.
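The layered editing strategy the abstract describes (separate the text layer, reconstruct the background, transform the text, composite) can be sketched as a toy pipeline. This is a minimal NumPy illustration, not the authors' implementation: the brightness mask, median-fill "inpainting", and pure translation are simplifying stand-ins for DanceText's segmentation, background-reconstruction, and geometric-transformation modules.

```python
import numpy as np

# Toy scene: uniform gray background with a bright "text" block stamped on it.
H, W = 64, 128
scene = np.full((H, W), 0.5)
text_mask = np.zeros((H, W), dtype=bool)
text_mask[20:30, 30:70] = True          # stand-in for detected text pixels
scene[text_mask] = 1.0                  # "text" is brighter than background

# 1) Layer separation: lift the text layer out via its mask.
text_layer = np.where(text_mask, scene, 0.0)

# 2) Background reconstruction: a trivial median fill stands in for a
#    learned inpainting module.
background = scene.copy()
background[text_mask] = np.median(scene[~text_mask])

# 3) Geometric transform applied to the text layer only (a translation here;
#    the framework also targets rotation, scaling, and warping).
dy, dx = 10, 25
moved_layer = np.roll(np.roll(text_layer, dy, axis=0), dx, axis=1)
moved_mask = np.roll(np.roll(text_mask, dy, axis=0), dx, axis=1)

# 4) Composite the transformed text over the reconstructed background.
result = np.where(moved_mask, moved_layer, background)
```

Because each stage operates on its own layer, the transform in step 3 can be swapped out without touching segmentation or background reconstruction, which is the modularity the abstract emphasizes.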
Related papers
- TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering [18.337757379089037]
We introduce TextEditBench, a comprehensive evaluation benchmark for text-centric regions in images. Our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies. We also propose a novel evaluation dimension, Semantic Expectation, which measures a model's ability to maintain semantic consistency, contextual coherence, and cross-modal alignment.
arXiv Detail & Related papers (2025-12-18T07:37:08Z) - Qwen-Image Technical Report [86.46471547116158]
We present Qwen-Image, an image generation foundation model that achieves significant advances in complex text rendering and precise image editing. We design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Qwen-Image performs exceptionally well in alphabetic languages such as English, and also achieves remarkable progress on more challenging logographic languages like Chinese.
arXiv Detail & Related papers (2025-08-04T11:49:20Z) - WordCon: Word-level Typography Control in Scene Text Rendering [12.00205643907721]
We construct a word-level controlled scene text dataset and introduce the Text-Image Alignment framework. We also propose WordCon, a hybrid parameter-efficient fine-tuning (PEFT) method. The datasets and source code will be available for academic use.
arXiv Detail & Related papers (2025-06-26T14:00:38Z) - ShapeShift: Towards Text-to-Shape Arrangement Synthesis with Content-Aware Geometric Constraints [13.2441524021269]
ShapeShift is a text-guided image-to-image translation task that requires rearranging the input set of rigid shapes into non-overlapping configurations. We introduce a content-aware collision resolution mechanism that applies minimal semantically coherent adjustments when overlaps occur. Our approach yields interpretable compositions where spatial relationships clearly embody the textual prompt.
arXiv Detail & Related papers (2025-03-18T20:48:58Z) - Beyond Flat Text: Dual Self-inherited Guidance for Visual Text Generation [17.552733309504486]
In real-world images, slanted or curved texts, especially those on cans, banners, or badges, appear as frequently as flat texts due to artistic design or layout constraints. We introduce a new training-free framework, STGen, which accurately generates visual texts in challenging scenarios.
arXiv Detail & Related papers (2025-01-10T11:44:59Z) - First Creating Backgrounds Then Rendering Texts: A New Paradigm for Visual Text Blending [5.3798706094384725]
We propose a new visual text blending paradigm including both creating backgrounds and rendering texts.
Specifically, a background generator is developed to produce high-fidelity and text-free natural images.
We also explore several downstream applications based on our method, including scene text dataset synthesis for boosting scene text detectors.
arXiv Detail & Related papers (2024-10-14T05:23:43Z) - TextMaster: A Unified Framework for Realistic Text Editing via Glyph-Style Dual-Control [5.645654441900668]
We propose TextMaster, a solution capable of accurately editing text across various scenarios and image regions. Our method enhances the accuracy and fidelity of text rendering by incorporating high-resolution standard glyph information. We also propose a novel style injection technique that enables controllable style transfer for the injected text.
arXiv Detail & Related papers (2024-10-13T15:39:39Z) - InverseMeetInsert: Robust Real Image Editing via Geometric Accumulation Inversion in Guided Diffusion Models [20.90990477016161]
We introduce Geometry-Inverse-Meet-Pixel-Insert, short for GEO, an exceptionally versatile image editing technique.
Our approach seamlessly integrates text prompts and image prompts to yield diverse and precise editing outcomes.
arXiv Detail & Related papers (2024-09-18T06:43:40Z) - Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing [60.730661748555214]
We introduce Task-Oriented Diffusion Inversion (TODInv), a novel framework that inverts and edits real images tailored to specific editing tasks.
TODInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability.
arXiv Detail & Related papers (2024-08-23T22:16:34Z) - AnyTrans: Translate AnyText in the Image with Large Scale Models [88.5887934499388]
This paper introduces AnyTrans, an all-encompassing framework for the task of Translating AnyText in the Image (TATI).
Our framework incorporates contextual cues from both textual and visual elements during translation.
We have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.
arXiv Detail & Related papers (2024-06-17T11:37:48Z) - TextCraftor: Your Text Encoder Can be Image Quality Controller [65.27457900325462]
Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have revolutionized the field of content generation.
We propose TextCraftor, a fine-tuning approach that enhances the performance of text-to-image diffusion models.
arXiv Detail & Related papers (2024-03-27T19:52:55Z) - Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Consistently portraying the same subject across diverse prompts remains challenging for text-to-image models.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z) - SPIRE: Semantic Prompt-Driven Image Restoration [66.26165625929747]
We develop SPIRE, a Semantic and restoration Prompt-driven Image Restoration framework.
Our approach is the first framework that supports fine-level instruction through language-based quantitative specification of the restoration strength.
Our experiments demonstrate the superior restoration performance of SPIRE compared to the state of the arts.
arXiv Detail & Related papers (2023-12-18T17:02:30Z) - Text-Driven Image Editing via Learnable Regions [74.45313434129005]
We introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches.
We show that this simple approach enables flexible editing that is compatible with current image generation models.
Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions.
arXiv Detail & Related papers (2023-11-28T02:27:31Z) - Textual and Visual Prompt Fusion for Image Editing via Step-Wise Alignment [10.82748329166797]
We propose a framework that integrates a fusion of generated visual references and text guidance into the semantic latent space of a frozen pre-trained diffusion model. Using only a tiny neural network, our framework provides control over diverse content and attributes, driven intuitively by the text prompt.
arXiv Detail & Related papers (2023-08-30T08:40:15Z) - Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z) - FlexIT: Towards Flexible Semantic Image Translation [59.09398209706869]
We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing.
First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space.
We iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms.
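The two FlexIT steps summarized above (a target point in CLIP space, then regularized iterative optimization of the image toward it) can be illustrated with a toy quadratic version. Everything here is a hypothetical stand-in: a random linear map replaces the CLIP encoders, random vectors replace the image and instruction embeddings, and a single proximity penalty replaces the paper's regularization terms.

```python
import numpy as np

rng = np.random.default_rng(0)
D, P = 16, 64                              # embedding dim, flattened image size
E = rng.normal(size=(D, P)) / np.sqrt(P)   # stand-in for a frozen image encoder

x0 = rng.normal(size=P)                    # input "image" (flattened)
t_text = rng.normal(size=D)                # stand-in text-instruction embedding

# Step 1: combine image and text embeddings into a single target point.
alpha = 0.7
target = (1 - alpha) * (E @ x0) + alpha * t_text

# Step 2: iteratively move the image toward the target, with a proximity
# penalty keeping it close to the input (a stand-in coherence regularizer).
x, lr, lam = x0.copy(), 0.1, 0.05
for _ in range(500):
    grad = E.T @ (E @ x - target) + lam * (x - x0)
    x -= lr * grad
```

The balance between the embedding loss and the proximity penalty (`lam`) mirrors the core trade-off in text-driven editing: move far enough in embedding space to realize the instruction, but stay close enough to the input to preserve its content.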
arXiv Detail & Related papers (2022-03-09T13:34:38Z) - Towards Full-to-Empty Room Generation with Structure-Aware Feature Encoding and Soft Semantic Region-Adaptive Normalization [67.64622529651677]
We propose a simple yet effective adjusted fully differentiable soft semantic region-adaptive normalization module (softSEAN) block.
Besides mitigating training complexity and non-differentiability issues, our approach surpasses the compared methods both quantitatively and qualitatively.
Our softSEAN block can be used as a drop-in module for existing discriminative and generative models.
arXiv Detail & Related papers (2021-12-10T09:00:13Z) - TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
A visual-linguistic similarity module learns text-image matching by mapping images and text into a common embedding space.
Instance-level optimization is used for identity preservation during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content above (including all information) and is not responsible for any consequences of its use.