Text-Conditioned Background Generation for Editable Multi-Layer Documents
- URL: http://arxiv.org/abs/2512.17151v1
- Date: Fri, 19 Dec 2025 01:10:24 GMT
- Title: Text-Conditioned Background Generation for Editable Multi-Layer Documents
- Authors: Taewon Kang, Joseph K J, Chris Tensmeyer, Jihyung Kil, Wanrong Zhu, Ming C. Lin, Vlad I. Morariu
- Abstract summary: We present a framework for document-centric background generation with multi-page editing and thematic continuity.
Our training-free framework produces visually coherent, text-preserving documents, bridging generative modeling with natural design.
- Score: 32.896370365677136
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present a framework for document-centric background generation with multi-page editing and thematic continuity. To ensure text regions remain readable, we employ a latent masking formulation that softly attenuates updates in the diffusion space, inspired by smooth barrier functions in physics and numerical optimization. In addition, we introduce Automated Readability Optimization (ARO), which automatically places semi-transparent, rounded backing shapes behind text regions. ARO determines the minimal opacity needed to satisfy perceptual contrast standards (WCAG 2.2) relative to the underlying background, ensuring readability while maintaining aesthetic harmony without human intervention. Multi-page consistency is maintained through a summarization-and-instruction process, where each page is distilled into a compact representation that recursively guides subsequent generations. This design reflects how humans build continuity by retaining prior context, ensuring that visual motifs evolve coherently across an entire document. Our method further treats a document as a structured composition in which text, figures, and backgrounds are preserved or regenerated as separate layers, allowing targeted background editing without compromising readability. Finally, user-provided prompts allow stylistic adjustments in color and texture, balancing automated consistency with flexible customization. Our training-free framework produces visually coherent, text-preserving, and thematically aligned documents, bridging generative modeling with natural design workflows.
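The abstract describes three concrete mechanisms that are easy to make precise. First, the latent-masking idea: rather than hard-clamping text-region latents at each diffusion step, updates are attenuated through a smooth barrier. Below is a minimal sketch of that pattern; the sigmoid barrier and the names `smooth_mask` and `masked_update` are illustrative assumptions, since the abstract does not give the exact attenuation function.

```python
import numpy as np

def smooth_mask(dist, width=4.0):
    # Sigmoid barrier over the latent grid: ~0 where dist ~ 0 (inside a
    # text region), rising smoothly toward 1 away from text.
    return 1.0 / (1.0 + np.exp(-(dist - width) / (0.25 * width)))

def masked_update(z_prev, z_denoised, mask):
    # Softly attenuate the diffusion update: text-region latents are
    # pulled back toward their previous values instead of being hard-
    # clamped, which avoids visible seams at region boundaries.
    return mask * z_denoised + (1.0 - mask) * z_prev
```

Second, ARO reduces to a one-dimensional search for the smallest backing-shape opacity whose composite meets the WCAG 2.2 contrast ratio. The sketch below assumes a single representative background color per text region and a white backing shape by default; the paper itself presumably evaluates contrast against the actual rendered background.

```python
def rel_luminance(rgb):
    # WCAG 2.x relative luminance of an sRGB color with channels in [0, 1].
    def lin(c):
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (lin(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast(c1, c2):
    # WCAG contrast ratio: (L_lighter + 0.05) / (L_darker + 0.05).
    hi, lo = sorted((rel_luminance(c1), rel_luminance(c2)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)

def min_backing_opacity(text_rgb, bg_rgb, shape_rgb=(1.0, 1.0, 1.0),
                        target=4.5, steps=100):
    # Sweep opacity from 0 to 1 and return the smallest alpha whose
    # composited backing gives the text the target contrast
    # (4.5:1 is the WCAG AA threshold for normal-size text).
    for i in range(steps + 1):
        a = i / steps
        blended = tuple(a * s + (1 - a) * b
                        for s, b in zip(shape_rgb, bg_rgb))
        if contrast(text_rgb, blended) >= target:
            return a
    return 1.0
```

For example, `min_backing_opacity((0, 0, 0), (0.15, 0.2, 0.6))` returns the lowest white-backing opacity that lets black text reach AA contrast over a dark blue background, while keeping the background as visible as possible.

Third, the multi-page consistency mechanism is a recursive summarize-then-condition loop. A schematic of that control flow, with `generate_background` and `summarize` as placeholders for the paper's unspecified diffusion and summarization components:

```python
def generate_document_backgrounds(pages, generate_background, summarize):
    context = ""            # compact representation of all prior pages
    backgrounds = []
    for page in pages:
        # Condition each page's background on the distilled prior context
        # so visual motifs evolve coherently across the document.
        bg = generate_background(page, context)
        backgrounds.append(bg)
        context = summarize(context, bg)  # distill for the next page
    return backgrounds
```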
Related papers
- All-in-One Conditioning for Text-to-Image Synthesis [45.22434803596108]
We propose a novel approach that grounds text-to-image synthesis within the framework of scene graph structures.
We introduce a zero-shot, scene graph-based conditioning mechanism that generates soft visual guidance during inference.
This enables the model to maintain text-image alignment while supporting lightweight, coherent, and diverse image synthesis.
arXiv Detail & Related papers (2026-02-09T20:16:19Z)
- ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation [14.341691123354195]
ASemConsist enables explicit semantic control over character identity without sacrificing prompt alignment.
Our framework achieves state-of-the-art performance, effectively overcoming prior trade-offs.
arXiv Detail & Related papers (2025-12-29T07:06:57Z)
- Loom: Diffusion-Transformer for Interleaved Generation [17.092197559386463]
Interleaved text-image generation aims to jointly produce coherent visual frames and aligned textual descriptions within a single sequence.
We present Loom, a unified diffusion-transformer framework for interleaved text-image generation.
arXiv Detail & Related papers (2025-12-20T07:33:59Z)
- Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt [14.734857939203811]
We propose a training-free approach that addresses semantic entanglement from a subject perspective.
Our approach significantly improves both subject consistency and text alignment over existing baselines.
arXiv Detail & Related papers (2025-12-18T11:55:06Z)
- ReMix: Towards a Unified View of Consistent Character Generation and Editing [22.04681457337335]
ReMix is a unified framework for character-consistent generation and editing.
It consists of two core components: the ReMix Module and IP-ControlNet.
ReMix supports a wide range of tasks, including personalized generation, image editing, style transfer, and multi-condition synthesis.
arXiv Detail & Related papers (2025-10-11T10:31:56Z)
- Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations [131.33758144860988]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity.
Current end-to-end frameworks suffer from a critical spatial-temporal trade-off.
We propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics.
arXiv Detail & Related papers (2025-07-07T06:54:44Z)
- Recognition-Synergistic Scene Text Editing [41.91470824144351]
Scene text editing aims to modify text content within scene images while maintaining style consistency.
Traditional methods achieve this by explicitly disentangling style and content from the source image and then fusing the style with the target content.
We introduce Recognition-Synergistic Scene Text Editing (RS-STE), a novel approach that fully exploits the intrinsic synergy of text recognition for editing.
arXiv Detail & Related papers (2025-03-11T12:50:38Z)
- Beyond Flat Text: Dual Self-inherited Guidance for Visual Text Generation [17.552733309504486]
In real-world images, slanted or curved texts, especially those on cans, banners, or badges, appear as frequently as flat texts due to artistic design or layout constraints.
We introduce a new training-free framework, STGen, which accurately generates visual texts in challenging scenarios.
arXiv Detail & Related papers (2025-01-10T11:44:59Z)
- Towards Visual Text Design Transfer Across Languages [49.78504488452978]
We introduce a novel task of Multimodal Style Translation (MuST-Bench).
MuST-Bench is a benchmark designed to evaluate the ability of visual text generation models to perform translation across different writing systems.
In response, we introduce SIGIL, a framework for multimodal style translation that eliminates the need for style descriptions.
arXiv Detail & Related papers (2024-10-24T15:15:01Z)
- Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z)
- LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation [121.45667242282721]
We propose a coarse-to-fine paradigm to achieve layout planning and image generation.
Our proposed method outperforms the state-of-the-art models in terms of photorealistic layout and image generation.
arXiv Detail & Related papers (2023-08-09T17:45:04Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Boosting Modern and Historical Handwritten Text Recognition with Deformable Convolutions [52.250269529057014]
Handwritten Text Recognition (HTR) in free-layout pages is a challenging image understanding task.
We propose to adopt deformable convolutions, which can deform depending on the input at hand and better adapt to the geometric variations of the text.
arXiv Detail & Related papers (2022-08-17T06:55:54Z)