Consistent text-to-image generation via scene de-contextualization
- URL: http://arxiv.org/abs/2510.14553v1
- Date: Thu, 16 Oct 2025 10:54:49 GMT
- Title: Consistent text-to-image generation via scene de-contextualization
- Authors: Song Tang, Peihao Gong, Kunyu Li, Kai Guo, Boyu Wang, Mao Ye, Jianwei Zhang, Xiatian Zhu
- Abstract summary: Consistent text-to-image (T2I) generation often fails due to a phenomenon called identity (ID) shift. This paper reveals that a key source of ID shift is the native correlation between subject and scene context. We propose Scene De-Contextualization (SDeC), which inverts T2I's built-in scene contextualization.
- Score: 48.19924216489272
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Consistent text-to-image (T2I) generation seeks to produce identity-preserving images of the same subject across diverse scenes, yet it often fails due to a phenomenon called identity (ID) shift. Previous methods have tackled this issue, but typically rely on the unrealistic assumption of knowing all target scenes in advance. This paper reveals that a key source of ID shift is the native correlation between subject and scene context, called scene contextualization, which arises naturally as T2I models fit the training distribution of vast natural images. We formally prove the near-universality of this scene-ID correlation and derive theoretical bounds on its strength. On this basis, we propose a novel, efficient, training-free prompt embedding editing approach, called Scene De-Contextualization (SDeC), that inverts T2I's built-in scene contextualization. Specifically, it identifies and suppresses the latent scene-ID correlation within the ID prompt's embedding by quantifying SVD directional stability to adaptively re-weight the corresponding eigenvalues. Critically, SDeC allows for per-scene use (one scene per prompt) without requiring prior access to all target scenes. This makes it a highly flexible and general solution, well suited to real-world applications where such prior knowledge is often unavailable or varies over time. Experiments demonstrate that SDeC significantly enhances identity preservation while maintaining scene diversity.
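To make the mechanism concrete, below is a minimal NumPy sketch of SVD-based prompt-embedding re-weighting in the spirit of SDeC. It is an illustration under stated assumptions, not the paper's exact algorithm: the stability score (mean absolute cosine agreement between the ID prompt's singular directions and those of contextualized variants), the variant-generation step, and the names directional_stability and de_contextualize are hypothetical stand-ins. The variants here could be lightweight perturbations of the same prompt rather than actual target scenes, which would keep per-scene use intact.

```python
# Minimal sketch (assumptions, not the paper's exact algorithm): estimate how
# stable each SVD direction of an ID-prompt embedding is across contextualized
# variants, then down-weight unstable (scene-correlated) directions.
import numpy as np

def directional_stability(embeds):
    """embeds[0]: (tokens x dim) ID-prompt embedding; embeds[1:]: the same
    prompt embedded under different contexts (an assumption here)."""
    U0, S0, Vt0 = np.linalg.svd(embeds[0], full_matrices=False)
    scores = np.zeros(len(S0))
    for E in embeds[1:]:
        _, _, Vt = np.linalg.svd(E, full_matrices=False)
        # |cos| between each reference direction and its closest variant direction
        cos = np.abs(Vt0 @ Vt.T)
        scores += cos.max(axis=1)
    scores /= max(len(embeds) - 1, 1)
    return U0, S0, Vt0, scores  # stability in [0, 1]; higher = more ID-like

def de_contextualize(embeds, alpha=1.0):
    """Re-weight singular values by directional stability and reconstruct
    the edited prompt embedding (hypothetical re-weighting rule)."""
    U0, S0, Vt0, stab = directional_stability(embeds)
    weights = stab ** alpha  # stable ID directions keep weight near 1
    return (U0 * (S0 * weights)) @ Vt0

# Toy usage with random stand-ins for a text encoder's output (77 tokens x 768 dims).
rng = np.random.default_rng(0)
base = rng.standard_normal((77, 768))
variants = [base + 0.1 * rng.standard_normal(base.shape) for _ in range(4)]
edited = de_contextualize([base] + variants)
print(edited.shape)  # (77, 768)
```

The edited embedding would then stand in for the original ID-prompt embedding at inference time; since each call needs only the current prompt and its variants, nothing about other target scenes is assumed.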
Related papers
- DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer [21.788582116033684]
Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video. Existing methods struggle to maintain identity similarity and attribute fidelity while preserving temporal consistency. We propose a comprehensive framework that transfers the strengths of Image Face Swapping to the video domain.
arXiv Detail & Related papers (2026-01-04T08:07:11Z) - Geometry-Aware Scene-Consistent Image Generation [14.644679152141904]
We study geometry-aware scene-consistent image generation. The goal is to synthesize an output image that preserves the same physical environment as the reference scene. We introduce two key contributions: (i) a scene-consistent data construction pipeline that generates diverse, geometrically grounded training pairs, and (ii) a novel geometry-guided attention loss.
arXiv Detail & Related papers (2025-12-14T08:35:04Z) - Infinite-Story: A Training-Free Consistent Text-to-Image Generation [21.872330710303036]
We present Infinite-Story, a training-free framework for consistent text-to-image (T2I) generation. Our method addresses two key challenges in consistent T2I generation: identity inconsistency and style inconsistency. It achieves state-of-the-art generation performance while offering over 6X faster inference (1.72 seconds per image) than the fastest existing consistent T2I models.
arXiv Detail & Related papers (2025-11-17T05:46:16Z) - SceneDecorator: Towards Scene-Oriented Story Generation with Scene Planning and Scene Consistency [47.20554570948312]
SceneDecorator is a training-free framework that employs VLM-Guided Scene Planning to ensure narrative coherence across different scenes in a "global-to-local" manner. Extensive experiments demonstrate the superior performance of SceneDecorator, highlighting its potential to unleash creativity in the fields of arts, films, and games.
arXiv Detail & Related papers (2025-10-27T04:19:22Z) - Personalized Face Super-Resolution with Identity Decoupling and Fitting [50.473357681579664]
In extreme degradation scenarios, critical attributes and ID information are often severely lost in the input image. Existing methods tend to generate hallucinated faces under such conditions, producing restored images lacking authentic ID constraints. We propose a novel FSR method with Identity Decoupling and Fitting (IDFSR) to enhance ID restoration under large scaling factors.
arXiv Detail & Related papers (2025-08-13T02:33:11Z) - Subject-Consistent and Pose-Diverse Text-to-Image Generation [36.67159307721023]
We propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi. It enables consistent subject generation with diverse poses and layouts. CoDi achieves both better visual perception and stronger performance across all metrics.
arXiv Detail & Related papers (2025-07-11T08:15:56Z) - Replace in Translation: Boost Concept Alignment in Counterfactual Text-to-Image [53.09546752700792]
We propose a strategy, called Explicit Logical Narrative Prompt (ELNP), to guide this replacement process. We design a metric that calculates how many of the required concepts in the prompt are covered, on average, in the synthesized images. Extensive experiments and qualitative comparisons demonstrate that our strategy can boost concept alignment in counterfactual T2I.
arXiv Detail & Related papers (2025-05-20T13:27:52Z) - Insert Anything: Image Insertion via In-Context Editing in DiT [19.733787045511775]
We present a unified framework for reference-based image insertion that seamlessly integrates objects from reference images into target scenes under flexible, user-specified control guidance. Our approach is trained once on our new AnyInsertion dataset, comprising 120K prompt-image pairs covering diverse tasks such as person, object, and garment insertion, and effortlessly generalizes to a wide range of insertion scenarios.
arXiv Detail & Related papers (2025-04-21T10:19:12Z) - One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt [101.17660804110409]
Text-to-image generation models can create high-quality images from input prompts. However, they struggle to consistently generate identity-preserving images for storytelling. We propose a novel training-free method for consistent text-to-image generation.
arXiv Detail & Related papers (2025-01-23T10:57:22Z) - Locate, Assign, Refine: Taming Customized Promptable Image Inpainting [22.163855501668206]
We introduce the multimodal promptable image inpainting project: a new task, model, and data for taming customized image inpainting. We propose LAR-Gen, a novel approach for image inpainting that enables seamless inpainting of specific regions in images corresponding to the mask prompt. LAR-Gen adopts a coarse-to-fine scheme to ensure context consistency with the source image, subject identity consistency, local semantic consistency with the text description, and smoothness consistency.
arXiv Detail & Related papers (2024-03-28T16:07:55Z) - Beyond Inserting: Learning Identity Embedding for Semantic-Fidelity Personalized Diffusion Generation [21.739328335601716]
This paper focuses on inserting an accurate and interactive ID embedding into the Stable Diffusion model for personalized generation.
We propose a face-wise attention loss to fit the face region instead of entangling ID-unrelated information, such as face layout and background.
Our results exhibit superior ID accuracy, text-based manipulation ability, and generalization compared to previous methods.
arXiv Detail & Related papers (2024-01-31T11:52:33Z) - Story Visualization by Online Text Augmentation with Context Memory [64.86944645907771]
We propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation.
The proposed method significantly outperforms the state of the art in various metrics including FID, character F1, frame accuracy, BLEU-2/3, and R-precision.
arXiv Detail & Related papers (2023-08-15T05:08:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.