Related papers: Training-Free Consistent Text-to-Image Generation

URL: http://arxiv.org/abs/2402.03286v3
Date: Thu, 30 May 2024 11:42:15 GMT
Title: Training-Free Consistent Text-to-Image Generation
Authors: Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, Yuval Atzmon
Abstract summary: Using text-to-image models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects. We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
Score: 80.4814768762066
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-image models offer a new level of creative flexibility by allowing users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects or add image conditioning to the model. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to align generated images with text prompts and face difficulties in portraying multiple subjects. Here, we present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model. We introduce a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. We compare ConsiStory to a range of baselines, and demonstrate state-of-the-art performance on subject consistency and text alignment, without requiring a single optimization step. Finally, ConsiStory can naturally extend to multi-subject scenarios, and even enable training-free personalization for common objects.
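The abstract names a subject-driven shared attention block but gives no implementation details here. The following is a minimal NumPy sketch of one plausible reading, not the authors' method: each image in a batch attends to its own tokens plus the subject tokens of the other images. The function name `shared_self_attention`, the tensor shapes, and the boolean `subject_mask` input are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) evaluates to 0 for masked entries.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def shared_self_attention(q, k, v, subject_mask):
    """Hypothetical shared attention across a batch of images.

    q, k, v:       arrays of shape (batch, tokens, dim)
    subject_mask:  boolean array of shape (batch, tokens), True where a token
                   is assumed to belong to the subject
    """
    b, n, d = q.shape
    # Pool keys and values from every image in the batch: (batch * tokens, dim).
    k_all = k.reshape(b * n, d)
    v_all = v.reshape(b * n, d)
    out = np.empty_like(q)
    for i in range(b):
        # Image i may always attend to its own tokens, and to tokens of the
        # other images only where those tokens are marked as subject.
        allowed = subject_mask.copy()
        allowed[i, :] = True
        allowed = allowed.reshape(b * n)
        logits = q[i] @ k_all.T / np.sqrt(d)  # (tokens, batch * tokens)
        logits = np.where(allowed[None, :], logits, -np.inf)
        out[i] = softmax(logits) @ v_all
    return out
```

With an all-False mask this reduces to ordinary per-image self-attention, which makes the sharing behavior easy to check in isolation.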

Related papers

Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt [14.734857939203811]
We propose a training-free approach that addresses semantic entanglement from a subject perspective. Our approach significantly improves both subject consistency and text alignment over existing baselines.
arXiv Detail & Related papers (2025-12-18T11:55:06Z)
StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization [31.250596607318364]
Existing approaches, which typically rely on fine-tuning or retraining models, are computationally expensive, time-consuming, and often interfere with the model's pre-existing capabilities. This paper proposes an efficient consistent-subject-generation method. Experimental results demonstrate that our approach successfully generates visually consistent subjects across a variety of scenarios.
arXiv Detail & Related papers (2025-07-31T11:24:40Z)
Bringing Characters to New Stories: Training-Free Theme-Specific Image Generation via Dynamic Visual Prompting [71.29100512700064]
We present T-Prompter, a training-free method for theme-specific image generation. T-Prompter integrates reference images into generative models, allowing users to seamlessly specify the target theme. Our approach enables consistent story generation, character design, realistic character generation, and style-guided image generation.
arXiv Detail & Related papers (2025-01-26T19:01:19Z)
Nested Attention: Semantic-aware Attention Values for Concept Personalization [78.90196530697897]
We introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model's existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image.
arXiv Detail & Related papers (2025-01-02T18:52:11Z)
Conditional Text-to-Image Generation with Reference Guidance [81.99538302576302]
This paper explores conditioning diffusion models on an additional image that provides visual guidance about the particular subjects to generate. We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references. Our expert plugins, each containing only 28.55M trainable parameters, outperform existing methods on all tasks.
arXiv Detail & Related papers (2024-11-22T21:38:51Z)
Learning to Customize Text-to-Image Diffusion In Diverse Context [23.239646132590043]
Most text-to-image customization techniques fine-tune models on a small set of personal concept images captured in minimal contexts. We resort to diversifying the context of these personal concepts by simply creating a contextually rich set of text prompts. Surprisingly, this straightforward and cost-effective method significantly improves semantic alignment in the textual space. Our approach does not require any architectural modifications, making it highly compatible with existing text-to-image customization methods.
arXiv Detail & Related papers (2024-10-14T00:53:59Z)
MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance [6.4680449907623006]
This research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multi-subjects. The proposed multi-subject cross-attention orchestrates inter-subject compositions while preserving the control of texts.
arXiv Detail & Related papers (2024-06-11T12:32:53Z)
Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization [56.12990759116612]
Pick-and-Draw is a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods. The proposed approach can be applied to any personalized diffusion model and requires as few as a single reference image.
arXiv Detail & Related papers (2024-01-30T05:56:12Z)
Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning. Our method can effectively learn the prompts to improve the matches between the input text and the generated images.
arXiv Detail & Related papers (2024-01-12T03:46:29Z)
PALP: Prompt Aligned Personalization of Text-to-Image Models [68.91005384187348]
Existing personalization methods compromise personalization ability or the alignment to complex prompts. We propose a new approach focusing on personalization methods for a single prompt to address this issue. Our method excels in improving text alignment, enabling the creation of images with complex and intricate prompts.
arXiv Detail & Related papers (2024-01-11T18:35:33Z)
Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model [22.975965453227477]
We introduce a new framework called Paste, Inpaint and Harmonize via Denoising (PhD). In our experiments, we apply PhD to subject-driven image editing tasks and explore text-driven scene generation given a reference subject.
arXiv Detail & Related papers (2023-06-13T07:43:10Z)
Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users. The method is based on a general framework that bypasses the lengthy optimization required by previous approaches. We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models [43.32978092618245]
We present a novel neural pipeline for generating a coherent storybook from the plain text of a story. We leverage a combination of a pre-trained Large Language Model and a text-guided Latent Diffusion Model to generate coherent images.
arXiv Detail & Related papers (2023-02-08T06:24:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.