Bringing Characters to New Stories: Training-Free Theme-Specific Image Generation via Dynamic Visual Prompting
- URL: http://arxiv.org/abs/2501.15641v1
- Date: Sun, 26 Jan 2025 19:01:19 GMT
- Title: Bringing Characters to New Stories: Training-Free Theme-Specific Image Generation via Dynamic Visual Prompting
- Authors: Yuxin Zhang, Minyan Luo, Weiming Dong, Xiao Yang, Haibin Huang, Chongyang Ma, Oliver Deussen, Tong-Yee Lee, Changsheng Xu
- Abstract summary: We present T-Prompter, a training-free method for theme-specific image generation.
T-Prompter integrates reference images into generative models, allowing users to seamlessly specify the target theme.
Our approach enables consistent story generation, character design, realistic character generation, and style-guided image generation.
- Score: 71.29100512700064
- License:
- Abstract: The stories and characters that captivate us as we grow up shape unique fantasy worlds, with images serving as the primary medium for visually experiencing these realms. Personalizing generative models through fine-tuning with theme-specific data has become a prevalent approach in text-to-image generation. However, unlike object customization, which focuses on learning specific objects, theme-specific generation encompasses diverse elements such as characters, scenes, and objects. This diversity introduces a key challenge: how to adaptively generate multi-character, multi-concept, and continuous theme-specific images (TSI). Moreover, fine-tuning approaches often come with significant computational overhead, time costs, and risks of overfitting. This paper explores a fundamental question: can image generation models directly leverage images as contextual input, much as large language models use text as context? To address this, we present T-Prompter, a novel training-free method for TSI generation. T-Prompter introduces visual prompting, a mechanism that integrates reference images into generative models, allowing users to seamlessly specify the target theme without requiring additional training. To further enhance this process, we propose a Dynamic Visual Prompting (DVP) mechanism, which iteratively optimizes visual prompts to improve the accuracy and quality of generated images. Our approach enables diverse applications, including consistent story generation, character design, realistic character generation, and style-guided image generation. Comparative evaluations against state-of-the-art personalization methods demonstrate that T-Prompter achieves significantly better results, excelling at preserving character identity, maintaining style consistency, and aligning with text prompts, and offers a robust and flexible solution for theme-specific image generation.
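The abstract describes Dynamic Visual Prompting only at a high level (reference images are composed into a visual prompt, which is then iteratively optimized), so the following is a minimal, hypothetical sketch of what such a training-free loop could look like. The grid-style canvas, the `generate_fn`/`score_fn` interfaces, and the reference re-selection rule are all illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a training-free dynamic visual prompting loop.
# `generate_fn` and `score_fn` are assumed interfaces, not APIs from the paper.
from PIL import Image


def make_visual_prompt(references, tile=256):
    """Stitch the reference images plus one blank target cell into a single canvas.
    The grid layout is an assumption about how a visual prompt could be arranged."""
    cells = [img.resize((tile, tile)) for img in references]
    canvas = Image.new("RGB", (tile * (len(cells) + 1), tile), "white")
    for i, cell in enumerate(cells):
        canvas.paste(cell, (i * tile, 0))
    return canvas  # the rightmost cell is left blank for the model to fill in


def dynamic_visual_prompting(references, text_prompt, generate_fn, score_fn,
                             num_rounds=3, k=3):
    """Iteratively re-select the reference images that best support the prompt."""
    selected = list(references)[:k]
    best_image, best_score = None, float("-inf")
    for _ in range(num_rounds):
        prompt_canvas = make_visual_prompt(selected)
        image = generate_fn(prompt_canvas, text_prompt)   # e.g. an in-context / inpainting pass
        score = score_fn(image, selected, text_prompt)    # e.g. identity + text alignment
        if score > best_score:
            best_image, best_score = image, score
        # keep the references most consistent with the current result for the next round
        selected = sorted(references,
                          key=lambda r: score_fn(image, [r], text_prompt),
                          reverse=True)[:k]
    return best_image
```

Here `generate_fn` would wrap whatever diffusion model fills in the blank cell of the canvas, and `score_fn` could combine an identity-similarity measure with a text-alignment measure such as CLIP similarity; both are placeholders rather than components named by the paper.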
Related papers
- One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt [101.17660804110409]
Text-to-image generation models can create high-quality images from input prompts.
However, they struggle to support the consistent, identity-preserving generation that storytelling requires.
We propose a novel training-free method for consistent text-to-image generation.
arXiv Detail & Related papers (2025-01-23T10:57:22Z) - Object-level Visual Prompts for Compositional Image Generation [75.6085388740087]
We introduce a method for composing object-level visual prompts within a text-to-image diffusion model.
A key challenge in this task is to preserve the identity of the objects depicted in the input visual prompts.
We introduce a new KV-mixed cross-attention mechanism, in which keys and values are learned from distinct visual representations (a hedged sketch of this idea appears after this list).
arXiv Detail & Related papers (2025-01-02T18:59:44Z) - MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance [6.4680449907623006]
This research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multiple subjects.
The proposed multi-subject cross-attention orchestrates inter-subject compositions while preserving control from the text prompt.
arXiv Detail & Related papers (2024-06-11T12:32:53Z) - Visual Style Prompting with Swapping Self-Attention [26.511518230332758]
We propose a novel approach to produce a diverse range of images while maintaining specific style elements and nuances.
During the denoising process, we keep the queries from the original features while swapping the keys and values with those from the reference features in the late self-attention layers (a hedged sketch of this swap appears after this list).
Our method demonstrates superiority over existing approaches, best reflecting the style of the references and ensuring that resulting images match the text prompts most accurately.
arXiv Detail & Related papers (2024-02-20T12:51:17Z) - Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Consistently portraying the same subject across diverse prompts remains challenging for text-to-image models.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z) - Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization [56.12990759116612]
Pick-and-Draw is a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods.
The proposed approach can be applied to any personalized diffusion model and requires as little as a single reference image.
arXiv Detail & Related papers (2024-01-30T05:56:12Z) - Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z) - Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models that outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
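The entry above for "Object-level Visual Prompts for Compositional Image Generation" mentions a KV-mixed cross-attention mechanism but gives no implementation details. The sketch below only illustrates the generic idea of projecting keys and values from two different visual representations inside one cross-attention layer; which representation feeds which projection, the shapes, and the helper names are assumptions made for illustration, not the paper's design.

```python
# Generic sketch of "KV-mixed" cross-attention: keys and values are projected
# from two different visual representations of the same visual prompt.
# Which representation feeds keys vs. values is an assumption, not the paper's design.
import torch.nn.functional as F


def kv_mixed_cross_attention(image_tokens, rep_for_keys, rep_for_values,
                             to_q, to_k, to_v):
    """image_tokens: features of the image being denoised, shape (batch, n, dim).
    rep_for_keys / rep_for_values: two distinct visual representations of the
    visual prompt, each of shape (batch, m, dim).
    to_q / to_k / to_v: the attention layer's usual linear projections."""
    q = to_q(image_tokens)       # queries come from the generated image's features
    k = to_k(rep_for_keys)       # keys come from one visual representation
    v = to_v(rep_for_values)     # values come from the other representation
    return F.scaled_dot_product_attention(q, k, v)
```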
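Similarly, the entry for "Visual Style Prompting with Swapping Self-Attention" describes keeping the queries from the original (content) features while taking the keys and values from the reference (style) features in late self-attention layers. The sketch below shows that swap in isolation; the 16-layer layout, the particular set of "late" layers, and the helper names are assumptions, not the authors' implementation.

```python
# Sketch of swapping self-attention: in selected "late" layers, the keys/values
# come from a reference (style) pass while the queries stay with the content pass.
# Layer count, layer indices, and helper names are assumptions for illustration.
import torch.nn.functional as F


def swapped_self_attention(q_content, k_content, v_content,
                           k_reference, v_reference, swap):
    """Standard scaled-dot-product attention, except that when `swap` is True
    the keys and values come from the reference features instead of the content's."""
    k = k_reference if swap else k_content
    v = v_reference if swap else v_content
    return F.scaled_dot_product_attention(q_content, k, v)


# Example wiring for a hypothetical 16-layer denoiser: swap only in late layers.
LATE_LAYERS = set(range(10, 16))


def run_self_attention_layer(layer_idx, content_feats, reference_feats,
                             to_q, to_k, to_v):
    q = to_q(content_feats)
    k_c, v_c = to_k(content_feats), to_v(content_feats)
    k_r, v_r = to_k(reference_feats), to_v(reference_feats)
    return swapped_self_attention(q, k_c, v_c, k_r, v_r,
                                  swap=layer_idx in LATE_LAYERS)
```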