Training-Free Sketch-Guided Diffusion with Latent Optimization
- URL: http://arxiv.org/abs/2409.00313v2
- Date: Wed, 07 May 2025 06:28:51 GMT
- Title: Training-Free Sketch-Guided Diffusion with Latent Optimization
- Authors: Sandra Zhang Ding, Jiafeng Mao, Kiyoharu Aizawa
- Abstract summary: We propose an innovative training-free pipeline that extends existing text-to-image generation models to incorporate a sketch as an additional condition. To generate new images with a layout and structure closely resembling the input sketch, we find that these core features of a sketch can be tracked with the cross-attention maps of diffusion models. We introduce latent optimization, a method that refines the noisy latent at each intermediate step of the generation process.
- Score: 22.94468603089249
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building on recent advances in diffusion models, text-to-image (T2I) generation models have demonstrated their ability to generate diverse and high-quality images. However, leveraging their potential for real-world content creation, particularly in providing users with precise control over the generated result, poses a significant challenge. In this paper, we propose an innovative training-free pipeline that extends existing text-to-image generation models to incorporate a sketch as an additional condition. To generate new images with a layout and structure closely resembling the input sketch, we find that these core features of a sketch can be tracked with the cross-attention maps of diffusion models. We introduce latent optimization, a method that refines the noisy latent at each intermediate step of the generation process using cross-attention maps to ensure that the generated images adhere closely to the desired structure outlined in the reference sketch. Through latent optimization, our method enhances the accuracy of image generation, offering users greater control and customization options in content creation.
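As an illustration of the method, the per-step latent refinement can be sketched as a small gradient-based inner loop: at each denoising step, an alignment loss between the model's cross-attention maps and the sketch layout is backpropagated to the noisy latent before the scheduler update. The toy denoiser, the MSE alignment loss, and the plain gradient update below are placeholder assumptions standing in for the pre-trained T2I U-Net, its cross-attention layers, and the actual sampler; this is a minimal sketch, not the authors' implementation.

```python
import torch


class ToyDenoiser(torch.nn.Module):
    """Placeholder for a pre-trained U-Net: predicts noise and exposes one cross-attention map."""

    def __init__(self, channels=4):
        super().__init__()
        self.eps_head = torch.nn.Conv2d(channels, channels, 3, padding=1)
        self.attn_head = torch.nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, latent, t):
        eps = self.eps_head(latent)                   # stand-in for the predicted noise
        attn = torch.sigmoid(self.attn_head(latent))  # stand-in for a cross-attention map
        return eps, attn


def optimize_latent(latent, sketch_mask, denoiser, t, inner_steps=5, lr=0.05):
    """Refine the noisy latent so its attention map aligns with the sketch layout."""
    latent = latent.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(inner_steps):
        _, attn = denoiser(latent, t)
        loss = torch.nn.functional.mse_loss(attn, sketch_mask)  # attention-to-sketch alignment
        opt.zero_grad()
        loss.backward()
        opt.step()
    return latent.detach()


denoiser = ToyDenoiser()
latent = torch.randn(1, 4, 64, 64)                      # noisy latent
sketch_mask = (torch.rand(1, 1, 64, 64) > 0.9).float()  # binarized sketch at latent resolution
for t in reversed(range(50)):                           # toy sampling loop
    latent = optimize_latent(latent, sketch_mask, denoiser, t)
    eps, _ = denoiser(latent, t)
    latent = latent - 0.02 * eps                        # placeholder for the scheduler update
```

In a real pipeline, the attention maps for the relevant prompt tokens would be gathered from the U-Net's cross-attention layers, and the update of the latent after refinement would come from the diffusion sampler's scheduler step.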
Related papers
- Policy Optimized Text-to-Image Pipeline Design [72.87655664038617]
We introduce a novel reinforcement learning-based framework for text-to-image generation.
Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations.
We then implement a two-phase training strategy: initial vocabulary training followed by GRPO-based optimization.
arXiv Detail & Related papers (2025-05-27T17:50:47Z) - Training-free Stylized Text-to-Image Generation with Fast Inference [24.55785152141884]
We propose a novel stylized image generation method leveraging a pre-trained large-scale diffusion model.
We exploit the self-consistency property of latent consistency models to extract the representative style statistics.
We then introduce the norm mixture of self-attention, which enables the model to query the most relevant style patterns.
arXiv Detail & Related papers (2025-05-25T09:38:23Z) - Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach.
Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance.
Experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z) - Boosting Generative Image Modeling via Joint Image-Feature Synthesis [10.32324138962724]
We introduce a novel generative image modeling framework that bridges the gap between generation and representation learning by leveraging a diffusion model to jointly model low-level image latents alongside high-level semantic features.
Our latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise.
By eliminating the need for complex distillation objectives, our unified design simplifies training and unlocks a powerful new inference strategy: Representation Guidance.
arXiv Detail & Related papers (2025-04-22T17:41:42Z) - SwiftSketch: A Diffusion Model for Image-to-Vector Sketch Generation [57.47730473674261]
We introduce SwiftSketch, a model for image-conditioned vector sketch generation that can produce high-quality sketches in less than a second.
SwiftSketch operates by progressively denoising stroke control points sampled from a Gaussian distribution.
The accompanying ControlSketch method enhances SDS-based techniques by incorporating precise spatial control through a depth-aware ControlNet.
arXiv Detail & Related papers (2025-02-12T18:57:12Z) - High-Resolution Image Synthesis via Next-Token Prediction [19.97037318862443]
We introduce D-JEPA·T2I, an autoregressive model based on continuous tokens to generate high-quality, photorealistic images at arbitrary resolutions, up to 4K.
For the first time, we achieve state-of-the-art high-resolution image synthesis via next-token prediction.
arXiv Detail & Related papers (2024-11-22T09:08:58Z) - Illustrious: an Open Advanced Illustration Model [7.428509329724737]
We develop an anime-focused text-to-image generative model, called Illustrious, that achieves high resolution, a wide dynamic color range, and strong restoration ability.
We focus on three critical approaches for model improvement. First, we delve into the significance of batch size and dropout control, which enables faster learning of controllable token-based concept activations.
Second, we increase the training resolution of images, which improves the accurate depiction of character anatomy at much higher resolutions and extends generation capability beyond 20MP with appropriate methods.
arXiv Detail & Related papers (2024-09-30T04:59:12Z) - ZePo: Zero-Shot Portrait Stylization with Faster Sampling [61.14140480095604]
This paper presents an inversion-free portrait stylization framework based on diffusion models that accomplishes content and style feature fusion in merely four sampling steps.
We propose a feature merging strategy to amalgamate redundant features in Consistency Features, thereby reducing the computational load of attention control.
arXiv Detail & Related papers (2024-08-10T08:53:41Z) - Coherent and Multi-modality Image Inpainting via Latent Space Optimization [61.99406669027195]
PILOT (inPainting vIa Latent OpTimization) is an optimization approach grounded in novel semantic centralization and background preservation losses.
Our method searches latent spaces capable of generating inpainted regions that exhibit high fidelity to user-provided prompts while maintaining coherence with the background.
arXiv Detail & Related papers (2024-07-10T19:58:04Z) - Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning [40.06403155373455]
We propose a novel reinforcement learning framework for personalized text-to-image generation.
Our proposed approach outperforms existing state-of-the-art methods by a large margin on visual fidelity while maintaining text alignment.
arXiv Detail & Related papers (2024-07-09T08:11:53Z) - FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis [48.9652334528436]
We introduce FouriScale, an innovative training-free approach from the perspective of frequency-domain analysis.
We replace the original convolutional layers in pre-trained diffusion models by incorporating a dilation technique along with a low-pass operation.
Our method successfully balances the structural integrity and fidelity of generated images, achieving an astonishing capacity of arbitrary-size, high-resolution, and high-quality generation.
arXiv Detail & Related papers (2024-03-19T17:59:33Z) - Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization [56.12990759116612]
Pick-and-Draw is a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods.
The proposed approach can be applied to any personalized diffusion models and requires as few as a single reference image.
arXiv Detail & Related papers (2024-01-30T05:56:12Z) - DreamDrone: Text-to-Image Diffusion Models are Zero-shot Perpetual View Generators [56.994967294931286]
We introduce DreamDrone, a novel zero-shot and training-free pipeline for generating flythrough scenes from textual prompts.
We advocate explicitly warping the intermediate latent code of the pre-trained text-to-image diffusion model for high-quality image generation and unbounded generalization ability.
arXiv Detail & Related papers (2023-12-14T08:42:26Z) - Reference-based Image Composition with Sketch via Structure-aware Diffusion Model [38.1193912666578]
We introduce a multi-input-conditioned image composition model that incorporates a sketch as a novel modality, alongside a reference image.
Thanks to the edge-level controllability using sketches, our method enables a user to edit or complete an image sub-part.
Our framework fine-tunes a pre-trained diffusion model to complete missing regions using the reference image while maintaining sketch guidance.
arXiv Detail & Related papers (2023-03-31T06:12:58Z) - End-to-End Diffusion Latent Optimization Improves Classifier Guidance [81.27364542975235]
Direct Optimization of Diffusion Latents (DOODL) is a novel guidance method.
It enables plug-and-play guidance by optimizing diffusion latents.
It outperforms one-step classifier guidance on computational and human evaluation metrics.
arXiv Detail & Related papers (2023-03-23T22:43:52Z) - BIGRoC: Boosting Image Generation via a Robust Classifier [27.66648389933265]
We propose a general model-agnostic technique for improving the image quality and the distribution fidelity of generated images.
Our method, termed BIGRoC, is based on a post-processing procedure via the guidance of a given robust classifier.
arXiv Detail & Related papers (2021-08-08T18:05:44Z)
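For the last entry above, the robust-classifier post-processing idea can be sketched as projected gradient ascent on the classifier's target-class logit, applied to an already-generated image. The untrained ResNet-18, step size, and L-infinity radius below are placeholder assumptions, not the BIGRoC authors' configuration.

```python
import torch
import torchvision


def boost_image(image, classifier, target_class, steps=10, step_size=0.01, radius=0.1):
    """Nudge a generated image toward higher target-class confidence while staying close to the original."""
    original = image.detach()
    image = image.detach().clone().requires_grad_(True)
    for _ in range(steps):
        logit = classifier(image)[0, target_class]
        grad, = torch.autograd.grad(logit, image)
        image = image + step_size * grad.sign()                       # ascend the class logit
        image = original + (image - original).clamp(-radius, radius)  # project back near the generation
        image = image.clamp(0, 1).detach().requires_grad_(True)
    return image.detach()


classifier = torchvision.models.resnet18(weights=None).eval()  # stand-in for an adversarially robust classifier
generated = torch.rand(1, 3, 224, 224)                         # stand-in for a generated image
refined = boost_image(generated, classifier, target_class=207)
```

An adversarially robust classifier is what makes such gradients perceptually meaningful; with a standard classifier the same loop would mostly add adversarial noise.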