Stencil: Subject-Driven Generation with Context Guidance
- URL: http://arxiv.org/abs/2509.17120v1
- Date: Sun, 21 Sep 2025 15:19:08 GMT
- Title: Stencil: Subject-Driven Generation with Context Guidance
- Authors: Gordon Chen, Ziqi Huang, Cheston Tan, Ziwei Liu
- Abstract summary: Recent text-to-image diffusion models can generate striking visuals from text prompts, but they often fail to maintain subject consistency across generations and contexts. We present Stencil, a novel framework that jointly employs two diffusion models during inference. Stencil excels at generating high-fidelity, novel renditions of the subject in less than a minute, delivering state-of-the-art performance and setting a new benchmark in subject-driven generation.
- Score: 46.39238993793864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent text-to-image diffusion models can generate striking visuals from text prompts, but they often fail to maintain subject consistency across generations and contexts. One major limitation of current fine-tuning approaches is the inherent trade-off between quality and efficiency. Fine-tuning large models improves fidelity but is computationally expensive, while fine-tuning lightweight models improves efficiency but compromises image fidelity. Moreover, fine-tuning pre-trained models on a small set of images of the subject can damage the existing priors, resulting in suboptimal results. To this end, we present Stencil, a novel framework that jointly employs two diffusion models during inference. Stencil efficiently fine-tunes a lightweight model on images of the subject, while a large frozen pre-trained model provides contextual guidance during inference, injecting rich priors to enhance generation with minimal overhead. Stencil excels at generating high-fidelity, novel renditions of the subject in less than a minute, delivering state-of-the-art performance and setting a new benchmark in subject-driven generation.
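The abstract does not spell out how the two models interact at each sampling step. Below is a minimal, hypothetical sketch of the general idea, assuming the frozen large model's noise prediction is blended into the lightweight subject-tuned model's prediction during sampling; the model classes, the blending rule, and the `context_scale` weight are illustrative placeholders, not the authors' implementation.

```python
# Toy sketch of dual-model inference: a lightweight subject-tuned denoiser plus a
# frozen large denoiser that injects contextual priors. All components are stand-ins.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for the lightweight model fine-tuned on subject images."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, x, t):
        return self.net(x)

class LargeDenoiser(nn.Module):
    """Stand-in for the frozen, large pre-trained model with rich priors."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))

    def forward(self, x, t):
        return self.net(x)

@torch.no_grad()
def sample_with_context_guidance(tuned, frozen, steps=50, dim=64, context_scale=0.3):
    """Blend the frozen model's prediction into the tuned model's at each step."""
    x = torch.randn(1, dim)
    for t in reversed(range(steps)):
        eps_subject = tuned(x, t)    # subject fidelity from the tuned model
        eps_context = frozen(x, t)   # contextual prior from the frozen model
        eps = eps_subject + context_scale * (eps_context - eps_subject)
        x = x - eps / steps          # toy update standing in for a real DDIM/DDPM step
    return x

out = sample_with_context_guidance(TinyDenoiser(), LargeDenoiser())
print(out.shape)  # torch.Size([1, 64])
```

In a real setting the lightweight model would presumably be a small or adapter-style backbone fine-tuned on the subject images, and the frozen model a full pre-trained text-to-image diffusion model; the abstract only states that the frozen model provides contextual guidance with minimal overhead.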
Related papers
- Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model [87.23753533733046]
We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder.
arXiv Detail & Related papers (2025-05-29T16:15:48Z)
- Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation [20.67671141789497]
We propose the first VAR-based approach for subject-driven generation. We find that the early stages have a greater influence on the generation of the subject than the later stages. Based on this finding, we propose scale-wise weighted tuning, which prioritizes coarser resolutions to encourage the model to focus on subject-relevant information.
arXiv Detail & Related papers (2025-04-03T14:12:55Z)
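The summary above only names the mechanism. As a rough illustration, scale-wise weighted tuning can be read as weighting per-scale training losses so coarser resolutions dominate fine-tuning; the decay schedule and toy losses below are assumptions, not the paper's formulation.

```python
# Toy sketch of coarse-biased, scale-wise weighting of per-scale losses.
import torch

def scale_weights(num_scales, decay=0.5):
    """Higher weight for coarser scales (scale 0 = coarsest)."""
    w = torch.tensor([decay ** s for s in range(num_scales)])
    return w / w.sum()

def weighted_tuning_loss(per_scale_losses, decay=0.5):
    """Combine per-scale token-prediction losses with coarse-biased weights."""
    losses = torch.stack(per_scale_losses)
    return (scale_weights(len(per_scale_losses), decay) * losses).sum()

# Example: pretend we already computed a loss at each of 4 resolutions.
toy_losses = [torch.tensor(0.9), torch.tensor(1.1), torch.tensor(1.0), torch.tensor(1.2)]
print(weighted_tuning_loss(toy_losses))
```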
- Active Generation for Image Classification [45.93535669217115]
We propose to address the efficiency of image generation by focusing on the specific needs and characteristics of the model.
Guided by the central tenet of active learning, our method, named ActGen, takes a training-aware approach to image generation.
arXiv Detail & Related papers (2024-03-11T08:45:31Z)
- Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models [23.786473791344395]
Cross-attention layers in diffusion models tend to disproportionately focus on certain tokens during the generation process.
We introduce attention regulation, an on-the-fly optimization approach at inference time to align attention maps with the input text prompt.
Experimental results show that our method consistently outperforms other baselines.
arXiv Detail & Related papers (2024-03-11T02:18:27Z)
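As a rough, hypothetical illustration of inference-time attention regulation, the sketch below takes a few gradient steps on a latent so that every prompt token receives a stronger peak in a (toy) cross-attention map. The attention stand-in, the loss, and the step size are assumptions rather than the paper's exact procedure.

```python
# Toy sketch: nudge the latent so under-attended prompt tokens gain attention mass.
import torch

def toy_cross_attention(latent, num_tokens=4):
    """Stand-in for a U-Net cross-attention map: (spatial, tokens), rows sum to 1."""
    logits = latent.view(-1, 1) * torch.arange(1, num_tokens + 1, dtype=latent.dtype)
    return torch.softmax(logits, dim=-1)

def regulate_attention(latent, steps=10, lr=0.1):
    latent = latent.clone().requires_grad_(True)
    for _ in range(steps):
        attn = toy_cross_attention(latent)
        # Encourage every token to receive a reasonably strong peak response.
        loss = (1.0 - attn.max(dim=0).values).pow(2).sum()
        (grad,) = torch.autograd.grad(loss, latent)
        latent = (latent - lr * grad).detach().requires_grad_(True)
    return latent.detach()

latent = torch.randn(16)
print(regulate_attention(latent).shape)  # torch.Size([16])
```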
- FrameNeRF: A Simple and Efficient Framework for Few-shot Novel View Synthesis [25.356376402671536]
FrameNeRF applies off-the-shelf NeRF models with fast training speed and high rendering quality to few-shot novel view synthesis tasks.
arXiv Detail & Related papers (2024-02-22T14:41:02Z)
- Direct Consistency Optimization for Robust Customization of Text-to-Image Diffusion Models [67.68871360210208]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, can generate visuals with a high degree of consistency. We propose a novel fine-tuning objective, dubbed Direct Consistency Optimization, which controls the deviation between fine-tuning and pretrained models. We show that our approach achieves better prompt fidelity and subject fidelity than those post-optimized for merging regular fine-tuned models.
arXiv Detail & Related papers (2024-02-19T09:52:41Z)
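The summary describes the objective only at a high level. One generic way to "control the deviation" from the pretrained model is to add a consistency regularizer between the two models' noise predictions, as in the toy sketch below; the actual Direct Consistency Optimization objective is more specific than this.

```python
# Toy sketch: fit the subject data while penalizing drift from the frozen pretrained model.
import torch
import torch.nn as nn

def personalization_loss_with_consistency(tuned, pretrained, x_noisy, noise, lam=0.5):
    eps_tuned = tuned(x_noisy)
    with torch.no_grad():
        eps_ref = pretrained(x_noisy)
    recon = (eps_tuned - noise).pow(2).mean()          # fit the subject images
    consistency = (eps_tuned - eps_ref).pow(2).mean()  # stay close to the pretrained prior
    return recon + lam * consistency

dim = 32
tuned, pretrained = nn.Linear(dim, dim), nn.Linear(dim, dim)
x_noisy, noise = torch.randn(8, dim), torch.randn(8, dim)
print(personalization_loss_with_consistency(tuned, pretrained, x_noisy, noise))
```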
- CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation [49.3016007471979]
Large generative diffusion models have revolutionized text-to-image generation and offer immense potential for conditional generation tasks.
However, their widespread adoption is hindered by the high computational cost, which limits their real-time application.
We introduce a novel method dubbed CoDi, that adapts a pre-trained latent diffusion model to accept additional image conditioning inputs.
arXiv Detail & Related papers (2023-10-02T17:59:18Z)
- Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis [62.07413805483241]
Steered Diffusion is a framework for zero-shot conditional image generation using a diffusion model trained for unconditional generation.
We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution.
arXiv Detail & Related papers (2023-09-30T02:03:22Z)
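As a loose illustration of zero-shot steering with an unconditionally trained model, the sketch below nudges each sampling step with the gradient of a condition-matching loss (here, matching masked target values) computed on the model's current estimate. The denoiser, the loss, and the guidance scale are placeholders, not the paper's exact procedure.

```python
# Toy sketch: loss-gradient steering of an unconditional denoiser toward a condition.
import torch
import torch.nn as nn

denoiser = nn.Linear(64, 64)          # stand-in for an unconditional diffusion model
target = torch.randn(1, 64)           # e.g., known pixels we want the output to match
mask = (torch.rand(1, 64) > 0.5).float()

@torch.no_grad()
def steered_step(x, guidance_scale=1.0, step_size=0.05):
    with torch.enable_grad():
        x_in = x.clone().requires_grad_(True)
        x0_est = denoiser(x_in)                          # rough clean-image estimate
        loss = ((mask * (x0_est - target)) ** 2).mean()  # condition-matching loss
        (grad,) = torch.autograd.grad(loss, x_in)
    eps = denoiser(x)
    return x - step_size * eps - guidance_scale * grad   # steer toward the condition

x = torch.randn(1, 64)
for _ in range(20):
    x = steered_step(x)
print(x.shape)  # torch.Size([1, 64])
```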
- Conditional Generation from Unconditional Diffusion Models using Denoiser Representations [94.04631421741986]
We propose adapting pre-trained unconditional diffusion models to new conditions using the learned internal representations of the denoiser network.
We show that augmenting the Tiny ImageNet training set with synthetic images generated by our approach improves the classification accuracy of ResNet baselines by up to 8%.
arXiv Detail & Related papers (2023-06-02T20:09:57Z)