Related papers: StyleInject: Parameter Efficient Tuning of Text-to-Image Diffusion Models

URL: http://arxiv.org/abs/2401.13942v2
Date: Fri, 10 May 2024 06:03:33 GMT
Title: StyleInject: Parameter Efficient Tuning of Text-to-Image Diffusion Models
Authors: Mohan Zhou, Yalong Bai, Qing Yang, Tiejun Zhao
Abstract summary: StyleInject is a specialized fine-tuning approach tailored for text-to-image models. It adapts to varying styles by adjusting the variance of visual features based on the characteristics of the input signal. It proves particularly effective in learning from and enhancing a range of advanced, community-fine-tuned generative models.
Score: 35.732715025002705
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The ability to fine-tune generative models for text-to-image generation tasks is crucial, particularly facing the complexity involved in accurately interpreting and visualizing textual inputs. While LoRA is efficient for language model adaptation, it often falls short in text-to-image tasks due to the intricate demands of image generation, such as accommodating a broad spectrum of styles and nuances. To bridge this gap, we introduce StyleInject, a specialized fine-tuning approach tailored for text-to-image models. StyleInject comprises multiple parallel low-rank parameter matrices, maintaining the diversity of visual features. It dynamically adapts to varying styles by adjusting the variance of visual features based on the characteristics of the input signal. This approach significantly minimizes the impact on the original model's text-image alignment capabilities while adeptly adapting to various styles in transfer learning. StyleInject proves particularly effective in learning from and enhancing a range of advanced, community-fine-tuned generative models. Our comprehensive experiments, including both small-sample and large-scale data fine-tuning as well as base model distillation, show that StyleInject surpasses traditional LoRA in both text-image semantic consistency and human preference evaluation, all while ensuring greater parameter efficiency.
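The abstract describes several parallel low-rank matrices whose contribution depends on the input signal. The following is a minimal pure-Python sketch of that general idea only, not the authors' implementation: the class name `ParallelLowRankLinear`, the softmax gate, and the zero initialization of the up-projections are all assumptions made for illustration, and the gate is a simple stand-in for the variance adjustment the paper describes.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ParallelLowRankLinear:
    """Hypothetical layer: frozen weight plus several gated low-rank updates."""

    def __init__(self, weight, rank, branches, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weight, never updated
        # Down-projections (rank x d_in) start small and random.
        self.down = [[[rng.gauss(0.0, 0.02) for _ in range(d_in)]
                      for _ in range(rank)] for _ in range(branches)]
        # Up-projections (d_out x rank) start at zero (assumed, LoRA-style),
        # so the layer initially reproduces the base model exactly.
        self.up = [[[0.0] * rank for _ in range(d_out)]
                   for _ in range(branches)]
        # Assumed gate: one score per branch, computed from the input.
        self.gate = [[rng.gauss(0.0, 0.02) for _ in range(d_in)]
                     for _ in range(branches)]

    def mixing_weights(self, x):
        """Input-dependent weights over the parallel branches."""
        return softmax(matvec(self.gate, x))

    def forward(self, x):
        out = matvec(self.weight, x)
        for g, down, up in zip(self.mixing_weights(x), self.down, self.up):
            delta = matvec(up, matvec(down, x))  # low-rank update for x
            out = [o + g * d for o, d in zip(out, delta)]
        return out

    def trainable_parameters(self):
        """Count the parameters outside the frozen base weight."""
        count = 0
        for group in (self.down, self.up):
            for matrix in group:
                count += sum(len(row) for row in matrix)
        count += sum(len(row) for row in self.gate)
        return count


# Toy usage: a 3x4 frozen weight with two rank-1 branches.
weight = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0]]
layer = ParallelLowRankLinear(weight, rank=1, branches=2)
x = [1.0, 2.0, 3.0, 4.0]
y = layer.forward(x)
```

Because the up-projections start at zero in this sketch, `y` equals the frozen layer's output until the added matrices are trained. At this toy size the added parameters outnumber the base weight; any parameter savings would only appear when the rank is small relative to the layer dimensions.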

Related papers

StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation [24.588779332021137]
Multimodal autoregressive (AR) models have shown exceptional capabilities across various domains. Style-aligned generation requires a reference style image and prompt, resulting in a text-image-to-image triplet. We propose StyleAR, an innovative approach that combines a specially designed data curation method with our proposed AR models.
arXiv Detail & Related papers (2025-05-26T12:01:15Z)
Bringing Characters to New Stories: Training-Free Theme-Specific Image Generation via Dynamic Visual Prompting [71.29100512700064]
We present T-Prompter, a training-free method for theme-specific image generation. T-Prompter integrates reference images into generative models, allowing users to seamlessly specify the target theme. Our approach enables consistent story generation, character design, realistic character generation, and style-guided image generation.
arXiv Detail & Related papers (2025-01-26T19:01:19Z)
ArtCrafter: Text-Image Aligning Style Transfer via Embedding Reframing [25.610375901522886]
ArtCrafter is a novel framework for text-to-image style transfer. We introduce an attention-based style extraction module. We also present a novel text-image aligning augmentation component.
arXiv Detail & Related papers (2025-01-03T19:17:27Z)
Beyond Color and Lines: Zero-Shot Style-Specific Image Variations with Coordinated Semantics [3.9717825324709413]
Style has been primarily considered in terms of artistic elements such as colors, brushstrokes, and lighting. In this study, we propose a zero-shot scheme for image variation with coordinated semantics.
arXiv Detail & Related papers (2024-10-24T08:34:57Z)
ZePo: Zero-Shot Portrait Stylization with Faster Sampling [61.14140480095604]
This paper presents an inversion-free portrait stylization framework based on diffusion models that accomplishes content and style feature fusion in merely four sampling steps. We propose a feature merging strategy to amalgamate redundant features in Consistency Features, thereby reducing the computational load of attention control.
arXiv Detail & Related papers (2024-08-10T08:53:41Z)
ArtWeaver: Advanced Dynamic Style Integration via Diffusion Model [73.95608242322949]
Stylized Text-to-Image Generation (STIG) aims to generate images from text prompts and style reference images. We present ArtWeaver, a novel framework that leverages pretrained Stable Diffusion to address challenges such as misinterpreted styles and inconsistent semantics.
arXiv Detail & Related papers (2024-05-24T07:19:40Z)
Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Text-to-image models can portray the same subject across diverse prompts. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects. We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z)
ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors [105.37795139586075]
We propose a new task for "stylizing" text-to-image models, namely text-driven stylized image generation. We present a new diffusion model (ControlStyle) via upgrading a pre-trained text-to-image model with a trainable modulation network. Experiments demonstrate the effectiveness of our ControlStyle in producing more visually pleasing and artistic results.
arXiv Detail & Related papers (2023-11-09T15:50:52Z)
Improving Compositional Text-to-image Generation with Large Vision-Language Models [26.202725136839632]
Compositional text-to-image models frequently encounter difficulties in generating high-quality images that align with input texts. We employ large vision-language models (LVLMs) for multi-dimensional assessment of the alignment between generated images and their corresponding input texts. Our experimental results validate that the proposed methodology significantly improves text-image alignment in compositional image generation.
arXiv Detail & Related papers (2023-10-10T05:09:05Z)
SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models. Our approach can make text-to-image diffusion models easier to use with better user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models [60.63556257324894]
A key desired property of image generative models is the ability to disentangle different attributes. We propose a simple, light-weight image editing algorithm where the mixing weights of the two text embeddings are optimized for style matching and content preservation. Experiments show that the proposed method can modify a wide range of attributes, with the performance outperforming diffusion-model-based image-editing algorithms.
arXiv Detail & Related papers (2022-12-16T19:58:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.