HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion
Models
- URL: http://arxiv.org/abs/2312.00079v1
- Date: Thu, 30 Nov 2023 02:33:29 GMT
- Title: HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion
Models
- Authors: Zhonghao Wang, Wei Wei, Yang Zhao, Zhisheng Xiao, Mark
Hasegawa-Johnson, Humphrey Shi, Tingbo Hou
- Abstract summary: We introduce an innovative algorithm named HiFi Tuner to enhance the appearance preservation of objects during personalized image generation.
Key enhancements include the utilization of mask guidance, a novel parameter regularization technique, and the incorporation of step-wise subject representations.
We extend our method to a novel image editing task: substituting the subject in an image through textual manipulations.
- Score: 56.112302700630806
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper explores advancements in high-fidelity personalized image
generation through the utilization of pre-trained text-to-image diffusion
models. While previous approaches have made significant strides in generating
versatile scenes based on text descriptions and a few input images, challenges
persist in maintaining the subject fidelity within the generated images. In
this work, we introduce an innovative algorithm named HiFi Tuner to enhance the
appearance preservation of objects during personalized image generation. Our
proposed method employs a parameter-efficient fine-tuning framework, comprising
a denoising process and a pivotal inversion process. Key enhancements include
the utilization of mask guidance, a novel parameter regularization technique,
and the incorporation of step-wise subject representations to elevate the
sample fidelity. Additionally, we propose a reference-guided generation
approach that leverages the pivotal inversion of a reference image to mitigate
unwanted subject variations and artifacts. We further extend our method to a
novel image editing task: substituting the subject in an image through textual
manipulations. Experimental evaluations conducted on the DreamBooth dataset
using the Stable Diffusion model showcase promising results. Fine-tuning
solely on textual embeddings improves the CLIP-T score by 3.6 points and the
DINO score by 9.6 points over Textual Inversion. When fine-tuning all
parameters, HiFi Tuner improves the CLIP-T score by 1.2 points and the DINO
score by 1.2 points over DreamBooth, establishing a new state of the art.
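Since this listing gives only the abstract, the following is a minimal, hypothetical PyTorch sketch of the general pattern it describes: fine-tuning a learnable subject embedding under a standard denoising objective, plus an L2 pull-back regularizer that keeps tuned parameters near their pre-trained values (one plausible reading of the paper's "parameter regularization"). The network, noising schedule, and all names are stand-ins, not the paper's implementation.

```python
# Toy sketch only: a stand-in denoiser and a linear "noising" schedule
# replace the real Stable Diffusion UNet and DDPM schedule.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim = 64

# Frozen stand-in for a pre-trained denoiser conditioned on a text embedding.
toy_denoiser = torch.nn.Sequential(
    torch.nn.Linear(dim * 2, 128), torch.nn.SiLU(), torch.nn.Linear(128, dim)
)
for p in toy_denoiser.parameters():
    p.requires_grad_(False)

# Learnable subject-token embedding, initialized from a pre-trained one.
init_embed = torch.randn(dim)
subject_embed = torch.nn.Parameter(init_embed.clone())
opt = torch.optim.AdamW([subject_embed], lr=1e-3)
lambda_reg = 0.1  # assumed regularization strength (hypothetical)

subject_latents = torch.randn(16, dim)  # stand-in latents of the subject photos
for step in range(200):
    x0 = subject_latents[torch.randint(0, 16, (4,))]
    noise = torch.randn_like(x0)
    t = torch.rand(x0.size(0), 1)          # toy timestep in [0, 1]
    xt = (1 - t) * x0 + t * noise          # toy forward-noising schedule
    cond = subject_embed.expand(x0.size(0), -1)
    pred = toy_denoiser(torch.cat([xt, cond], dim=-1))
    loss = F.mse_loss(pred, noise)         # standard denoising objective
    # Regularizer: keep the tuned embedding close to its pre-trained value.
    loss = loss + lambda_reg * (subject_embed - init_embed).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The same pull-back idea extends naturally to full-parameter fine-tuning by regularizing all tuned weights toward their pre-trained values.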
Related papers
- Pan-denoising: Guided Hyperspectral Image Denoising via Weighted Represent Coefficient Total Variation [20.240211073097758]
This paper introduces a novel paradigm for hyperspectral image (HSI) denoising, which is termed pan-denoising.
Panchromatic (PAN) images capture similar structures and textures to HSIs but with less noise. Consequently, pan-denoising has the potential to uncover underlying structures and details beyond the internal information modeling of traditional HSI denoising methods.
Experiments on synthetic and real-world datasets demonstrate that PWRCTV outperforms several state-of-the-art methods in terms of metrics and visual quality.
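As a rough illustration of the idea, a simplified PAN-weighted total-variation regularizer can down-weight smoothing across edges that the PAN image confirms are real structure. PWRCTV's actual formulation (a weighted representation-coefficient TV) is more involved; the names and shapes below are illustrative.

```python
import torch

def pan_weighted_tv(hsi, pan, eps=1e-2):
    # hsi: [bands, H, W] noisy hyperspectral cube; pan: [H, W] panchromatic image.
    # Strong PAN gradients (likely real edges) get small weights, so the TV
    # penalty smooths noise in flat regions while preserving structure.
    wx = 1.0 / ((pan[:, 1:] - pan[:, :-1]).abs() + eps)   # [H, W-1]
    wy = 1.0 / ((pan[1:, :] - pan[:-1, :]).abs() + eps)   # [H-1, W]
    tvx = (hsi[:, :, 1:] - hsi[:, :, :-1]).abs()          # [bands, H, W-1]
    tvy = (hsi[:, 1:, :] - hsi[:, :-1, :]).abs()          # [bands, H-1, W]
    return (wx * tvx).mean() + (wy * tvy).mean()

reg = pan_weighted_tv(torch.randn(31, 64, 64), torch.randn(64, 64))
```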
arXiv Detail & Related papers (2024-07-08T16:05:56Z)
- OrientDream: Streamlining Text-to-3D Generation with Explicit Orientation Control [66.03885917320189]
OrientDream is a camera orientation conditioned framework for efficient and multi-view consistent 3D generation from textual prompts.
Our strategy incorporates an explicit camera-orientation-conditioned feature into the pre-training of a 2D text-to-image diffusion module.
Our experiments reveal that our method not only produces high-quality NeRF models with consistent multi-view properties but also optimizes significantly faster than existing methods.
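The summary does not say how the orientation feature is injected; one common pattern, shown below purely as an assumption, is a sinusoidal embedding of the camera azimuth that can be concatenated with (or added to) the text conditioning.

```python
import torch

def camera_orientation_embedding(azimuth, dim=8):
    # Sinusoidal features of the camera azimuth (radians); a stand-in for
    # whatever explicit orientation conditioning OrientDream actually uses.
    freqs = 2.0 ** torch.arange(dim // 2)
    angles = azimuth.unsqueeze(-1) * freqs
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

emb = camera_orientation_embedding(torch.tensor([0.0, 1.57, 3.14]))  # [3, 8]
```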
arXiv Detail & Related papers (2024-06-14T13:16:18Z)
- ReNoise: Real Image Inversion Through Iterative Noising [62.96073631599749]
We introduce an inversion method with a high quality-to-operation ratio, enhancing reconstruction accuracy without increasing the number of operations.
We evaluate the performance of our ReNoise technique using various sampling algorithms and models, including recent accelerated diffusion models.
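The iterative-noising idea can be read as a fixed-point refinement of the implicit DDIM inversion equation: rather than evaluating the noise predictor once at the current latent, the estimate of the next (noisier) latent is re-fed to the predictor for a few iterations. A hedged sketch with a toy noise model (all names illustrative):

```python
import torch

def ddim_invert_step(x_t, eps, alpha_t, alpha_next):
    # Deterministic DDIM inversion: map x_t to the noisier x_{t+1}.
    # alpha_* are cumulative noise-schedule products (alpha-bar).
    x0_pred = (x_t - (1 - alpha_t).sqrt() * eps) / alpha_t.sqrt()
    return alpha_next.sqrt() * x0_pred + (1 - alpha_next).sqrt() * eps

def renoise_invert_step(eps_model, x_t, t_next, alpha_t, alpha_next, n_iters=3):
    x_next = x_t  # initial guess: evaluate the predictor at the current latent
    for _ in range(n_iters):
        eps = eps_model(x_next, t_next)  # re-estimate noise at the refined guess
        x_next = ddim_invert_step(x_t, eps, alpha_t, alpha_next)
    return x_next

eps_model = lambda x, t: 0.1 * x  # toy noise predictor for demonstration
x_next = renoise_invert_step(eps_model, torch.randn(1, 4), t_next=1,
                             alpha_t=torch.tensor(0.90),
                             alpha_next=torch.tensor(0.80))
```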
arXiv Detail & Related papers (2024-03-21T17:52:08Z)
- DreamDrone: Text-to-Image Diffusion Models are Zero-shot Perpetual View Generators [56.994967294931286]
We introduce DreamDrone, a novel zero-shot and training-free pipeline for generating flythrough scenes from textual prompts.
We advocate explicitly warping the intermediate latent code of the pre-trained text-to-image diffusion model for high-quality image generation and unbounded generalization ability.
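DreamDrone's actual warp is driven by camera motion; as a generic illustration of warping an intermediate latent, a flow field can be applied with grid sampling (names and shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def warp_latent(latent, flow):
    # latent: [1, C, h, w]; flow: [1, h, w, 2] pixel offsets toward the next view.
    _, _, h, w = latent.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float().unsqueeze(0) + flow
    grid[..., 0] = grid[..., 0] / (w - 1) * 2 - 1  # normalize x to [-1, 1]
    grid[..., 1] = grid[..., 1] / (h - 1) * 2 - 1  # normalize y to [-1, 1]
    return F.grid_sample(latent, grid, align_corners=True)

warped = warp_latent(torch.randn(1, 4, 32, 32), torch.zeros(1, 32, 32, 2))
```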
arXiv Detail & Related papers (2023-12-14T08:42:26Z)
- Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models [58.46926334842161]
This work illuminates the fundamental reasons for text-image misalignment in compositional generation, pinpointing issues related to low attention activation scores and mask overlaps.
We propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores.
Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability.
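A rough sketch of the two objectives as described above, operating on per-object cross-attention maps; the paper's exact formulation may differ, and the normalization and names here are assumptions:

```python
import torch

def separate_loss(attn):
    # Penalize spatial overlap between the attention maps of different objects.
    a = attn / (attn.sum(dim=(-2, -1), keepdim=True) + 1e-8)
    n = a.size(0)
    loss = attn.new_zeros(())
    for i in range(n):
        for j in range(i + 1, n):
            loss = loss + torch.minimum(a[i], a[j]).sum()
    return loss

def enhance_loss(attn):
    # Push each object's peak attention activation toward 1.
    return (1.0 - attn.amax(dim=(-2, -1))).mean()

attn_maps = torch.rand(3, 16, 16, requires_grad=True)  # [objects, H, W]
total = separate_loss(attn_maps) + enhance_loss(attn_maps)
total.backward()
```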
arXiv Detail & Related papers (2023-12-10T22:07:42Z)
- HiFi-123: Towards High-fidelity One Image to 3D Content Generation [64.81863143986384]
HiFi-123 is a method designed for high-fidelity and multi-view consistent 3D generation.
We present a Reference-Guided Novel View Enhancement (RGNV) technique that significantly improves the fidelity of diffusion-based zero-shot novel view synthesis methods.
We also present a novel Reference-Guided State Distillation (RGSD) loss.
arXiv Detail & Related papers (2023-10-10T16:14:20Z)
- Reconstruct-and-Generate Diffusion Model for Detail-Preserving Image Denoising [16.43285056788183]
We propose a novel approach called the Reconstruct-and-Generate Diffusion Model (RnG).
Our method leverages a reconstructive denoising network to recover the majority of the underlying clean signal.
It employs a diffusion algorithm to generate residual high-frequency details, thereby enhancing visual quality.
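Conceptually, the split can be written as a deterministic reconstruction plus a generated residual; both networks below are toy stand-ins for the paper's actual components:

```python
import torch

def reconstruct_and_generate(noisy, recon_net, residual_sampler):
    base = recon_net(noisy)            # deterministic, detail-smoothed estimate
    residual = residual_sampler(base)  # diffusion-style sampler adds detail
    return base + residual

recon_net = lambda x: 0.9 * x                             # toy denoiser
residual_sampler = lambda b: 0.05 * torch.randn_like(b)   # toy residual model
out = reconstruct_and_generate(torch.randn(1, 3, 8, 8),
                               recon_net, residual_sampler)
```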
arXiv Detail & Related papers (2023-09-19T16:01:20Z)
- HiFA: High-fidelity Text-to-3D Generation with Advanced Diffusion Guidance [19.252300247300145]
This work proposes holistic sampling and smoothing approaches to achieve high-quality text-to-3D generation.
We compute denoising scores in the text-to-image diffusion model's latent and image spaces.
To generate high-quality renderings in a single-stage optimization, we propose regularization for the variance of z-coordinates along NeRF rays.
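One plausible form of such a regularizer, sketched below as an assumption rather than HiFA's exact definition, is the weighted variance of sample depths along each NeRF ray, which encourages density to concentrate on a single surface:

```python
import torch

def z_variance_loss(weights, z_vals, eps=1e-8):
    # weights: [rays, samples] rendering weights; z_vals: [rays, samples] depths.
    w = weights / (weights.sum(dim=-1, keepdim=True) + eps)
    z_mean = (w * z_vals).sum(dim=-1, keepdim=True)
    return (w * (z_vals - z_mean) ** 2).sum(dim=-1).mean()

weights = torch.rand(4, 64).softmax(dim=-1)
z_vals = torch.linspace(2.0, 6.0, 64).expand(4, -1)
loss = z_variance_loss(weights, z_vals)
```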
arXiv Detail & Related papers (2023-05-30T05:56:58Z)