One-Step Image Translation with Text-to-Image Models
- URL: http://arxiv.org/abs/2403.12036v1
- Date: Mon, 18 Mar 2024 17:59:40 GMT
- Title: One-Step Image Translation with Text-to-Image Models
- Authors: Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, Jun-Yan Zhu
- Abstract summary: We introduce a general method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives.
We consolidate various modules of the vanilla latent diffusion model into a single end-to-end generator network with small trainable weights.
Our model CycleGAN-Turbo outperforms existing GAN-based and diffusion-based methods for various scene translation tasks.
- Score: 35.0987002313882
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we address two limitations of existing conditional diffusion models: their slow inference speed due to the iterative denoising process and their reliance on paired data for model fine-tuning. To tackle these issues, we introduce a general method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives. Specifically, we consolidate various modules of the vanilla latent diffusion model into a single end-to-end generator network with small trainable weights, enhancing its ability to preserve the input image structure while reducing overfitting. We demonstrate that, for unpaired settings, our model CycleGAN-Turbo outperforms existing GAN-based and diffusion-based methods for various scene translation tasks, such as day-to-night conversion and adding/removing weather effects like fog, snow, and rain. We extend our method to paired settings, where our model pix2pix-Turbo is on par with recent works like Control-Net for Sketch2Photo and Edge2Image, but with a single-step inference. This work suggests that single-step diffusion models can serve as strong backbones for a range of GAN learning objectives. Our code and models are available at https://github.com/GaParmar/img2img-turbo.
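For a rough sense of the training setup described in the abstract, the sketch below pairs a single-step generator with a CycleGAN-style adversarial plus cycle-consistency objective. Everything here is a hypothetical toy stand-in (small convolutional networks and random tensors instead of the adapted text-to-image backbone, its small trainable weights, and real unpaired data); it is not the authors' released implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Toy stand-ins: in the paper the generator is the consolidated single-step
    # text-to-image model with small trainable weights; here it is a tiny convnet
    # so the sketch runs end to end.
    class OneStepGenerator(nn.Module):
        def __init__(self, ch=3, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(ch, hidden, 3, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, ch, 3, padding=1))

        def forward(self, x):
            return torch.tanh(self.net(x))  # one forward pass, no denoising loop

    class PatchDiscriminator(nn.Module):
        def __init__(self, ch=3, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(ch, hidden, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(hidden, 1, 4, stride=2, padding=1))

        def forward(self, x):
            return self.net(x)

    G_ab, G_ba = OneStepGenerator(), OneStepGenerator()   # e.g. day->night, night->day
    D_a, D_b = PatchDiscriminator(), PatchDiscriminator()
    opt_g = torch.optim.Adam(list(G_ab.parameters()) + list(G_ba.parameters()), lr=2e-4)
    opt_d = torch.optim.Adam(list(D_a.parameters()) + list(D_b.parameters()), lr=2e-4)

    real_a = torch.rand(2, 3, 64, 64)  # placeholder unpaired batch from domain A
    real_b = torch.rand(2, 3, 64, 64)  # placeholder unpaired batch from domain B

    # Generator update: least-squares adversarial loss plus cycle-consistency loss.
    fake_b, fake_a = G_ab(real_a), G_ba(real_b)
    cycle = F.l1_loss(G_ba(fake_b), real_a) + F.l1_loss(G_ab(fake_a), real_b)
    pred_b, pred_a = D_b(fake_b), D_a(fake_a)
    adv = F.mse_loss(pred_b, torch.ones_like(pred_b)) + F.mse_loss(pred_a, torch.ones_like(pred_a))
    loss_g = adv + 10.0 * cycle
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # Discriminator update on domain B: real vs. translated samples.
    pred_real, pred_fake = D_b(real_b), D_b(fake_b.detach())
    loss_d = F.mse_loss(pred_real, torch.ones_like(pred_real)) + \
             F.mse_loss(pred_fake, torch.zeros_like(pred_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

Because the generator produces the translated image in a single forward pass, the adversarial and cycle terms can be backpropagated directly, which is what makes GAN-style objectives practical on top of a one-step diffusion backbone.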
Related papers
- Truncated Consistency Models [57.50243901368328]
Training consistency models requires learning to map all intermediate points along probability flow (PF) ODE trajectories to their corresponding endpoints.
We empirically find that this training paradigm limits the one-step generation performance of consistency models.
We propose a new parameterization of the consistency function and a two-stage training procedure that prevents the truncated-time training from collapsing to a trivial solution.
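For context, the objective this refers to trains a consistency function f_theta(x_t, t) so that neighbouring points on the same PF ODE trajectory map to the same endpoint. A minimal consistency-distillation-style sketch follows; the tiny network, the VE-style noise schedule, and the oracle Euler step (which uses the true clean sample in place of a pretrained teacher) are simplifying assumptions, and this is not the paper's truncated variant.

    import copy
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Hypothetical consistency function f_theta(x, t) -> predicted trajectory endpoint.
    class ConsistencyNet(nn.Module):
        def __init__(self, dim=16):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))

        def forward(self, x, t):
            return self.net(torch.cat([x, t.expand(x.shape[0], 1)], dim=-1))

    student = ConsistencyNet()
    target = copy.deepcopy(student)          # EMA target network
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)

    x0 = torch.randn(8, 16)                                 # placeholder clean data
    t_next, t_cur = torch.tensor(0.8), torch.tensor(0.7)    # adjacent trajectory times
    x_next = x0 + t_next * torch.randn_like(x0)             # VE-style perturbation, sigma(t) = t

    # One Euler step of the PF ODE from t_next to t_cur, using the true x0 as an
    # oracle in place of a pretrained teacher score model.
    x_cur = x_next + (t_cur - t_next) * (x_next - x0) / t_next

    # Consistency loss: both points should map to the same endpoint.
    loss = F.mse_loss(student(x_next, t_next), target(x_cur, t_cur).detach())
    opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                    # EMA update of the target network
        for p_t, p_s in zip(target.parameters(), student.parameters()):
            p_t.mul_(0.999).add_(p_s, alpha=0.001)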
arXiv Detail & Related papers (2024-10-18T22:38:08Z)
- Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model [31.70050311326183]
Diffusion models tend to generate videos with less motion than expected.
We address this issue from both inference and training aspects.
Our methods outperform baselines by producing higher motion scores with lower errors.
arXiv Detail & Related papers (2024-06-22T04:56:16Z)
- Plug-and-Play Diffusion Distillation [14.359953671470242]
We propose a new distillation approach for guided diffusion models.
An external lightweight guide model is trained while the original text-to-image model remains frozen.
We show that our method reduces the inference computation of classifier-free guided latent-space diffusion models by almost half.
arXiv Detail & Related papers (2024-06-04T04:22:47Z)
- Fixed Point Diffusion Models [13.035518953879539]
The Fixed Point Diffusion Model (FPDM) is a novel approach to image generation that integrates the concept of fixed point solving into the framework of diffusion-based generative modeling.
Our approach embeds an implicit fixed point solving layer into the denoising network of a diffusion model, transforming the diffusion process into a sequence of closely-related fixed point problems.
We conduct experiments with state-of-the-art models on ImageNet, FFHQ, CelebA-HQ, and LSUN-Church, demonstrating substantial improvements in performance and efficiency.
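As a rough illustration of what an implicit fixed-point solving layer looks like (a hypothetical toy layer, not the FPDM architecture), the layer below iterates its own update h = f(h, x) until the hidden state stops changing, so the compute spent per call can be traded against accuracy via the iteration budget:

    import torch
    import torch.nn as nn

    class FixedPointLayer(nn.Module):
        """Toy implicit layer: approximately solve h = f(h, x) by fixed-point iteration."""
        def __init__(self, dim=32, max_iters=50, tol=1e-4):
            super().__init__()
            self.f = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())
            self.max_iters, self.tol = max_iters, tol

        def forward(self, x):
            h = torch.zeros_like(x)
            for _ in range(self.max_iters):
                h_new = self.f(torch.cat([h, x], dim=-1))
                converged = (h_new - h).norm() < self.tol * max(h.norm().item(), 1.0)
                h = h_new
                if converged:
                    break
            return h

    layer = FixedPointLayer()
    x = torch.randn(4, 32)       # placeholder features from the denoising network
    h_star = layer(x)            # approximate fixed point h* = f(h*, x)
    print(h_star.shape)

In FPDM the analogous layer sits inside the denoising network, turning sampling into the sequence of closely related fixed-point problems mentioned above; the toy version here only illustrates the inner iteration.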
arXiv Detail & Related papers (2024-01-16T18:55:54Z)
- DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing [28.593023489682654]
We present DiffMorpher, the first approach enabling smooth and natural image morphing using diffusion models.
Our key idea is to capture the semantics of the two images by fitting a LoRA to each of them, and to interpolate between both the LoRA parameters and the latent noises to ensure a smooth semantic transition.
In addition, we propose an attention interpolation and injection technique and a new sampling schedule to further enhance the smoothness between consecutive images.
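To make the interpolation idea concrete, here is a minimal sketch in which the LoRA parameters are blended linearly and the latent noises are blended spherically. The weight dictionaries and latents are hypothetical random tensors rather than fitted LoRAs and inverted noises, and this is not the DiffMorpher codebase:

    import torch

    def lerp_lora(lora_a, lora_b, alpha):
        """Linearly interpolate two LoRA state dicts with matching keys and shapes."""
        return {k: (1 - alpha) * lora_a[k] + alpha * lora_b[k] for k in lora_a}

    def slerp(z_a, z_b, alpha, eps=1e-7):
        """Spherical interpolation between two latent noise tensors."""
        a, b = z_a.flatten(), z_b.flatten()
        omega = torch.acos(torch.clamp(
            torch.dot(a, b) / (a.norm() * b.norm() + eps), -1 + eps, 1 - eps))
        so = torch.sin(omega)
        out = (torch.sin((1 - alpha) * omega) / so) * a + (torch.sin(alpha * omega) / so) * b
        return out.view_as(z_a)

    # Hypothetical LoRA weights fitted to image A and image B, and their latent noises.
    lora_a = {"lora.weight": torch.randn(8, 8)}
    lora_b = {"lora.weight": torch.randn(8, 8)}
    z_a, z_b = torch.randn(4, 64, 64), torch.randn(4, 64, 64)

    for alpha in torch.linspace(0, 1, steps=5):
        lora_t = lerp_lora(lora_a, lora_b, alpha.item())   # blended adapter weights
        z_t = slerp(z_a, z_b, alpha.item())                # blended latent noise
        # A real pipeline would load lora_t into the denoiser and sample from z_t
        # to render the intermediate frame; here we only report the blend.
        print(alpha.item(), lora_t["lora.weight"].mean().item(), z_t.std().item())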
arXiv Detail & Related papers (2023-12-12T16:28:08Z)
- ExposureDiffusion: Learning to Expose for Low-light Image Enhancement [87.08496758469835]
This work addresses low-light image enhancement by seamlessly integrating a diffusion model with a physics-based exposure model.
Our method obtains significantly improved performance and reduced inference time compared with vanilla diffusion models.
The proposed framework works with real-paired datasets, SOTA noise models, and different backbone networks.
arXiv Detail & Related papers (2023-07-15T04:48:35Z)
- SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds [88.06788636008051]
Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers.
These models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run.
We present a generic approach that unlocks running text-to-image diffusion models on mobile devices in less than 2 seconds.
arXiv Detail & Related papers (2023-06-01T17:59:25Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- SinDiffusion: Learning a Diffusion Model from a Single Natural Image [159.4285444680301]
We present SinDiffusion, which leverages denoising diffusion models to capture the internal distribution of patches within a single natural image.
It is based on two core designs. First, SinDiffusion is trained with a single model at a single scale instead of multiple models with progressive growing of scales.
Second, we identify that a patch-level receptive field of the diffusion network is crucial and effective for capturing the image's patch statistics.
arXiv Detail & Related papers (2022-11-22T18:00:03Z)
- On Distillation of Guided Diffusion Models [94.95228078141626]
We propose an approach to distilling classifier-free guided diffusion models into models that are fast to sample from.
For standard diffusion models trained in pixel space, our approach generates images visually comparable to those of the original model.
For diffusion models trained in the latent space (e.g., Stable Diffusion), our approach generates high-fidelity images using as few as 1 to 4 denoising steps.
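The core idea can be sketched with toy networks: the student takes the guidance scale w as an extra input and regresses the teacher's classifier-free guided prediction, so a single student pass replaces the teacher's conditional and unconditional passes. Every module and tensor below is a hypothetical stand-in for the pretrained models and data used in the paper, and the subsequent step-reduction stage of the distillation is not shown.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    dim = 16

    # Hypothetical frozen teacher with conditional and unconditional branches.
    class ToyTeacher(nn.Module):
        def __init__(self):
            super().__init__()
            self.cond = nn.Linear(dim + 1, dim)
            self.uncond = nn.Linear(dim + 1, dim)

        def forward(self, x, t, use_cond):
            inp = torch.cat([x, t], dim=-1)
            return self.cond(inp) if use_cond else self.uncond(inp)

    teacher = ToyTeacher().requires_grad_(False)
    # The student is conditioned on the guidance scale w, so one forward pass
    # reproduces the guided prediction.
    student = nn.Linear(dim + 2, dim)
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)

    x = torch.randn(8, dim)                      # placeholder noisy latents
    t = torch.rand(8, 1)                         # placeholder timesteps
    w = torch.empty(8, 1).uniform_(1.0, 8.0)     # sampled guidance scales

    with torch.no_grad():
        eps_c = teacher(x, t, use_cond=True)
        eps_u = teacher(x, t, use_cond=False)
        target = eps_u + w * (eps_c - eps_u)     # classifier-free guided prediction

    pred = student(torch.cat([x, t, w], dim=-1))
    loss = F.mse_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()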
arXiv Detail & Related papers (2022-10-06T18:03:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.