StreamDiffusion: A Pipeline-level Solution for Real-time Interactive
Generation
- URL: http://arxiv.org/abs/2312.12491v1
- Date: Tue, 19 Dec 2023 18:18:33 GMT
- Title: StreamDiffusion: A Pipeline-level Solution for Real-time Interactive
Generation
- Authors: Akio Kodaira, Chenfeng Xu, Toshiki Hazama, Takanori Yoshimoto, Kohei
Ohno, Shogo Mitsuhori, Soichi Sugano, Hanying Cho, Zhijian Liu, Kurt Keutzer
- Abstract summary: We introduce StreamDiffusion, a real-time diffusion pipeline for interactive image generation.
Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction.
We present a novel approach that transforms the original sequential denoising into the denoising process.
- Score: 29.30999290150683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce StreamDiffusion, a real-time diffusion pipeline designed for
interactive image generation. Existing diffusion models are adept at creating
images from text or image prompts, yet they often fall short in real-time
interaction. This limitation becomes particularly evident in scenarios
involving continuous input, such as Metaverse, live video streaming, and
broadcasting, where high throughput is imperative. To address this, we present
a novel approach that transforms the original sequential denoising into the
batching denoising process. Stream Batch eliminates the conventional
wait-and-interact approach and enables fluid and high throughput streams. To
handle the frequency disparity between data input and model throughput, we
design a novel input-output queue for parallelizing the streaming process.
Moreover, the existing diffusion pipeline uses classifier-free guidance(CFG),
which requires additional U-Net computation. To mitigate the redundant
computations, we propose a novel residual classifier-free guidance (RCFG)
algorithm that reduces the number of negative conditional denoising steps to
only one or even zero. Besides, we introduce a stochastic similarity
filter(SSF) to optimize power consumption. Our Stream Batch achieves around
1.5x speedup compared to the sequential denoising method at different denoising
levels. The proposed RCFG leads to speeds up to 2.05x higher than the
conventional CFG. Combining the proposed strategies and existing mature
acceleration tools makes the image-to-image generation achieve up-to 91.07fps
on one RTX4090, improving the throughputs of AutoPipline developed by Diffusers
over 59.56x. Furthermore, our proposed StreamDiffusion also significantly
reduces the energy consumption by 2.39x on one RTX3060 and 1.99x on one
RTX4090, respectively.
Related papers
- Improving the Diffusability of Autoencoders [54.920783089085035]
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos.
We perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces.
We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality.
arXiv Detail & Related papers (2025-02-20T18:45:44Z) - One Diffusion Step to Real-World Super-Resolution via Flow Trajectory Distillation [60.54811860967658]
FluxSR is a novel one-step diffusion Real-ISR based on flow matching models.
First, we introduce Flow Trajectory Distillation (FTD) to distill a multi-step flow matching model into a one-step Real-ISR.
Second, to improve image realism and address high-frequency artifact issues in generated images, we propose TV-LPIPS as a perceptual loss.
arXiv Detail & Related papers (2025-02-04T04:11:29Z) - ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler [53.98558445900626]
Current image-to-video diffusion models, while powerful in generating videos from a single frame, need adaptation for two-frame conditioned generation.
We introduce a novel, bidirectional sampling strategy to address these off-manifold issues without requiring extensive re-noising or fine-tuning.
Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent and on-manifold generation of intermediate frames.
arXiv Detail & Related papers (2024-10-08T03:01:54Z) - RDEIC: Accelerating Diffusion-Based Extreme Image Compression with Relay Residual Diffusion [29.277211609920155]
We present Relay Residual Diffusion Extreme Image Compression (RDEIC)
We first use the compressed latent features of the image with added noise, instead of pure noise, as the starting point to eliminate the unnecessary initial stages of the denoising process.
RDEIC achieves state-of-the-art visual quality and outperforms existing diffusion-based extreme image compression methods in both fidelity and efficiency.
arXiv Detail & Related papers (2024-10-03T16:24:20Z) - Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z) - Diffusion-Aided Joint Source Channel Coding For High Realism Wireless Image Transmission [24.372996233209854]
DiffJSCC is a novel framework that produces high-realism images via the conditional diffusion denoising process.
It can achieve highly realistic reconstructions for 768x512 pixel Kodak images with only 3072 symbols.
arXiv Detail & Related papers (2024-04-27T00:12:13Z) - Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference [95.42299246592756]
We study the UNet encoder and empirically analyze the encoder features.
We find that encoder features change minimally, whereas the decoder features exhibit substantial variations across different time-steps.
We validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation.
arXiv Detail & Related papers (2023-12-15T08:46:43Z) - StreamFlow: Streamlined Multi-Frame Optical Flow Estimation for Video
Sequences [31.210626775505407]
Occlusions between consecutive frames have long posed a significant challenge in optical flow estimation.
We present a Streamlined In-batch Multi-frame (SIM) pipeline tailored to video input, attaining a similar level of time efficiency to two-frame networks.
StreamFlow not only excels in terms of performance on challenging KITTI and Sintel datasets, with particular improvement in occluded areas.
arXiv Detail & Related papers (2023-11-28T07:53:51Z) - Real-time Streaming Video Denoising with Bidirectional Buffers [48.57108807146537]
Real-time denoising algorithms are typically adopted on the user device to remove the noise involved during the shooting and transmission of video streams.
Recent multi-output inference works propagate the bidirectional temporal feature with a parallel or recurrent framework.
We propose a Bidirectional Streaming Video Denoising framework, to achieve high-fidelity real-time denoising for streaming videos with both past and future temporal receptive fields.
arXiv Detail & Related papers (2022-07-14T14:01:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.