StreamDiffusion: A Pipeline-level Solution for Real-time Interactive
Generation
- URL: http://arxiv.org/abs/2312.12491v1
- Date: Tue, 19 Dec 2023 18:18:33 GMT
- Title: StreamDiffusion: A Pipeline-level Solution for Real-time Interactive
Generation
- Authors: Akio Kodaira, Chenfeng Xu, Toshiki Hazama, Takanori Yoshimoto, Kohei
Ohno, Shogo Mitsuhori, Soichi Sugano, Hanying Cho, Zhijian Liu, Kurt Keutzer
- Abstract summary: We introduce StreamDiffusion, a real-time diffusion pipeline for interactive image generation.
Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction.
We present a novel approach that transforms the original sequential denoising into the denoising process.
- Score: 29.30999290150683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce StreamDiffusion, a real-time diffusion pipeline designed for
interactive image generation. Existing diffusion models are adept at creating
images from text or image prompts, yet they often fall short in real-time
interaction. This limitation becomes particularly evident in scenarios
involving continuous input, such as Metaverse, live video streaming, and
broadcasting, where high throughput is imperative. To address this, we present
a novel approach that transforms the original sequential denoising into the
batching denoising process. Stream Batch eliminates the conventional
wait-and-interact approach and enables fluid and high throughput streams. To
handle the frequency disparity between data input and model throughput, we
design a novel input-output queue for parallelizing the streaming process.
Moreover, the existing diffusion pipeline uses classifier-free guidance(CFG),
which requires additional U-Net computation. To mitigate the redundant
computations, we propose a novel residual classifier-free guidance (RCFG)
algorithm that reduces the number of negative conditional denoising steps to
only one or even zero. Besides, we introduce a stochastic similarity
filter(SSF) to optimize power consumption. Our Stream Batch achieves around
1.5x speedup compared to the sequential denoising method at different denoising
levels. The proposed RCFG leads to speeds up to 2.05x higher than the
conventional CFG. Combining the proposed strategies and existing mature
acceleration tools makes the image-to-image generation achieve up-to 91.07fps
on one RTX4090, improving the throughputs of AutoPipline developed by Diffusers
over 59.56x. Furthermore, our proposed StreamDiffusion also significantly
reduces the energy consumption by 2.39x on one RTX3060 and 1.99x on one
RTX4090, respectively.
Related papers
- ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler [53.98558445900626]
Current image-to-video diffusion models, while powerful in generating videos from a single frame, need adaptation for two-frame conditioned generation.
We introduce a novel, bidirectional sampling strategy to address these off-manifold issues without requiring extensive re-noising or fine-tuning.
Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent and on-manifold generation of intermediate frames.
arXiv Detail & Related papers (2024-10-08T03:01:54Z) - Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z) - Diffusion-Aided Joint Source Channel Coding For High Realism Wireless Image Transmission [24.372996233209854]
DiffJSCC is a novel framework that produces high-realism images via the conditional diffusion denoising process.
It can achieve highly realistic reconstructions for 768x512 pixel Kodak images with only 3072 symbols.
arXiv Detail & Related papers (2024-04-27T00:12:13Z) - Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference [95.42299246592756]
We study the UNet encoder and empirically analyze the encoder features.
We find that encoder features change minimally, whereas the decoder features exhibit substantial variations across different time-steps.
We validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation.
arXiv Detail & Related papers (2023-12-15T08:46:43Z) - StreamFlow: Streamlined Multi-Frame Optical Flow Estimation for Video
Sequences [31.210626775505407]
Occlusions between consecutive frames have long posed a significant challenge in optical flow estimation.
We present a Streamlined In-batch Multi-frame (SIM) pipeline tailored to video input, attaining a similar level of time efficiency to two-frame networks.
StreamFlow not only excels in terms of performance on challenging KITTI and Sintel datasets, with particular improvement in occluded areas.
arXiv Detail & Related papers (2023-11-28T07:53:51Z) - DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and
Highlight Detection [38.12212015133935]
A novel framework, DiffusionVMR, is proposed to redefine the two tasks as a unified conditional denoising generation process.
Experiments conducted on five widely-used benchmarks demonstrate the effectiveness and flexibility of the proposed DiffusionVMR.
arXiv Detail & Related papers (2023-08-29T08:20:23Z) - VideoFusion: Decomposed Diffusion Models for High-Quality Video
Generation [88.49030739715701]
This work presents a decomposed diffusion process via resolving the per-frame noise into a base noise that is shared among all frames and a residual noise that varies along the time axis.
Experiments on various datasets confirm that our approach, termed as VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
arXiv Detail & Related papers (2023-03-15T02:16:39Z) - Q-Diffusion: Quantizing Diffusion Models [52.978047249670276]
Post-training quantization (PTQ) is considered a go-to compression method for other tasks.
We propose a novel PTQ method specifically tailored towards the unique multi-timestep pipeline and model architecture.
We show that our proposed method is able to quantize full-precision unconditional diffusion models into 4-bit while maintaining comparable performance.
arXiv Detail & Related papers (2023-02-08T19:38:59Z) - Real-time Streaming Video Denoising with Bidirectional Buffers [48.57108807146537]
Real-time denoising algorithms are typically adopted on the user device to remove the noise involved during the shooting and transmission of video streams.
Recent multi-output inference works propagate the bidirectional temporal feature with a parallel or recurrent framework.
We propose a Bidirectional Streaming Video Denoising framework, to achieve high-fidelity real-time denoising for streaming videos with both past and future temporal receptive fields.
arXiv Detail & Related papers (2022-07-14T14:01:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.