Related papers: StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation

StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation

URL: http://arxiv.org/abs/2312.12491v1
Date: Tue, 19 Dec 2023 18:18:33 GMT
Title: StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation
Authors: Akio Kodaira, Chenfeng Xu, Toshiki Hazama, Takanori Yoshimoto, Kohei Ohno, Shogo Mitsuhori, Soichi Sugano, Hanying Cho, Zhijian Liu, Kurt Keutzer
Abstract summary: We introduce StreamDiffusion, a real-time diffusion pipeline for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. We present a novel approach that transforms the original sequential denoising into the denoising process.
Score: 29.30999290150683
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce StreamDiffusion, a real-time diffusion pipeline designed for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. This limitation becomes particularly evident in scenarios involving continuous input, such as Metaverse, live video streaming, and broadcasting, where high throughput is imperative. To address this, we present a novel approach that transforms the original sequential denoising into the batching denoising process. Stream Batch eliminates the conventional wait-and-interact approach and enables fluid and high throughput streams. To handle the frequency disparity between data input and model throughput, we design a novel input-output queue for parallelizing the streaming process. Moreover, the existing diffusion pipeline uses classifier-free guidance(CFG), which requires additional U-Net computation. To mitigate the redundant computations, we propose a novel residual classifier-free guidance (RCFG) algorithm that reduces the number of negative conditional denoising steps to only one or even zero. Besides, we introduce a stochastic similarity filter(SSF) to optimize power consumption. Our Stream Batch achieves around 1.5x speedup compared to the sequential denoising method at different denoising levels. The proposed RCFG leads to speeds up to 2.05x higher than the conventional CFG. Combining the proposed strategies and existing mature acceleration tools makes the image-to-image generation achieve up-to 91.07fps on one RTX4090, improving the throughputs of AutoPipline developed by Diffusers over 59.56x. Furthermore, our proposed StreamDiffusion also significantly reduces the energy consumption by 2.39x on one RTX3060 and 1.99x on one RTX4090, respectively.

Related papers

LLIA -- Enabling Low-Latency Interactive Avatars: Real-Time Audio-Driven Portrait Video Generation with Diffusion Models [17.858801012726445]
Diffusion-based models have gained wide adoption in the virtual human generation due to their outstanding expressiveness.<n>We present a novel audio-driven portrait video generation framework based on the diffusion model to address these challenges.<n>Our model achieves a maximum of 78 FPS at a resolution of 384x384 and 45 FPS at a resolution of 512x512, with an initial video generation latency of 140 ms and 215 ms, respectively.
arXiv Detail & Related papers (2025-06-06T07:09:07Z)
Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion [16.99620863197586]
Diffusion language models offer parallel token generation and inherent bidirectionality.<n>State-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference.<n>We introduce Guided Diffusion, a training-free method that uses a lightweight pretrained autoregressive model to supervise token unmasking.<n>For the first time, diffusion language models achieve a comparable and even faster latency as the widely adopted autoregressive models.
arXiv Detail & Related papers (2025-05-27T17:39:39Z)
DDT: Decoupled Diffusion Transformer [51.84206763079382]
Diffusion transformers encode noisy inputs to extract semantic component and decode higher frequency with identical modules. textbfcolorddtDecoupled textbfcolorddtTransformer(textbfcolorddtDDT) textbfcolorddtTransformer(textbfcolorddtDDT) textbfcolorddtTransformer(textbfcolorddtDDT)
arXiv Detail & Related papers (2025-04-08T07:17:45Z)
Improving the Diffusability of Autoencoders [54.920783089085035]
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos. We perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality.
arXiv Detail & Related papers (2025-02-20T18:45:44Z)
One Diffusion Step to Real-World Super-Resolution via Flow Trajectory Distillation [60.54811860967658]
FluxSR is a novel one-step diffusion Real-ISR based on flow matching models. First, we introduce Flow Trajectory Distillation (FTD) to distill a multi-step flow matching model into a one-step Real-ISR. Second, to improve image realism and address high-frequency artifact issues in generated images, we propose TV-LPIPS as a perceptual loss.
arXiv Detail & Related papers (2025-02-04T04:11:29Z)
Fast constrained sampling in pre-trained diffusion models [77.21486516041391]
We propose an algorithm that enables fast and high-quality generation under arbitrary constraints. During inference, we can interchange between gradient updates computed on the noisy image and updates computed on the final, clean image. Our approach produces results that rival or surpass the state-of-the-art training-free inference approaches.
arXiv Detail & Related papers (2024-10-24T14:52:38Z)
ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler [53.98558445900626]
Current image-to-video diffusion models, while powerful in generating videos from a single frame, need adaptation for two-frame conditioned generation. We introduce a novel, bidirectional sampling strategy to address these off-manifold issues without requiring extensive re-noising or fine-tuning. Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent and on-manifold generation of intermediate frames.
arXiv Detail & Related papers (2024-10-08T03:01:54Z)
RDEIC: Accelerating Diffusion-Based Extreme Image Compression with Relay Residual Diffusion [29.277211609920155]
We present Relay Residual Diffusion Extreme Image Compression (RDEIC) We first use the compressed latent features of the image with added noise, instead of pure noise, as the starting point to eliminate the unnecessary initial stages of the denoising process. RDEIC achieves state-of-the-art visual quality and outperforms existing diffusion-based extreme image compression methods in both fidelity and efficiency.
arXiv Detail & Related papers (2024-10-03T16:24:20Z)
Diffusion-Driven Semantic Communication for Generative Models with Bandwidth Constraints [66.63250537475973]
This paper introduces a diffusion-driven semantic communication framework with advanced VAE-based compression for bandwidth-constrained generative model.<n>Our experimental results demonstrate significant improvements in pixel-level metrics like peak signal to noise ratio (PSNR) and semantic metrics like learned perceptual image patch similarity (LPIPS)
arXiv Detail & Related papers (2024-07-26T02:34:25Z)
Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio. We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z)
Diffusion-Aided Joint Source Channel Coding For High Realism Wireless Image Transmission [24.372996233209854]
DiffJSCC is a novel framework that produces high-realism images via the conditional diffusion denoising process. It can achieve highly realistic reconstructions for 768x512 pixel Kodak images with only 3072 symbols.
arXiv Detail & Related papers (2024-04-27T00:12:13Z)
Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference [95.42299246592756]
We study the UNet encoder and empirically analyze the encoder features. We find that encoder features change minimally, whereas the decoder features exhibit substantial variations across different time-steps. We validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation.
arXiv Detail & Related papers (2023-12-15T08:46:43Z)
StreamFlow: Streamlined Multi-Frame Optical Flow Estimation for Video Sequences [31.210626775505407]
Occlusions between consecutive frames have long posed a significant challenge in optical flow estimation. We present a Streamlined In-batch Multi-frame (SIM) pipeline tailored to video input, attaining a similar level of time efficiency to two-frame networks. StreamFlow not only excels in terms of performance on challenging KITTI and Sintel datasets, with particular improvement in occluded areas.
arXiv Detail & Related papers (2023-11-28T07:53:51Z)
DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and Highlight Detection [38.12212015133935]
A novel framework, DiffusionVMR, is proposed to redefine the two tasks as a unified conditional denoising generation process. Experiments conducted on five widely-used benchmarks demonstrate the effectiveness and flexibility of the proposed DiffusionVMR.
arXiv Detail & Related papers (2023-08-29T08:20:23Z)
VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation [88.49030739715701]
This work presents a decomposed diffusion process via resolving the per-frame noise into a base noise that is shared among all frames and a residual noise that varies along the time axis. Experiments on various datasets confirm that our approach, termed as VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
arXiv Detail & Related papers (2023-03-15T02:16:39Z)
Q-Diffusion: Quantizing Diffusion Models [52.978047249670276]
Post-training quantization (PTQ) is considered a go-to compression method for other tasks. We propose a novel PTQ method specifically tailored towards the unique multi-timestep pipeline and model architecture. We show that our proposed method is able to quantize full-precision unconditional diffusion models into 4-bit while maintaining comparable performance.
arXiv Detail & Related papers (2023-02-08T19:38:59Z)
Real-time Streaming Video Denoising with Bidirectional Buffers [48.57108807146537]
Real-time denoising algorithms are typically adopted on the user device to remove the noise involved during the shooting and transmission of video streams. Recent multi-output inference works propagate the bidirectional temporal feature with a parallel or recurrent framework. We propose a Bidirectional Streaming Video Denoising framework, to achieve high-fidelity real-time denoising for streaming videos with both past and future temporal receptive fields.
arXiv Detail & Related papers (2022-07-14T14:01:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.