Diffusion Buffer for Online Generative Speech Enhancement
- URL: http://arxiv.org/abs/2510.18744v1
- Date: Tue, 21 Oct 2025 15:52:33 GMT
- Title: Diffusion Buffer for Online Generative Speech Enhancement
- Authors: Bunlong Lay, Rostislav Makarov, Simon Welker, Maris Hillemann, Timo Gerkmann
- Abstract summary: Diffusion Buffer is a generative diffusion-based Speech Enhancement model. It only requires one neural network call per incoming signal frame from a stream of data. It performs enhancement in an online fashion on a consumer-grade GPU.
- Score: 32.98694610706198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online Speech Enhancement has mainly been reserved for predictive models. A key advantage of these models is that, for an incoming signal frame from a stream of data, the model is called only once for enhancement. In contrast, generative Speech Enhancement models often require multiple calls, resulting in a computational complexity that is too high for many online speech enhancement applications. This work presents the Diffusion Buffer, a generative diffusion-based Speech Enhancement model that requires only one neural network call per incoming signal frame from a stream of data and performs enhancement in an online fashion on a consumer-grade GPU. The key idea of the Diffusion Buffer is to align physical time with diffusion time-steps. The approach progressively denoises frames through physical time, where past frames have had more noise removed. Consequently, an enhanced frame is output to the listener with a delay defined by the Diffusion Buffer, and the output frame has a corresponding look-ahead. In this work, we extend our previous work by carefully designing a 2D convolutional UNet architecture that specifically aligns with the Diffusion Buffer's look-ahead. We observe that the proposed UNet improves performance, particularly when the algorithmic latency is low. Moreover, we show that using a Data Prediction loss instead of a Denoising Score Matching loss enables flexible control over the trade-off between algorithmic latency and quality during inference. The extended Diffusion Buffer, equipped with a novel NN and loss function, drastically reduces the algorithmic latency from 320 - 960 ms to 32 - 176 ms while even improving performance. While it has been shown before that offline generative diffusion models outperform predictive approaches on unseen noisy speech data, we confirm that the online Diffusion Buffer also outperforms its predictive counterpart on unseen noisy speech data.
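The mechanism described in the abstract - each incoming frame enters a buffer at high noise, every network call advances all buffered frames by one reverse-diffusion step, and the oldest frame leaves fully enhanced - can be sketched in a few lines. The snippet below is a minimal illustration under simplifying assumptions (a plain variance-exploding noise schedule, a deterministic DDIM-style update, and a dummy `denoiser` standing in for the paper's 2D convolutional UNet); all names and hyper-parameters are placeholders, not the authors' implementation.

```python
import torch

# Illustrative settings (assumptions, not the paper's configuration)
BUFFER_SIZE = 8                    # frames held in the buffer -> algorithmic latency
NUM_STEPS = BUFFER_SIZE            # one reverse-diffusion step per buffer slot
sigmas = torch.linspace(0.0, 1.0, NUM_STEPS + 1)  # noise level per step (0 = clean)

def denoiser(estimates, observations, steps):
    """Placeholder for the trained UNet: given current estimates, the noisy
    observations and each frame's diffusion step, predict the clean frames
    (the data-prediction target). A dummy stands in here."""
    return observations

buffer = []  # (current_estimate, noisy_observation) per frame, oldest first

def process_frame(noisy_frame):
    """One network call per incoming frame. Returns an enhanced frame with a
    delay of BUFFER_SIZE frames, or None while the buffer is still filling."""
    # A new frame enters at the highest diffusion step (most noise).
    buffer.append((noisy_frame + sigmas[-1] * torch.randn_like(noisy_frame), noisy_frame))
    n = len(buffer)
    estimates = torch.stack([est for est, _ in buffer])
    observations = torch.stack([obs for _, obs in buffer])
    # Slot i (0 = oldest) currently sits at diffusion step NUM_STEPS - (n - 1 - i).
    steps = torch.tensor([NUM_STEPS - (n - 1 - i) for i in range(n)])
    clean_pred = denoiser(estimates, observations, steps)
    # Deterministic (DDIM-like) step toward the data prediction for every slot.
    for i in range(n):
        t = int(steps[i])
        new_est = clean_pred[i] + (sigmas[t - 1] / sigmas[t]) * (estimates[i] - clean_pred[i])
        buffer[i] = (new_est, buffer[i][1])
    if n == BUFFER_SIZE:           # oldest frame has reached step 0
        enhanced, _ = buffer.pop(0)
        return enhanced
    return None
```

Because the network is trained with a data-prediction target in this sketch, `clean_pred` is available for every slot at every call, so a younger slot could be read out early at reduced quality - one way to picture the latency/quality trade-off mentioned in the abstract.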
Related papers
- Causal Autoregressive Diffusion Language Model [70.7353007255797]
CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation.
arXiv Detail & Related papers (2026-01-29T17:38:29Z)
- Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding [36.74241893088594]
Diffusion Large Language Models (dLLMs) offer a compelling paradigm for natural language generation. Recent works have accelerated inference via KV cache reuse or decoding, but overlook the intrinsic inefficiencies within the block-wise diffusion process. We propose Streaming-dLLM, a training-free framework that streamlines inference across both spatial and temporal dimensions.
arXiv Detail & Related papers (2026-01-25T17:36:04Z)
- DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model [7.852008880859938]
DyStream is a flow matching-based autoregressive model that can generate video in real time from both speaker and listener audio. It can generate video within 34 ms per frame, guaranteeing that the entire system latency remains under 100 ms. It achieves state-of-the-art lip-sync quality, with offline and online LipSync Confidence scores of 8.13 and 7.61 on HDTF.
arXiv Detail & Related papers (2025-12-30T18:43:38Z)
- Real-Time Streamable Generative Speech Restoration with Flow Matching [35.33575179870606]
Stream.FM is a frame-causal flow-based generative model with an algorithmic latency of 32 milliseconds (ms) and a total latency of 48 ms. We show that high-quality streaming generative speech processing can be realized on consumer GPUs available today.
arXiv Detail & Related papers (2025-12-22T14:41:17Z)
- Lightning Fast Caching-based Parallel Denoising Prediction for Accelerating Talking Head Generation [50.04968365065964]
Diffusion-based talking head models generate high-quality, photorealistic videos but suffer from slow inference. We introduce Lightning-fast Caching-based Parallel denoising prediction (LightningCP). We also propose Decoupled Foreground Attention (DFA) to further accelerate attention computations.
arXiv Detail & Related papers (2025-08-25T02:58:39Z)
- READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns a highly compressed video latent space via a VAE, significantly reducing the token count to accelerate generation. We show that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime.
arXiv Detail & Related papers (2025-08-05T13:57:03Z)
- Diffusion Buffer: Online Diffusion-based Speech Enhancement with Sub-Second Latency [29.58683554898725]
We adapt a sliding window diffusion framework to the speech enhancement task. Our approach corrupts speech signals through time, assigning more noise to frames close to the present in a buffer. This marks the first practical diffusion-based solution for online speech enhancement.
arXiv Detail & Related papers (2025-06-03T14:14:28Z)
- FlashDLM: Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion [22.207275433870937]
Diffusion language models offer parallel token generation and inherent bidirectionality. State-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. We introduce Guided Diffusion, a training-free method that uses a lightweight pretrained autoregressive model to supervise token unmasking.
arXiv Detail & Related papers (2025-05-27T17:39:39Z)
- Diffusion-Driven Semantic Communication for Generative Models with Bandwidth Constraints [66.63250537475973]
This paper introduces a diffusion-driven semantic communication framework with advanced VAE-based compression for bandwidth-constrained generative models. Our experimental results demonstrate significant improvements in pixel-level metrics like peak signal-to-noise ratio (PSNR) and semantic metrics like learned perceptual image patch similarity (LPIPS).
arXiv Detail & Related papers (2024-07-26T02:34:25Z)
- StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation [52.56469577812338]
We introduce StreamDiffusion, a real-time diffusion pipeline for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. We present a novel approach that transforms the original sequential denoising into a batched denoising process.
arXiv Detail & Related papers (2023-12-19T18:18:33Z)
- Real-time Streaming Video Denoising with Bidirectional Buffers [48.57108807146537]
Real-time denoising algorithms are typically adopted on the user device to remove the noise introduced during the shooting and transmission of video streams.
Recent multi-output inference works propagate bidirectional temporal features with a parallel or recurrent framework.
We propose a Bidirectional Streaming Video Denoising framework, to achieve high-fidelity real-time denoising for streaming videos with both past and future temporal receptive fields.
arXiv Detail & Related papers (2022-07-14T14:01:03Z)