Related papers: Differentiable Time-Varying IIR Filtering for Real-Time Speech Denoising

URL: http://arxiv.org/abs/2603.02794v1
Date: Tue, 03 Mar 2026 09:31:36 GMT
Title: Differentiable Time-Varying IIR Filtering for Real-Time Speech Denoising
Authors: Riccardo Rota, Kiril Ratmanski, Jozef Coldenhoff, Milos Cernak
Abstract summary: We present TVF (Time-Varying Filtering), a low-latency speech enhancement model with 1 million parameters. TVF bridges the gap between traditional filtering and modern neural speech modeling. The model utilizes a lightweight neural network backbone to predict the coefficients of a differentiable 35-band IIR filter cascade in real time.
Score: 12.191881845807082
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present TVF (Time-Varying Filtering), a low-latency speech enhancement model with 1 million parameters. Combining the interpretability of Digital Signal Processing (DSP) with the adaptability of deep learning, TVF bridges the gap between traditional filtering and modern neural speech modeling. The model utilizes a lightweight neural network backbone to predict the coefficients of a differentiable 35-band IIR filter cascade in real time, allowing it to adapt dynamically to non-stationary noise. Unlike "black-box" deep learning approaches, TVF offers a completely interpretable processing chain, where spectral modifications are explicit and adjustable. We demonstrate the efficacy of this approach on a speech denoising task using the Valentini-Botinhao dataset and compare the results to a static DDSP approach and a fully deep-learning-based solution, showing that TVF achieves effective adaptation to changing noise conditions.
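
Illustrative sketch: the abstract's central idea, per-frame coefficients driving a cascade of IIR sections, can be shown in a few lines of NumPy. This is not the authors' implementation. The RBJ-cookbook peaking-EQ design, the three bands (the paper uses 35), the 10 ms frame length, the fixed Q, and the names `peaking_biquad` and `time_varying_cascade` are all assumptions made for illustration, and a hand-written gain array stands in for the output of the neural network backbone.

```python
import numpy as np

def peaking_biquad(f0, gain_db, q, fs):
    """Peaking-EQ biquad (RBJ cookbook form), normalized so a[0] == 1.
    Assumed filter type; the paper's exact band design is not given here."""
    A = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([1.0 + alpha * A, -2.0 * np.cos(w0), 1.0 - alpha * A])
    a = np.array([1.0 + alpha / A, -2.0 * np.cos(w0), 1.0 - alpha / A])
    return b / a[0], a / a[0]

def time_varying_cascade(x, gains_db, centers_hz, fs, frame_len, q=1.0):
    """Run x through one biquad per band, updating each band's gain once
    per frame. gains_db has shape (n_frames, n_bands). Filter state is
    carried across frame boundaries; samples beyond the last frame are
    left unchanged."""
    n_frames, n_bands = gains_db.shape
    y = np.asarray(x, dtype=float).copy()
    for k in range(n_bands):
        z1 = z2 = 0.0  # transposed direct form II state for band k
        for m in range(n_frames):
            b, a = peaking_biquad(centers_hz[k], gains_db[m, k], q, fs)
            start = m * frame_len
            stop = min(start + frame_len, len(y))
            for n in range(start, stop):
                xn = y[n]
                yn = b[0] * xn + z1
                z1 = b[1] * xn - a[1] * yn + z2
                z2 = b[2] * xn - a[2] * yn
                y[n] = yn
    return y

# Hypothetical usage: attenuate the 1 kHz band by 12 dB in every frame.
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2.0 * np.pi * 1000.0 * t)
centers = np.array([250.0, 1000.0, 4000.0])
gains = np.zeros((fs // 160, 3))  # 100 frames of 10 ms each
gains[:, 1] = -12.0               # stand-in for network-predicted gains
out = time_varying_cascade(tone, gains, centers, fs, frame_len=160)
```

Because every operation here is a simple arithmetic recursion, the same structure could in principle be written in an autodiff framework so that gradients flow back to whatever produces `gains`. The explicit per-band gains are also what makes such a processing chain inspectable.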

Related papers

Self-Supervised Learning via Flow-Guided Neural Operator on Time-Series Data [57.85958428020496]
Flow-Guided Neural Operator (FGNO) is a novel framework combining operator learning with flow matching for SSL training. FGNO learns mappings in functional spaces by using Short-Time Fourier Transform to unify different time resolutions. Unlike prior generative SSL methods that use noisy inputs during inference, we propose using clean inputs for representation extraction while learning representations with noise.
arXiv Detail & Related papers (2026-02-12T18:54:57Z)
Differentiable Attenuation Filters for Feedback Delay Networks [3.8530395083350615]
We introduce a novel method for designing attenuation filters in digital audio reverberation systems based on Feedback Delay Networks (FDNs). Our approach uses Second Order Sections (SOS) of Infinite Impulse Response (IIR) filters arranged as parametric equalizers (PEQ). Our method delivers a flexible and differentiable design, achieving state-of-the-art performance while significantly reducing computational cost.
arXiv Detail & Related papers (2025-11-25T15:01:55Z)
Learnable Total Variation with Lambda Mapping for Low-Dose CT Denoising [0.0]
Learnable Total Variation (LTV) couples an unrolled TV solver with a data-driven Lambda Mapping Network (LambdaNet) predicting a per-pixel regularization map. LTV provides an interpretable alternative to black-box CNNs and a basis for 3D and data-consistency-driven reconstruction.
arXiv Detail & Related papers (2025-11-13T17:05:36Z)
BADiff: Bandwidth Adaptive Diffusion Model [55.10134744772338]
Traditional diffusion models produce high-fidelity images by performing a fixed number of denoising steps, regardless of downstream transmission limitations. In practical cloud-to-device scenarios, limited bandwidth often necessitates heavy compression, leading to loss of fine textures and wasted computation. We introduce a joint end-to-end training strategy where the diffusion model is conditioned on a target quality level derived from the available bandwidth.
arXiv Detail & Related papers (2025-10-24T11:50:03Z)
Resampling Filter Design for Multirate Neural Audio Effect Processing [9.149661171430257]
We show that a two-stage design consisting of a half-band IIR filter cascaded with a Kaiser FIR window filter can give results similar to or better than the previously proposed model adjustment method. We investigate amplifiers and decimation filters for the task of integer oversampling and show that cascaded half-band IIR and FIR designs can be used in conjunction with the model adjustment method.
arXiv Detail & Related papers (2025-01-30T16:44:49Z)
VRVVC: Variable-Rate NeRF-Based Volumetric Video Compression [59.14355576912495]
NeRF-based video has revolutionized visual media by delivering photorealistic Free-Viewpoint Video (FVV) experiences. The substantial data volumes pose significant challenges for storage and transmission. We propose VRVVC, a novel end-to-end joint variable-rate framework for video compression.
arXiv Detail & Related papers (2024-12-16T01:28:04Z)
Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising [15.152748065111194]
This paper describes speech enhancement for realtime automatic speech recognition in real environments. It estimates the masks of clean dry speech from a noisy echoic mixture spectrogram with a deep neural network (DNN) and then computes an enhancement filter used for beamforming. The performance of such a supervised approach, however, is drastically degraded under mismatched conditions.
arXiv Detail & Related papers (2024-10-30T08:32:47Z)
Multi-stage image denoising with the wavelet transform [125.2251438120701]
Deep convolutional neural networks (CNNs) are used for image denoising via automatically mining accurate structure information. We propose a multi-stage image denoising CNN with the wavelet transform (MWDCNN) via three stages, i.e., a dynamic convolutional block (DCB), two cascaded wavelet transform and enhancement blocks (WEBs), and a residual block (RB).
arXiv Detail & Related papers (2022-09-26T03:28:23Z)
Direction-Aware Adaptive Online Neural Speech Enhancement with an Augmented Reality Headset in Real Noisy Conversational Environments [21.493664174262737]
This paper describes the practical response- and performance-aware development of online speech enhancement for an augmented reality (AR) headset. It helps a user understand conversations made in real noisy echoic environments (e.g., cocktail party). The method is used with a blind dereverberation method called weighted prediction error (WPE) for transcribing the noisy reverberant speech of a speaker.
arXiv Detail & Related papers (2022-07-15T05:14:27Z)
SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations [67.18006078950337]
We use the global context information to enhance important channels and recalibrate salient time-frequency locations. The proposed modules, together with a popular ResNet based model, are evaluated on the VoxCeleb1 dataset.
arXiv Detail & Related papers (2020-09-02T01:07:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.