Frequency-Weighted Training Losses for Phoneme-Level DNN-based Speech Enhancement
- URL: http://arxiv.org/abs/2506.18714v1
- Date: Mon, 23 Jun 2025 14:52:34 GMT
- Title: Frequency-Weighted Training Losses for Phoneme-Level DNN-based Speech Enhancement
- Authors: Nasser-Eddine Monir, Paul Magron, Romain Serizel
- Abstract summary: We propose perceptually-informed variants of the SDR loss, formulated in the time-frequency domain and modulated by frequency-dependent weighting schemes. We train the FaSNet multichannel speech enhancement model using these various losses. Experimental results show that while standard metrics such as the SDR are only marginally improved, their perceptual frequency-weighted counterparts exhibit a more substantial improvement.
- Score: 15.332506773218315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in deep learning have significantly improved multichannel speech enhancement algorithms, yet conventional training loss functions such as the scale-invariant signal-to-distortion ratio (SDR) may fail to preserve fine-grained spectral cues essential for phoneme intelligibility. In this work, we propose perceptually-informed variants of the SDR loss, formulated in the time-frequency domain and modulated by frequency-dependent weighting schemes. These weights are designed to emphasize time-frequency regions where speech is prominent or where the interfering noise is particularly strong. We investigate both fixed and adaptive strategies, including ANSI band-importance weights, spectral magnitude-based weighting, and dynamic weighting based on the relative amount of speech and noise. We train the FaSNet multichannel speech enhancement model using these various losses. Experimental results show that while standard metrics such as the SDR are only marginally improved, their perceptual frequency-weighted counterparts exhibit a more substantial improvement. Besides, spectral and phoneme-level analysis indicates better consonant reconstruction, which points to a better preservation of certain acoustic cues.
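The abstract describes SDR-style losses computed in the time-frequency domain and modulated by per-frequency weights (fixed ANSI band-importance weights, or weights adapted to the speech/noise balance). A minimal numpy sketch of one such frequency-weighted SDR loss is shown below; the function name, the Hann-windowed STFT helper, and the exact weighting formulation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def _stft(x, nperseg=512, hop=256):
    """Simple Hann-windowed STFT; returns a (freq_bins, n_frames) complex array."""
    win = np.hanning(nperseg)
    frames = [win * x[i:i + nperseg] for i in range(0, len(x) - nperseg + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=1).T

def freq_weighted_sdr_loss(clean, estimate, weights, nperseg=512, eps=1e-8):
    """Negative frequency-weighted SDR computed in the time-frequency domain.

    `weights` is a per-frequency-bin importance vector (e.g. ANSI
    band-importance weights interpolated onto the STFT grid). This is an
    illustrative sketch; the losses in the paper may be formulated differently.
    """
    S = _stft(clean, nperseg)          # reference spectrogram
    S_hat = _stft(estimate, nperseg)   # enhanced-signal spectrogram
    sig = np.abs(S) ** 2               # per-bin signal power
    err = np.abs(S - S_hat) ** 2       # per-bin error power
    w = weights[:, None]               # broadcast bin weights over frames
    sdr = 10 * np.log10((w * sig).sum() / ((w * err).sum() + eps) + eps)
    return -sdr  # negative, so minimizing the loss maximizes weighted SDR
```

With uniform weights this reduces to an ordinary T-F-domain SDR; emphasizing consonant-heavy high-frequency bins is where the perceptual weighting would come in.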
Related papers
- FreqMoE: Dynamic Frequency Enhancement for Neural PDE Solvers [33.5401363681771]
We propose FreqMoE, an efficient and progressive training framework that exploits the dependency of high-frequency signals on low-frequency components. Experiments on both regular and irregular grid PDEs demonstrate that FreqMoE achieves up to 16.6% accuracy improvement.
arXiv Detail & Related papers (2025-05-11T06:06:32Z)
- Revisiting Acoustic Features for Robust ASR [25.687120601256787]
We revisit the approach of earlier works that developed acoustic features inspired by biological auditory perception.
We propose two new acoustic features called frequency masked spectrogram (FreqMask) and difference of gammatones spectrogram (DoGSpec) to simulate the neuro-psychological phenomena of frequency masking and lateral suppression.
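To make the idea of a frequency-masked spectrogram concrete, here is a crude numpy simulation of simultaneous frequency masking: a strong spectral component suppresses weaker neighbors whose level falls below a threshold that decays with distance in bins. The function, its parameters, and the linear-decay masking curve are illustrative assumptions, not the FreqMask feature defined in the cited paper.

```python
import numpy as np

def frequency_mask(spec_db, spread_db_per_bin=1.0, floor_db=-80.0):
    """Toy simulation of simultaneous frequency masking on a dB spectrogram.

    For each time frame, every bin casts a masking threshold onto its
    neighbors that falls off linearly with distance; a component whose level
    lies below the strongest threshold reaching it is set to the floor.
    """
    n_bins, n_frames = spec_db.shape
    masked = spec_db.copy()
    for t in range(n_frames):
        col = spec_db[:, t]
        for f in range(n_bins):
            dist = np.abs(np.arange(n_bins) - f)      # distance in bins
            thresh = col - spread_db_per_bin * dist   # threshold from each masker
            thresh[f] = -np.inf                       # a bin cannot mask itself
            if col[f] < thresh.max():
                masked[f, t] = floor_db               # inaudible: mask it out
    return masked
```

A weak component right next to a strong peak is removed, while an equally weak component far away in frequency survives, mimicking the psychoacoustic effect.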
arXiv Detail & Related papers (2024-09-24T18:58:23Z)
- Improved Noise Schedule for Diffusion Training [51.849746576387375]
We propose a novel approach to design the noise schedule for enhancing the training of diffusion models. We empirically demonstrate the superiority of our noise schedule over the standard cosine schedule.
arXiv Detail & Related papers (2024-07-03T17:34:55Z)
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER metric on CN-Celeb evaluation set.
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
- Hyperspectral Image Denoising via Self-Modulating Convolutional Neural Networks [15.700048595212051]
We introduce a self-modulating convolutional neural network which utilizes correlated spectral and spatial information.
At the core of the model lies a novel block, which allows the network to transform the features in an adaptive manner based on the adjacent spectral data.
Experimental analysis on both synthetic and real data shows that the proposed SM-CNN outperforms other state-of-the-art HSI denoising methods.
arXiv Detail & Related papers (2023-09-15T06:57:43Z)
- Incremental Spatial and Spectral Learning of Neural Operators for Solving Large-Scale PDEs [86.35471039808023]
We introduce the Incremental Fourier Neural Operator (iFNO), which progressively increases the number of frequency modes used by the model.
We show that iFNO reduces total training time while maintaining or improving generalization performance across various datasets.
Our method demonstrates a 10% lower testing error, using 20% fewer frequency modes compared to the existing Fourier Neural Operator, while also achieving a 30% faster training.
arXiv Detail & Related papers (2022-11-28T09:57:15Z)
- Speech Enhancement with Perceptually-motivated Optimization and Dual Transformations [5.4878772986187565]
We propose a sub-band based speech enhancement system with perceptually-motivated optimization and dual transformations, called PT-FSE.
Our proposed model not only achieves substantial improvements over its backbone, but also outperforms the current state-of-the-art while being 27% smaller.
With an average NB-PESQ of 3.57 on the benchmark dataset, our system offers the best speech enhancement results reported to date.
arXiv Detail & Related papers (2022-09-24T02:33:40Z)
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
- Time-domain Speech Enhancement with Generative Adversarial Learning [53.74228907273269]
This paper proposes a new framework called Time-domain Speech Enhancement Generative Adversarial Network (TSEGAN).
TSEGAN is an extension of the generative adversarial network (GAN) in time-domain with metric evaluation to mitigate the scaling problem.
In addition, we provide a new method based on objective function mapping for the theoretical analysis of the performance of Metric GAN.
arXiv Detail & Related papers (2021-03-30T08:09:49Z)
- Improving Stability of LS-GANs for Audio and Speech Signals [70.15099665710336]
We show that encoding a measure of departure from normality into the generator's optimization formulation helps craft more comprehensive spectrograms.
We demonstrate the effectiveness of binding this metric for enhancing stability in training with less mode collapse compared to baseline GANs.
arXiv Detail & Related papers (2020-08-12T17:41:25Z)
- End-to-End Multi-Task Denoising for joint SDR and PESQ Optimization [43.15288441772729]
Denoising networks learn a direct mapping from noisy speech to clean speech.
Existing schemes suffer from one of two critical issues: spectrum mismatch or metric mismatch.
This paper presents a new end-to-end denoising framework with the goal of joint SDR and PESQ optimization.
arXiv Detail & Related papers (2019-01-26T02:48:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.