A Hybrid Model for Weakly-Supervised Speech Dereverberation
- URL: http://arxiv.org/abs/2502.06839v1
- Date: Thu, 06 Feb 2025 09:21:22 GMT
- Title: A Hybrid Model for Weakly-Supervised Speech Dereverberation
- Authors: Louis Bahrman, Mathieu Fontaine, Gael Richard,
- Abstract summary: This paper introduces a new training strategy to improve speech dereverberation systems using minimal acoustic information and reverberant (wet) speech.
Experimental results demonstrate that our method achieves more consistent performance across various objective metrics used in speech dereverberation than the state-of-the-art.
- Score: 2.731944614640173
- License:
- Abstract: This paper introduces a new training strategy to improve speech dereverberation systems using minimal acoustic information and reverberant (wet) speech. Most existing algorithms rely on paired dry/wet data, which is difficult to obtain, or on target metrics that may not adequately capture reverberation characteristics and can lead to poor results on non-target metrics. Our approach uses limited acoustic information, like the reverberation time (RT60), to train a dereverberation system. The system's output is resynthesized using a generated room impulse response and compared with the original reverberant speech, providing a novel reverberation matching loss replacing the standard target metrics. During inference, only the trained dereverberation model is used. Experimental results demonstrate that our method achieves more consistent performance across various objective metrics used in speech dereverberation than the state-of-the-art.
Related papers
- Unsupervised Blind Joint Dereverberation and Room Acoustics Estimation with Diffusion Models [21.669363620480333]
We present an unsupervised method for blind dereverberation and room impulse response estimation, called BUDDy.
In a blind scenario where the room impulse response is unknown, BUDDy successfully performs speech dereverberation.
Unlike supervised methods, which often struggle to generalize, BUDDy seamlessly adapts to different acoustic conditions.
arXiv Detail & Related papers (2024-08-14T11:31:32Z) - Diffusion-based speech enhancement with a weighted generative-supervised
learning loss [0.0]
Diffusion-based generative models have recently gained attention in speech enhancement (SE)
We propose augmenting the original diffusion training objective with a mean squared error (MSE) loss, measuring the discrepancy between estimated enhanced speech and ground-truth clean speech.
arXiv Detail & Related papers (2023-09-19T09:13:35Z) - Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models [95.97506031821217]
We present a novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training.
The method requires a short (3 seconds) sample from the target person, and generation is steered at inference time, without any training steps.
arXiv Detail & Related papers (2022-06-05T19:45:29Z) - On monoaural speech enhancement for automatic recognition of real noisy
speech using mixture invariant training [33.79711018198589]
We extend the existing mixture invariant training criterion to exploit both unpaired clean speech and real noisy data.
It is found that the unpaired clean speech is crucial to improve quality of separated speech from real noisy speech.
The proposed method also performs remixing of processed and unprocessed signals to alleviate the processing artifacts.
arXiv Detail & Related papers (2022-05-03T19:37:58Z) - Discretization and Re-synthesis: an alternative method to solve the
Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
By utilizing the synthesis model with the input of discrete symbols, after the prediction of discrete symbol sequence, each target speech could be re-synthesized.
arXiv Detail & Related papers (2021-12-17T08:35:40Z) - Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered in this work are self-normalized and there is no need to further conduct a correction step.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
arXiv Detail & Related papers (2021-11-11T16:57:53Z) - Improving Noise Robustness of Contrastive Speech Representation Learning
with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z) - Test-Time Adaptation Toward Personalized Speech Enhancement: Zero-Shot
Learning with Knowledge Distillation [26.39206098000297]
We propose a novel personalized speech enhancement method to adapt a compact denoising model to the test-time specificity.
Our goal in this test-time adaptation is to utilize no clean speech target of the test speaker.
Instead of the missing clean utterance target, we distill the more advanced denoising results from an overly large teacher model.
arXiv Detail & Related papers (2021-05-08T00:42:03Z) - Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z) - Statistical Context-Dependent Units Boundary Correction for Corpus-based
Unit-Selection Text-to-Speech [1.4337588659482519]
We present an innovative technique for speaker adaptation in order to improve the accuracy of segmentation with application to unit-selection Text-To-Speech (TTS) systems.
Unlike conventional techniques for speaker adaptation, we aim to use only context dependent characteristics extrapolated with linguistic analysis techniques.
arXiv Detail & Related papers (2020-03-05T12:42:13Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End
Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.