End-to-End Multi-Task Denoising for joint SDR and PESQ Optimization
- URL: http://arxiv.org/abs/1901.09146v4
- Date: Wed, 8 Mar 2023 23:46:09 GMT
- Title: End-to-End Multi-Task Denoising for joint SDR and PESQ Optimization
- Authors: Jaeyoung Kim, Mostafa El-Khamy and Jungwon Lee
- Abstract summary: Denoising networks learn mapping from noisy speech to clean one directly.
Existing schemes have either of two critical issues: spectrum and metric mismatches.
This paper presents a new end-to-end denoising framework with the goal of joint SDR and PESQ optimization.
- Score: 43.15288441772729
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Supervised learning based on a deep neural network recently has achieved
substantial improvement on speech enhancement. Denoising networks learn mapping
from noisy speech to clean one directly, or to a spectrum mask which is the
ratio between clean and noisy spectra. In either case, the network is optimized
by minimizing mean square error (MSE) between ground-truth labels and
time-domain or spectrum output. However, existing schemes have either of two
critical issues: spectrum and metric mismatches. The spectrum mismatch is a
well known issue that any spectrum modification after short-time Fourier
transform (STFT), in general, cannot be fully recovered after inverse
short-time Fourier transform (ISTFT). The metric mismatch is that a
conventional MSE metric is sub-optimal to maximize our target metrics,
signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality
(PESQ). This paper presents a new end-to-end denoising framework with the goal
of joint SDR and PESQ optimization. First, the network optimization is
performed on the time-domain signals after ISTFT to avoid spectrum mismatch.
Second, two loss functions which have improved correlations with SDR and PESQ
metrics are proposed to minimize metric mismatch. The experimental result
showed that the proposed denoising scheme significantly improved both SDR and
PESQ performance over the existing methods.
Related papers
- Fourier Amplitude and Correlation Loss: Beyond Using L2 Loss for Skillful Precipitation Nowcasting [11.931403313504754]
We propose a new Fourier Amplitude and Correlation Loss (FACL) which consists of two novel loss terms.
The two loss terms work together to replace the traditional $L$ losses such as MSE and weighted MSE for thetemporal prediction problem.
Our method improves perceptual metrics and meteorology skill scores, with a small trade-off to pixel-wise accuracy and structural similarity.
arXiv Detail & Related papers (2024-10-30T16:12:56Z) - Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising [15.152748065111194]
This paper describes speech enhancement for realtime automatic speech recognition in real environments.
It estimates the masks of clean dry speech from a noisy echoic mixture spectrogram with a deep neural network (DNN) and then computes a enhancement filter used for beamforming.
The performance of such a supervised approach, however, is drastically degraded under mismatched conditions.
arXiv Detail & Related papers (2024-10-30T08:32:47Z) - UDHF2-Net: Uncertainty-diffusion-model-based High-Frequency TransFormer Network for Remotely Sensed Imagery Interpretation [12.24506241611653]
Uncertainty-diffusion-model-based high-Frequency TransFormer network (UDHF2-Net) is the first to be proposed.
UDHF2-Net is a spatially-stationary-and-non-stationary high-frequency connection paradigm (SHCP)
Mask-and-geo-knowledge-based uncertainty diffusion module (MUDM) is a self-supervised learning strategy.
A frequency-wise semi-pseudo-Siamese UDHF2-Net is the first to be proposed to balance accuracy and complexity for change detection.
arXiv Detail & Related papers (2024-06-23T15:03:35Z) - Optimization Guarantees of Unfolded ISTA and ADMM Networks With Smooth
Soft-Thresholding [57.71603937699949]
We study optimization guarantees, i.e., achieving near-zero training loss with the increase in the number of learning epochs.
We show that the threshold on the number of training samples increases with the increase in the network width.
arXiv Detail & Related papers (2023-09-12T13:03:47Z) - Iterative Adaptive Spectroscopy of Short Signals [0.1338174941551702]
We develop an adaptive frequency sensing protocol based on Ramsey interferometry.
High precision is achieved by enhancing the Ramsey sequence to prepare with high fidelity both the sensing and readout state.
arXiv Detail & Related papers (2022-04-10T18:07:50Z) - A neural network-supported two-stage algorithm for lightweight
dereverberation on hearing devices [13.49645012479288]
A two-stage lightweight online dereverberation algorithm for hearing devices is presented in this paper.
The approach combines a multi-channel multi-frame linear filter with a single-channel single-frame post-filter.
Both components rely on power spectral density (PSD) estimates provided by deep neural networks (DNNs)
arXiv Detail & Related papers (2022-04-06T11:08:28Z) - Single-channel speech separation using Soft-minimum Permutation
Invariant Training [60.99112031408449]
A long-lasting problem in supervised speech separation is finding the correct label for each separated speech signal.
Permutation Invariant Training (PIT) has been shown to be a promising solution in handling the label ambiguity problem.
In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment.
arXiv Detail & Related papers (2021-11-16T17:25:05Z) - Improving Stability of LS-GANs for Audio and Speech Signals [70.15099665710336]
We show that encoding departure from normality computed in this vector space into the generator optimization formulation helps to craft more comprehensive spectrograms.
We demonstrate the effectiveness of binding this metric for enhancing stability in training with less mode collapse compared to baseline GANs.
arXiv Detail & Related papers (2020-08-12T17:41:25Z) - Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the lipreading sentence 2 dataset respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z) - Rethinking Differentiable Search for Mixed-Precision Neural Networks [83.55785779504868]
Low-precision networks with weights and activations quantized to low bit-width are widely used to accelerate inference on edge devices.
Current solutions are uniform, using identical bit-width for all filters.
This fails to account for the different sensitivities of different filters and is suboptimal.
Mixed-precision networks address this problem, by tuning the bit-width to individual filter requirements.
arXiv Detail & Related papers (2020-04-13T07:02:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.