Time-domain Speech Enhancement with Generative Adversarial Learning
- URL: http://arxiv.org/abs/2103.16149v1
- Date: Tue, 30 Mar 2021 08:09:49 GMT
- Title: Time-domain Speech Enhancement with Generative Adversarial Learning
- Authors: Feiyang Xiao, Jian Guan, Qiuqiang Kong, Wenwu Wang
- Abstract summary: This paper proposes a new framework called Time-domain Speech Enhancement Generative Adversarial Network (TSEGAN).
TSEGAN is an extension of the generative adversarial network (GAN) in the time domain, with metric evaluation to mitigate the scaling problem.
In addition, we provide a new method based on objective function mapping for the theoretical analysis of the performance of Metric GAN.
- Score: 53.74228907273269
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech enhancement aims to obtain speech signals with high intelligibility
and quality from noisy speech. Recent work has demonstrated the excellent
performance of time-domain deep learning methods, such as Conv-TasNet. However,
these methods can be degraded by the arbitrary scales of the waveform induced
by the scale-invariant signal-to-noise ratio (SI-SNR) loss. This paper proposes
a new framework called Time-domain Speech Enhancement Generative Adversarial
Network (TSEGAN), which is an extension of the generative adversarial network
(GAN) in the time domain with metric evaluation to mitigate the scaling problem
and stabilize model training, thus improving performance.
In addition, we provide a new method based on objective function mapping for
the theoretical analysis of the performance of Metric GAN, and explain why it
is better than the Wasserstein GAN. Experiments demonstrate the
effectiveness of our proposed method, and illustrate the advantage of Metric
GAN.
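As a minimal illustration of the scaling issue described in the abstract, the sketch below computes the SI-SNR objective in PyTorch. It is an illustrative re-implementation, not the authors' TSEGAN code, and the surrogate waveforms are assumptions for the example. Because the estimate is projected onto the clean target before the energy ratio is taken, rescaling the enhanced waveform leaves the loss essentially unchanged, so nothing constrains the output scale during training.

```python
# Minimal SI-SNR sketch (PyTorch); illustrative only, not the TSEGAN implementation.
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR (in dB) between 1-D estimated and clean waveforms."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target; the projection absorbs any scale factor.
    s_target = torch.dot(estimate, target) / (torch.dot(target, target) + eps) * target
    e_noise = estimate - s_target
    return 10.0 * torch.log10(s_target.pow(2).sum() / (e_noise.pow(2).sum() + eps))

clean = torch.randn(16000)                           # surrogate clean waveform
enhanced = 0.8 * clean + 0.05 * torch.randn(16000)   # surrogate enhanced waveform

print(si_snr(enhanced, clean))         # roughly 24 dB for this toy example
print(si_snr(10.0 * enhanced, clean))  # essentially identical: the output scale is not penalised
```

This scale invariance is the property that TSEGAN's metric-evaluation scheme is intended to mitigate.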
Related papers
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z) - Histogram Layer Time Delay Neural Networks for Passive Sonar
Classification [58.720142291102135]
A novel method combines a time delay neural network and a histogram layer to incorporate statistical contexts for improved feature learning and underwater acoustic target classification.
The proposed method outperforms the baseline model, demonstrating the utility of incorporating statistical contexts for passive sonar target recognition.
arXiv Detail & Related papers (2023-07-25T19:47:26Z) - SCP-GAN: Self-Correcting Discriminator Optimization for Training
Consistency Preserving Metric GAN on Speech Enhancement Tasks [28.261911789087463]
We introduce several improvements to the GAN training schemes, which can be applied to most GAN-based SE models.
We present self-correcting optimization for training a GAN discriminator on SE tasks, which helps avoid "harmful" training directions.
We have tested our proposed methods on several state-of-the-art GAN-based SE models and obtained consistent improvements.
arXiv Detail & Related papers (2022-10-26T04:48:40Z) - Speech Enhancement with Perceptually-motivated Optimization and Dual
Transformations [5.4878772986187565]
We propose a sub-band based speech enhancement system with perceptually-motivated optimization and dual transformations, called PT-FSE.
Our proposed model not only achieves substantial improvements over its backbone, but also outperforms the current state-of-the-art while being 27% smaller.
With an average NB-PESQ of 3.57 on the benchmark dataset, our system offers the best speech enhancement results reported to date.
arXiv Detail & Related papers (2022-09-24T02:33:40Z) - Speech Enhancement with Score-Based Generative Models in the Complex
STFT Domain [18.090665052145653]
We propose a novel training task for speech enhancement using a complex-valued deep neural network.
We derive this training task within the formalism of differential equations, thereby enabling the use of predictor-corrector samplers.
arXiv Detail & Related papers (2022-03-31T12:53:47Z) - A Novel Speech Intelligibility Enhancement Model based on
Canonical Correlation and Deep Learning [12.913738983870621]
We present a canonical correlation based short-time objective intelligibility (CC-STOI) cost function to train a fully convolutional neural network (FCN) model.
We show that our CC-STOI based speech enhancement framework outperforms state-of-the-art DL models trained with conventional distance-based and STOI-based loss functions.
arXiv Detail & Related papers (2022-02-11T16:48:41Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether a parameter's past update direction is aligned with the direction of its current gradient (a generic sketch of this idea appears after this list).
Our method outperforms previous adaptive learning-rate algorithms in terms of training speed and test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - Characterizing Speech Adversarial Examples Using Self-Attention U-Net
Enhancement [102.48582597586233]
We present a U-Net-based attention model, U-Net$_{At}$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z) - Improving noise robust automatic speech recognition with single-channel
time-domain enhancement network [100.1041336974175]
We show that a single-channel time-domain denoising approach can significantly improve ASR performance.
We show that single-channel noise reduction can still improve ASR performance.
arXiv Detail & Related papers (2020-03-09T09:36:31Z) - Single Channel Speech Enhancement Using Temporal Convolutional Recurrent
Neural Networks [23.88788382262305]
The temporal convolutional recurrent network (TCRN) is an end-to-end model that directly maps a noisy waveform to a clean waveform.
We show that our model improves performance compared with existing convolutional recurrent networks.
arXiv Detail & Related papers (2020-02-02T04:26:50Z)
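As flagged in the AdaRem entry above, the following is a generic sketch of a sign-agreement, per-parameter learning-rate rule of the kind that entry describes. The class name, hyperparameters, and clamping range are assumptions for illustration; this is not the paper's exact AdaRem update.

```python
# Generic sign-agreement learning-rate sketch; illustrative only, not the exact AdaRem algorithm.
import torch

class SignAgreementSGD:
    def __init__(self, params, lr=1e-2, beta=0.9, up=1.1, down=0.9):
        self.params = list(params)
        self.lr, self.beta, self.up, self.down = lr, beta, up, down
        # Per-parameter state: an average of past update directions and a learning-rate scale.
        self.momentum = [torch.zeros_like(p) for p in self.params]
        self.scale = [torch.ones_like(p) for p in self.params]

    @torch.no_grad()
    def step(self):
        for p, m, s in zip(self.params, self.momentum, self.scale):
            if p.grad is None:
                continue
            g = p.grad
            # Grow the rate of parameters whose current gradient agrees with their
            # accumulated past direction; shrink it where the two disagree.
            agree = (torch.sign(m) * torch.sign(g)) >= 0
            s.mul_(torch.where(agree, torch.full_like(s, self.up),
                               torch.full_like(s, self.down))).clamp_(0.1, 10.0)
            m.mul_(self.beta).add_(g, alpha=1.0 - self.beta)  # refresh the past direction
            p.add_(-self.lr * s * g)                          # parameter-wise update

# Assumed toy usage:
model = torch.nn.Linear(4, 1)
opt = SignAgreementSGD(model.parameters())
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
opt.step()
```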
This list is automatically generated from the titles and abstracts of the papers on this site.