Speech Enhancement with Perceptually-motivated Optimization and Dual
Transformations
- URL: http://arxiv.org/abs/2209.11905v1
- Date: Sat, 24 Sep 2022 02:33:40 GMT
- Title: Speech Enhancement with Perceptually-motivated Optimization and Dual
Transformations
- Authors: Xucheng Wan, Kai Liu, Ziqing Du, Huan Zhou
- Abstract summary: We propose a sub-band based speech enhancement system with perceptually-motivated optimization and dual transformations, called PT-FSE.
Our proposed model not only achieves substantial improvements over its backbone but also outperforms the current state-of-the-art (SOTA) while being 27% smaller.
With an average NB-PESQ of 3.57 on the benchmark dataset, our system offers the best speech enhancement results reported to date.
- Score: 5.4878772986187565
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To address the monaural speech enhancement problem, numerous research studies
have been conducted to enhance speech via operations either in the time domain,
on an inner domain learned from the speech mixture, or in the time-frequency
domain, on fixed full-band short-time Fourier transform (STFT) spectrograms.
Very recently, a few studies on sub-band based speech enhancement have been
proposed. By enhancing speech via operations on sub-band spectrograms, those
studies demonstrated competitive performance on the benchmark DNS2020 dataset.
Although attractive, this new research direction has not been fully explored
and there is still room for improvement. As such, in this study, we delve into
this latest research direction and propose a sub-band based speech enhancement
system with perceptually-motivated optimization and dual transformations,
called PT-FSE. Specifically, our proposed PT-FSE model improves its backbone,
a full-band and sub-band fusion model, through three efforts. First, we design
a frequency transformation module that aims to strengthen global frequency
correlation. Then a temporal transformation is introduced to capture long-range
temporal contexts. Lastly, a novel loss, leveraging properties of human
auditory perception, is proposed to encourage the model to focus on
low-frequency enhancement. To validate the effectiveness of our proposed model,
extensive experiments are conducted on the DNS2020 dataset. Experimental
results show that our PT-FSE system not only achieves substantial improvements
over its backbone but also outperforms the current state-of-the-art (SOTA)
while being 27% smaller. With an average NB-PESQ of 3.57 on the benchmark
dataset, our system offers the best speech enhancement results reported to date.
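The abstract does not give the exact form of the perceptually-motivated loss, but its stated goal (focusing the model on low-frequency enhancement) can be illustrated with a minimal sketch: a spectral distance in which per-bin weights decay with frequency, so low-frequency mismatches dominate the objective. The STFT framing, the weighting function, and the `alpha` parameter below are all illustrative assumptions, not the paper's actual loss.

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=128):
    """Magnitude STFT via a simple Hann-windowed framed FFT."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    # Shape: (n_frames, n_fft // 2 + 1)
    return np.abs(np.fft.rfft(frames, axis=-1))

def low_freq_weighted_loss(clean, enhanced, n_fft=512, hop=128, alpha=2.0):
    """Weighted mean squared spectral error. The weight 1 / (1 + alpha * f)
    decays with normalized frequency f, so errors in low-frequency bins
    contribute more to the loss (hypothetical weighting, for illustration)."""
    s_clean = stft_mag(clean, n_fft, hop)
    s_enh = stft_mag(enhanced, n_fft, hop)
    freqs = np.arange(s_clean.shape[-1]) / (n_fft // 2)  # normalized to 0..1
    weights = 1.0 / (1.0 + alpha * freqs)
    return float(np.mean(weights * (s_clean - s_enh) ** 2))
```

Any monotonically decreasing weighting (e.g. one derived from equal-loudness contours) would serve the same purpose; the key design choice is that the optimizer is penalized more for residual error where human hearing is most sensitive.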
Related papers
- Towards Robust Transcription: Exploring Noise Injection Strategies for Training Data Augmentation [55.752737615873464]
This study investigates the impact of white noise at various Signal-to-Noise Ratio (SNR) levels on state-of-the-art APT models.
We hope this research provides valuable insights as preliminary work toward developing transcription models that maintain consistent performance across a range of acoustic conditions.
arXiv Detail & Related papers (2024-10-18T02:31:36Z) - TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation [19.126525226518975]
We propose a speech separation model with significantly reduced parameters and computational costs.
TIGER leverages prior knowledge to divide frequency bands and compresses frequency information.
We show that TIGER achieves performance surpassing state-of-the-art (SOTA) model TF-GridNet.
arXiv Detail & Related papers (2024-10-02T12:21:06Z) - Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z) - SCP-GAN: Self-Correcting Discriminator Optimization for Training
Consistency Preserving Metric GAN on Speech Enhancement Tasks [28.261911789087463]
We introduce several improvements to the GAN training schemes, which can be applied to most GAN-based SE models.
We present self-correcting optimization for training a GAN discriminator on SE tasks, which helps avoid "harmful" training directions.
We have tested our proposed methods on several state-of-the-art GAN-based SE models and obtained consistent improvements.
arXiv Detail & Related papers (2022-10-26T04:48:40Z) - FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization [73.41395947275473]
We propose a novel frequency-aware architecture, in which the domain-specific features are filtered out in the transformed frequency domain.
Experiments on three benchmarks demonstrate significant performance gains, outperforming state-of-the-art methods by margins of 3%, 4%, and 9%, respectively.
arXiv Detail & Related papers (2022-03-24T07:26:29Z) - Time-domain Speech Enhancement with Generative Adversarial Learning [53.74228907273269]
This paper proposes a new framework called the Time-domain Speech Enhancement Generative Adversarial Network (TSEGAN).
TSEGAN is an extension of the generative adversarial network (GAN) in time-domain with metric evaluation to mitigate the scaling problem.
In addition, we provide a new method based on objective function mapping for the theoretical analysis of the performance of Metric GAN.
arXiv Detail & Related papers (2021-03-30T08:09:49Z) - Speaker Representation Learning using Global Context Guided Channel and
Time-Frequency Transformations [67.18006078950337]
We use the global context information to enhance important channels and recalibrate salient time-frequency locations.
The proposed modules, together with a popular ResNet based model, are evaluated on the VoxCeleb1 dataset.
arXiv Detail & Related papers (2020-09-02T01:07:29Z) - Improving noise robust automatic speech recognition with single-channel
time-domain enhancement network [100.1041336974175]
We show that a single-channel time-domain denoising approach can significantly improve ASR performance.
We further show that single-channel noise reduction remains beneficial even for noise-robust ASR systems.
arXiv Detail & Related papers (2020-03-09T09:36:31Z) - Single Channel Speech Enhancement Using Temporal Convolutional Recurrent
Neural Networks [23.88788382262305]
The temporal convolutional recurrent network (TCRN) is an end-to-end model that directly maps a noisy waveform to a clean waveform.
We show that our model improves enhancement performance compared with existing convolutional recurrent networks.
arXiv Detail & Related papers (2020-02-02T04:26:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.