Efficient Monaural Speech Enhancement using Spectrum Attention Fusion
- URL: http://arxiv.org/abs/2308.02263v1
- Date: Fri, 4 Aug 2023 11:39:29 GMT
- Title: Efficient Monaural Speech Enhancement using Spectrum Attention Fusion
- Authors: Jinyu Long and Jetic G\=u and Binhao Bai and Zhibo Yang and Ping Wei
and Junli Li
- Abstract summary: We present an improvement for speech enhancement models that maintains the expressiveness of self-attention while significantly reducing model complexity.
We construct a convolutional module to replace several self-attention layers in a speech Transformer, allowing the model to more efficiently fuse spectral features.
Our proposed model achieves comparable or better results than SOTA models with significantly fewer parameters (0.58M) on the Voice Bank + DEMAND dataset.
- Score: 15.8309037583936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech enhancement is a demanding task in automated speech processing
pipelines, focusing on separating clean speech from noisy channels.
Transformer-based models have recently bested RNN and CNN models in speech
enhancement; however, they are much more computationally expensive and require
far more high-quality training data, which is always hard to come by.
In this paper, we present an improvement for speech enhancement models that
maintains the expressiveness of self-attention while significantly reducing
model complexity, which we have termed Spectrum Attention Fusion. We carefully
construct a convolutional module to replace several self-attention layers in a
speech Transformer, allowing the model to more efficiently fuse spectral
features. Our proposed model achieves comparable or better results than SOTA
models with significantly fewer parameters (0.58M) on the Voice Bank + DEMAND
dataset.
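The abstract specifies the approach only at a high level: a carefully constructed convolutional module stands in for several self-attention layers so that spectral features are fused more cheaply. As a rough, hypothetical PyTorch sketch only (the paper's actual module design, channel counts, and tensor layout are not given in this abstract), one way such a drop-in block could look is a depthwise-separable 2-D convolution with a residual connection:

```python
import torch
import torch.nn as nn

class ConvSpectralFusion(nn.Module):
    """Hypothetical convolutional stand-in for a self-attention layer.

    Mixes information across time and frequency with a depthwise-separable
    2-D convolution instead of attention. Layout assumed here:
    (batch, channels, time, freq); all sizes are illustrative, not the
    paper's actual design.
    """
    def __init__(self, channels: int, kernel: int = 7):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel,
                                   padding=kernel // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the original spectral features intact.
        return x + self.act(self.norm(self.pointwise(self.depthwise(x))))

if __name__ == "__main__":
    spec = torch.randn(2, 64, 100, 257)   # (batch, ch, frames, freq bins)
    fused = ConvSpectralFusion(64)(spec)
    print(fused.shape)                    # torch.Size([2, 64, 100, 257])
```

The depthwise convolution mixes information within a local time-frequency neighborhood at far lower cost than full self-attention, which is the general trade-off the abstract describes.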
Related papers
- TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation [19.126525226518975]
We propose a speech separation model with a significantly reduced parameter count and computational cost.
TIGER leverages prior knowledge to divide frequency bands and compresses frequency information.
We show that TIGER surpasses the performance of the state-of-the-art (SOTA) model TF-GridNet.
arXiv Detail & Related papers (2024-10-02T12:21:06Z)
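The TIGER entry above mentions dividing frequency bands using prior knowledge and compressing frequency information. A minimal band-split sketch, assuming an STFT magnitude input and arbitrary placeholder band edges (the paper's actual band layout and compression scheme are not given in this summary):

```python
import torch
import torch.nn as nn

class BandSplit(nn.Module):
    """Split STFT frequency bins into unequal bands and compress each band
    to a fixed-width embedding. Band edges here are placeholders, finer at
    low frequencies where speech energy concentrates."""
    def __init__(self, edges=(0, 32, 64, 128, 257), dim: int = 48):
        super().__init__()
        self.edges = edges
        self.proj = nn.ModuleList(
            nn.Linear(hi - lo, dim) for lo, hi in zip(edges, edges[1:])
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, time, freq) -> (batch, time, n_bands, dim)
        bands = [p(spec[..., lo:hi]) for p, (lo, hi) in
                 zip(self.proj, zip(self.edges, self.edges[1:]))]
        return torch.stack(bands, dim=2)

if __name__ == "__main__":
    x = torch.randn(2, 100, 257)          # (batch, frames, freq bins)
    print(BandSplit()(x).shape)           # torch.Size([2, 100, 4, 48])
```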
- Non-autoregressive real-time Accent Conversion model with voice cloning [0.0]
We have developed a non-autoregressive model for real-time accent conversion with voice cloning.
The model generates native-sounding L1 speech with minimal latency based on input L2 speech.
The model can save, clone, and change the timbre, gender, and accent of the speaker's voice in real time.
arXiv Detail & Related papers (2024-05-21T19:07:26Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- Guided Speech Enhancement Network [17.27704800294671]
The multi-microphone speech enhancement problem is often decomposed into two decoupled steps: a beamformer that provides spatial filtering, followed by a single-channel speech enhancement model.
We propose a speech enhancement solution that takes both the raw microphone signal and the beamformer output as inputs to an ML model.
We name the ML module in our solution GSENet, short for Guided Speech Enhancement Network.
arXiv Detail & Related papers (2023-03-13T21:48:20Z)
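The GSENet entry above describes feeding both the raw microphone signal and the beamformer output to an ML model. A toy PyTorch sketch of that dual-input arrangement, with a placeholder convolutional network (the actual GSENet architecture is not specified in this summary):

```python
import torch
import torch.nn as nn

class DualInputEnhancer(nn.Module):
    """Toy enhancer that stacks the raw microphone signal and the
    beamformer output as two input channels, so the network can use the
    beamformer as guidance. Layer sizes are illustrative placeholders."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2, hidden, 9, padding=4), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 9, padding=4), nn.ReLU(),
            nn.Conv1d(hidden, 1, 9, padding=4),
        )

    def forward(self, raw: torch.Tensor, beamformed: torch.Tensor):
        x = torch.stack([raw, beamformed], dim=1)  # (batch, 2, samples)
        return self.net(x).squeeze(1)              # enhanced waveform

if __name__ == "__main__":
    raw, bf = torch.randn(2, 16000), torch.randn(2, 16000)
    print(DualInputEnhancer()(raw, bf).shape)      # torch.Size([2, 16000])
```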
- CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models [62.60723685118747]
Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data.
We propose an efficient tuning method specifically designed for SSL speech models, applying CNN adapters at the feature extractor.
We empirically find that adding CNN adapters to the feature extractor helps adaptation on emotion and speaker tasks.
arXiv Detail & Related papers (2022-12-01T08:50:12Z)
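The CHAPTER entry above applies CNN adapters at the feature extractor of a frozen SSL speech model. A minimal sketch of a convolutional adapter trained on top of frozen features; the bottleneck size and kernel width are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

class CNNAdapter(nn.Module):
    """Lightweight convolutional adapter placed after a frozen SSL feature
    extractor; only the adapter is trained. Shapes are illustrative."""
    def __init__(self, dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Conv1d(dim, bottleneck, 3, padding=1)
        self.up = nn.Conv1d(bottleneck, dim, 3, padding=1)
        self.act = nn.GELU()

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, dim, frames); residual keeps the frozen features.
        return feats + self.up(self.act(self.down(feats)))

if __name__ == "__main__":
    frozen_feats = torch.randn(2, 512, 100)        # frozen extractor output
    print(CNNAdapter(512)(frozen_feats).shape)     # torch.Size([2, 512, 100])
```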
- Exploring Self-Attention Mechanisms for Speech Separation [11.210834842425955]
This paper presents an in-depth study of Transformers for speech separation.
We extend our previous findings on the SepFormer by providing results on more challenging noisy and noisy-reverberant datasets.
Finally, we investigate, for the first time in speech separation, the use of efficient self-attention mechanisms such as Linformers, Longformers, and Reformers.
arXiv Detail & Related papers (2022-02-06T23:13:27Z)
- Residual Adapters for Parameter-Efficient ASR Adaptation to Atypical and Accented Speech [5.960279280033886]
We show that by adding a relatively small number of extra parameters to the encoder layers via so-called residual adapters, we can achieve adaptation gains similar to those of full model fine-tuning.
We demonstrate this on two speech adaptation tasks (atypical and accented speech) and for two state-of-the-art ASR architectures.
arXiv Detail & Related papers (2021-09-14T20:04:47Z)
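The residual-adapter entry above adds a small number of parameters to frozen encoder layers. A minimal sketch of a standard bottleneck residual adapter, with illustrative sizes (the paper's exact adapter dimensions are not given in this summary):

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Bottleneck residual adapter: a small down/up projection added to a
    frozen encoder layer's output, so only a few extra parameters are
    trained per layer. Sizes are illustrative."""
    def __init__(self, dim: int = 512, bottleneck: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Residual path leaves the frozen layer's output untouched.
        return h + self.up(torch.relu(self.down(self.norm(h))))

if __name__ == "__main__":
    h = torch.randn(2, 100, 512)       # (batch, frames, model dim)
    print(ResidualAdapter()(h).shape)  # torch.Size([2, 100, 512])
```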
- Efficient End-to-End Speech Recognition Using Performers in Conformers [74.71219757585841]
We propose to reduce the complexity of model architectures in addition to model sizes.
The proposed model yields competitive performance on the LibriSpeech corpus with 10 million parameters and linear complexity.
arXiv Detail & Related papers (2020-11-09T05:22:57Z)
- Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement [53.47564132861866]
We find that a hybrid architecture, namely CNN-TT, is capable of maintaining good quality with a reduced model parameter size.
CNN-TT is composed of several convolutional layers at the bottom for feature extraction to improve speech quality.
arXiv Detail & Related papers (2020-07-25T22:21:05Z)
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer- and convolutional neural network (CNN)-based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% without a language model and 1.9%/3.9% with an external language model on test/test-other.
arXiv Detail & Related papers (2020-05-16T20:56:25Z)
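The Conformer entry above combines convolution (local patterns) with self-attention (global context), which is closely related to the main paper's theme. A simplified sketch of a Conformer-style block, omitting the paper's Macaron feed-forward halves, relative positional encoding, and gated convolution module:

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Conformer-style convolution module (simplified: no gating)."""
    def __init__(self, dim: int, kernel: int = 31):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.dw = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.pw = nn.Conv1d(dim, dim, 1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.norm(x).transpose(1, 2)   # (batch, dim, time) for Conv1d
        y = self.pw(self.act(self.dw(y)))
        return y.transpose(1, 2)

class ConformerBlockSketch(nn.Module):
    """Self-attention for global context plus convolution for local
    patterns, each wrapped in a residual connection."""
    def __init__(self, dim: int = 144, heads: int = 4):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.attn_norm(x)
        a, _ = self.attn(q, q, q, need_weights=False)
        x = x + a
        return x + self.conv(x)

if __name__ == "__main__":
    seq = torch.randn(2, 100, 144)              # (batch, frames, dim)
    print(ConformerBlockSketch()(seq).shape)    # torch.Size([2, 100, 144])
```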
- A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency [88.08721721440429]
We develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer.
We find that RNN-T+LAS offers a better WER-latency tradeoff than a conventional model.
arXiv Detail & Related papers (2020-03-28T05:00:33Z)