Exploring Self-Attention Mechanisms for Speech Separation
- URL: http://arxiv.org/abs/2202.02884v2
- Date: Sat, 27 May 2023 17:44:21 GMT
- Title: Exploring Self-Attention Mechanisms for Speech Separation
- Authors: Cem Subakan, Mirco Ravanelli, Samuele Cornell, Francois Grondin, Mirko
Bronzi
- Abstract summary: This paper studies in-depth Transformers for speech separation.
We extend our previous findings on the SepFormer by providing results on more challenging noisy and noisy-reverberant datasets.
Finally, we investigate, for the first time in speech separation, the use of efficient self-attention mechanisms such as Linformers, Longformers, and Reformers.
- Score: 11.210834842425955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have enabled impressive improvements in deep learning. They
often outperform recurrent and convolutional models in many tasks while taking
advantage of parallel processing. Recently, we proposed the SepFormer, which
obtains state-of-the-art performance in speech separation with the WSJ0-2/3 Mix
datasets. This paper studies in-depth Transformers for speech separation. In
particular, we extend our previous findings on the SepFormer by providing
results on more challenging noisy and noisy-reverberant datasets, such as
LibriMix, WHAM!, and WHAMR!. Moreover, we extend our model to perform speech
enhancement and provide experimental evidence on denoising and dereverberation
tasks. Finally, we investigate, for the first time in speech separation, the
use of efficient self-attention mechanisms such as Linformers, Longformers, and
Reformers. We found that they reduce memory requirements significantly. For
example, we show that the Reformer-based attention outperforms the popular
Conv-TasNet model on the WSJ0-2Mix dataset while being faster at inference and
comparable in terms of memory consumption.
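To make the memory claim concrete, below is a minimal sketch of Linformer-style attention in PyTorch. This is our illustration, not the SepFormer codebase: the module name, shapes, and the projection size k are assumptions. The idea is that learned projections compress the key/value length axis from n down to a fixed k, so the attention map costs O(n·k) memory rather than O(n²).

```python
# Minimal Linformer-style attention sketch: project keys/values from length
# n to a fixed k so the attention map is (n x k) instead of (n x n).
# Hypothetical shapes; not the SepFormer implementation.
import torch
import torch.nn as nn

class LinformerSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads, max_len, k=256):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Learned length-axis compressions: max_len -> k.
        self.proj_k = nn.Parameter(torch.randn(max_len, k) / max_len ** 0.5)
        self.proj_v = nn.Parameter(torch.randn(max_len, k) / max_len ** 0.5)

    def forward(self, x):                   # x: (batch, n, d_model)
        b, n, _ = x.shape
        q, key, v = self.qkv(x).chunk(3, dim=-1)
        # Compress the time axis of keys/values before attention.
        key = torch.einsum("bnd,nk->bkd", key, self.proj_k[:n])
        v = torch.einsum("bnd,nk->bkd", v, self.proj_v[:n])
        split = lambda t: t.view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        q, key, v = split(q), split(key), split(v)
        attn = (q @ key.transpose(-2, -1)) / self.d_head ** 0.5   # (b, h, n, k)
        out = attn.softmax(dim=-1) @ v                            # (b, h, n, d_head)
        return self.out(out.transpose(1, 2).reshape(b, n, -1))
```

For example, for a 10-second mixture framed every 2 ms (n ≈ 5000), the attention map per head shrinks from 5000x5000 to 5000x256. Longformers and Reformers take different routes (windowed attention and locality-sensitive hashing, respectively), but the practical effect measured in the paper is the same: attention memory grows far more slowly with sequence length.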
Related papers
- TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation [19.126525226518975]
We propose a speech separation model with significantly reduced parameters and computational costs.
TIGER leverages prior knowledge to divide frequency bands and compresses frequency information.
We show that TIGER achieves performance surpassing the state-of-the-art (SOTA) model TF-GridNet.
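As a rough illustration of the band-split idea (our sketch; TIGER's actual module, band layout, and names may differ), the frequency axis of a spectrogram can be cut into unequal bands, each compressed to a shared embedding size:

```python
# Hypothetical band-split module: cut the frequency axis into bands and
# compress each band to a shared embedding size with a per-band linear map.
import torch
import torch.nn as nn

class BandSplit(nn.Module):
    def __init__(self, band_edges, d_model):
        super().__init__()
        # band_edges such as [0, 16, 48, 128, 257]: band i covers
        # frequency bins band_edges[i] .. band_edges[i+1].
        self.band_edges = band_edges
        self.proj = nn.ModuleList(
            nn.Linear(hi - lo, d_model)
            for lo, hi in zip(band_edges[:-1], band_edges[1:])
        )

    def forward(self, spec):                # spec: (batch, time, freq) magnitudes
        bands = [
            proj(spec[..., lo:hi])          # (batch, time, d_model) per band
            for (lo, hi), proj in zip(
                zip(self.band_edges[:-1], self.band_edges[1:]), self.proj)
        ]
        return torch.stack(bands, dim=1)    # (batch, n_bands, time, d_model)
```

Narrow low-frequency bands keep resolution where speech energy concentrates, while wide high-frequency bands are compressed aggressively, which is where the parameter and compute savings come from.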
arXiv Detail & Related papers (2024-10-02T12:21:06Z)
- Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion [93.32354378820648]
We introduce MVSD, a mutual learning framework based on diffusion models.
MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks.
Our framework can improve the performance of the reverberator and dereverberator.
arXiv Detail & Related papers (2024-07-15T00:47:56Z)
- Efficient Monaural Speech Enhancement using Spectrum Attention Fusion [15.8309037583936]
We present an improvement for speech enhancement models that maintains the expressiveness of self-attention while significantly reducing model complexity.
We construct a convolutional module to replace several self-attention layers in a speech Transformer, allowing the model to more efficiently fuse spectral features.
Our proposed model achieves comparable or better results than SOTA models with significantly fewer parameters (0.58M) on the Voice Bank + DEMAND dataset.
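The general pattern, sketched below under our own assumptions rather than the paper's exact design, is to swap a quadratic-cost self-attention layer for a cheap depthwise-separable convolution that fuses neighboring time-frequency features:

```python
# Hedged sketch: a depthwise-separable 2-D convolution standing in for a
# self-attention layer over a (freq, time) feature map. Channel counts and
# kernel sizes are illustrative.
import torch.nn as nn

class ConvSpectrumFusion(nn.Module):
    def __init__(self, channels=64, kernel=(3, 3)):
        super().__init__()
        pad = (kernel[0] // 2, kernel[1] // 2)
        # Depthwise conv mixes nearby bins per channel; pointwise conv mixes channels.
        self.depthwise = nn.Conv2d(channels, channels, kernel,
                                   padding=pad, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.GELU()

    def forward(self, x):                   # x: (batch, channels, freq, time)
        return x + self.act(self.norm(self.pointwise(self.depthwise(x))))
```

A block like this needs on the order of channels x (kernel area + channels) weights, versus the several d_model² projection matrices of a Transformer layer, which illustrates how a total parameter count can drop to a fraction of a million.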
arXiv Detail & Related papers (2023-08-04T11:39:29Z)
- Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning [9.84949849886926]
We propose Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation.
The new SE-Conformer network can model audio sequences across multiple dimensions and scales.
arXiv Detail & Related papers (2023-03-07T08:53:20Z)
- Directed Speech Separation for Automatic Speech Recognition of Long Form Conversational Speech [10.291482850329892]
We propose a speaker conditioned separator trained on speaker embeddings extracted directly from the mixed signal.
We achieve significant improvements in word error rate (WER) on real conversational data without the need for an additional re-stitching step.
arXiv Detail & Related papers (2021-12-10T23:07:48Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Utilizing Self-supervised Representations for MOS Prediction [51.09985767946843]
Existing evaluations usually require clean references or parallel ground truth data.
Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception.
We develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data.
arXiv Detail & Related papers (2021-04-07T09:44:36Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
- Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time.
We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
arXiv Detail & Related papers (2021-02-08T11:45:02Z)
- Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
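A hedged sketch of how a memory-augmented Transformer can keep memory constant (our illustration; Memformer's actual architecture differs in its details): a fixed number of learned slots is read through cross-attention and rewritten after each segment, so cost per segment does not grow with total sequence length.

```python
# Sketch of segment-recurrent processing with a fixed-size external memory.
import torch
import torch.nn as nn

class MemorySegmentEncoder(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_slots=16):
        super().__init__()
        self.mem0 = nn.Parameter(torch.randn(n_slots, d_model))
        self.local = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.read = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.write = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, segments):            # iterable of (batch, seg_len, d_model)
        outs, mem = [], None
        for seg in segments:
            if mem is None:                 # initialize the slots per batch
                mem = self.mem0.unsqueeze(0).expand(seg.shape[0], -1, -1)
            h = self.local(seg)                     # self-attention within segment
            h = h + self.read(h, mem, mem)[0]       # read: tokens attend to memory
            mem = mem + self.write(mem, h, h)[0]    # write: slots attend to tokens
            outs.append(h)
        return torch.cat(outs, dim=1)
```

Attention is only ever computed within a segment or against the n_slots memory slots, so peak activation memory stays bounded no matter how many segments stream through.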
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
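As a rough sketch of that design (ours, not the paper's released code), a Conv1d encoder-decoder on raw waveforms can be made causal by left-padding each layer so it never sees future samples, with encoder activations reused as skip connections in the decoder:

```python
# Hypothetical causal waveform encoder-decoder with skip connections.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalEncDec(nn.Module):
    def __init__(self, channels=(16, 32, 64), kernel=8, stride=4):
        super().__init__()
        self.kernel, self.stride = kernel, stride
        chs = (1,) + channels
        self.enc = nn.ModuleList(
            nn.Conv1d(ci, co, kernel, stride) for ci, co in zip(chs[:-1], chs[1:]))
        self.dec = nn.ModuleList(
            nn.ConvTranspose1d(co, ci, kernel, stride)
            for ci, co in zip(chs[:-1], chs[1:]))

    def forward(self, x):                   # x: (batch, 1, samples)
        length = x.shape[-1]
        skips = []
        for conv in self.enc:
            # Left-pad so each layer only looks at past samples (causal).
            x = F.relu(conv(F.pad(x, (self.kernel - self.stride, 0))))
            skips.append(x)
        skips.pop()                         # deepest activation is x itself
        for deconv in reversed(self.dec):
            x = deconv(x)
            if skips:
                skip = skips.pop()          # trim to match, then skip-connect
                x = F.relu(x[..., :skip.shape[-1]] + skip)
        return x[..., :length]              # enhanced waveform, original length
```

Because every convolution is strictly causal and the strides keep compute per sample fixed, a model along these lines can process audio frame by frame in real time on a laptop CPU.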
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.