Attention is All You Need in Speech Separation
- URL: http://arxiv.org/abs/2010.13154v2
- Date: Mon, 8 Mar 2021 21:24:43 GMT
- Title: Attention is All You Need in Speech Separation
- Authors: Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, Jianyuan Zhong
- Abstract summary: We propose a novel RNN-free Transformer-based neural network for speech separation.
The proposed model achieves state-of-the-art (SOTA) performance on the standard WSJ0-2/3mix datasets.
- Score: 12.57578429586883
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recurrent Neural Networks (RNNs) have long been the dominant architecture in
sequence-to-sequence learning. RNNs, however, are inherently sequential models
that do not allow parallelization of their computations. Transformers are
emerging as a natural alternative to standard RNNs, replacing recurrent
computations with a multi-head attention mechanism. In this paper, we propose
the SepFormer, a novel RNN-free Transformer-based neural network for speech
separation. The SepFormer learns short and long-term dependencies with a
multi-scale approach that employs transformers. The proposed model achieves
state-of-the-art (SOTA) performance on the standard WSJ0-2/3mix datasets. It
reaches an SI-SNRi of 22.3 dB on WSJ0-2mix and an SI-SNRi of 19.5 dB on
WSJ0-3mix. The SepFormer inherits the parallelization advantages of
Transformers and achieves competitive performance even when downsampling the
encoded representation by a factor of 8. It is thus significantly faster and
less memory-demanding than the latest speech separation systems with
comparable performance.
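The dual-scale idea described in the abstract can be pictured with a short sketch. The following is a minimal PyTorch-style illustration, not the released SepFormer: the encoded mixture is split into chunks, one transformer models short-term dependencies within each chunk, and a second transformer models long-term dependencies across chunks; layer sizes, the chunking scheme, and the surrounding encoder/decoder are simplified placeholders.

```python
# Minimal sketch of a dual-scale (intra/inter-chunk) transformer block.
import torch
import torch.nn as nn

class DualScaleBlock(nn.Module):
    def __init__(self, d_model=64, nhead=4, chunk_size=50):
        super().__init__()
        self.chunk_size = chunk_size
        self.intra = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.inter = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

    def forward(self, x):                        # x: (B, T, d_model)
        b, t, d = x.shape
        c = self.chunk_size
        pad = (-t) % c
        x = nn.functional.pad(x, (0, 0, 0, pad)) # pad time axis to a whole number of chunks
        n = x.shape[1] // c                      # number of chunks
        x = x.reshape(b, n, c, d)
        # Short-term: attention within each chunk (sequence length = chunk size).
        x = self.intra(x.reshape(b * n, c, d)).reshape(b, n, c, d)
        # Long-term: attention across chunks at each intra-chunk position.
        x = x.transpose(1, 2).reshape(b * c, n, d)
        x = self.inter(x).reshape(b, c, n, d).transpose(1, 2)
        return x.reshape(b, n * c, d)[:, :t]     # drop the padding

frames = torch.randn(2, 400, 64)                 # (batch, encoded frames, features)
print(DualScaleBlock()(frames).shape)            # torch.Size([2, 400, 64])
```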
Related papers
- Were RNNs All We Needed? [53.393497486332]
We revisit traditional recurrent neural networks (RNNs) from over a decade ago.
We show that by removing the hidden-state dependencies from their input, forget, and update gates, LSTMs and GRUs no longer require backpropagation through time (BPTT) and can be trained efficiently in parallel.
arXiv Detail & Related papers (2024-10-02T03:06:49Z)
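A minimal sketch of the idea as summarized above, with illustrative names rather than the paper's exact formulation: the gate and candidate depend only on the current input, so the hidden-state update is linear in h and could equally be evaluated with a parallel scan (the loop below is the sequential reference).

```python
# Gates computed from the input alone, so h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t
# is a linear recurrence in h. The names here are illustrative, not the paper's.
import torch
import torch.nn as nn

class InputOnlyGatedRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.gate = nn.Linear(input_size, hidden_size)       # no h_{t-1} term
        self.candidate = nn.Linear(input_size, hidden_size)  # no h_{t-1} term

    def forward(self, x):                      # x: (B, T, input_size)
        z = torch.sigmoid(self.gate(x))        # all gates computed at once
        h_tilde = self.candidate(x)            # all candidates computed at once
        # Sequential reference; a log-depth parallel scan gives the same result.
        h = torch.zeros(x.size(0), z.size(-1), device=x.device)
        outs = []
        for t in range(x.size(1)):
            h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)        # (B, T, hidden_size)

y = InputOnlyGatedRNN(8, 16)(torch.randn(2, 5, 8))
print(y.shape)  # torch.Size([2, 5, 16])
```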
- Attention as an RNN [66.5420926480473]
We show that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently.
We introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm.
We show Aarens achieve comparable performance to Transformers on 38 datasets spread across four popular sequential problem settings.
arXiv Detail & Related papers (2024-05-22T19:45:01Z)
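The recurrence view can be made concrete with a generic sketch (not the Aaren model itself): causal attention of a single query over a growing prefix is maintained as a running numerator, denominator, and max, and the per-step outputs are exactly the cumulative quantities a parallel prefix scan could also produce.

```python
# Attention over a prefix as a recurrence: running numerator/denominator with
# a running max for numerical stability. Generic illustration, not Aaren.
import torch

def prefix_attention(q, k, v):
    """q: (d,), k, v: (T, d). Returns o_1..o_T where o_t attends over items 1..t."""
    d = q.shape[0]
    num = torch.zeros_like(v[0])          # running sum of exp(score_i) * v_i
    den = torch.zeros(())                 # running sum of exp(score_i)
    m = torch.tensor(float('-inf'))       # running max score
    outs = []
    for k_t, v_t in zip(k, v):
        s = (q @ k_t) / d ** 0.5
        m_new = torch.maximum(m, s)
        scale = torch.exp(m - m_new)      # rescale old sums (0 on the first step)
        num = num * scale + torch.exp(s - m_new) * v_t
        den = den * scale + torch.exp(s - m_new)
        m = m_new
        outs.append(num / den)
    return torch.stack(outs)

q, k, v = torch.randn(4), torch.randn(6, 4), torch.randn(6, 4)
full = torch.softmax((k @ q) / q.shape[0] ** 0.5, dim=0) @ v   # ordinary attention
print(torch.allclose(prefix_attention(q, k, v)[-1], full, atol=1e-6))  # True
```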
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- Resource-Efficient Separation Transformer [14.666016177212837]
This paper explores Transformer-based speech separation with a reduced computational cost.
Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture.
The RE-SepFormer reaches competitive performance on the popular WSJ0-2Mix and WHAM! datasets in both causal and non-causal settings.
arXiv Detail & Related papers (2022-06-19T23:37:24Z)
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on an empirical combination of self-attention and convolution.
nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
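A schematic sketch of what "interleaved" means here, using generic convolution and self-attention stages rather than nnFormer's actual blocks: local mixing with 3D convolutions alternates with global mixing over voxel tokens.

```python
# Generic interleaving of convolution (local) and self-attention (global) stages.
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm3d(channels),
            nn.GELU(),
        )
    def forward(self, x):          # x: (B, C, D, H, W)
        return x + self.block(x)   # residual local (convolutional) mixing

class AttnStage(nn.Module):
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                               batch_first=True)
    def forward(self, x):                      # x: (B, C, D, H, W)
        b, c, d, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, D*H*W, C) voxel tokens
        tokens = self.attn(tokens)             # global (self-attention) mixing
        return tokens.transpose(1, 2).reshape(b, c, d, h, w)

class InterleavedEncoder(nn.Module):
    """Alternate convolution and self-attention stages."""
    def __init__(self, channels, depth=2):
        super().__init__()
        self.stages = nn.ModuleList(
            [m for _ in range(depth) for m in (ConvStage(channels), AttnStage(channels))])
    def forward(self, x):
        for stage in self.stages:
            x = stage(x)
        return x

feats = torch.randn(1, 32, 8, 16, 16)          # small volumetric feature map
print(InterleavedEncoder(32)(feats).shape)     # torch.Size([1, 32, 8, 16, 16])
```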
We introduce a "fully tensorized" RNN architecture which jointly encodes the separate weight matrices within each recurrent cell.
This approach reduces model size by several orders of magnitude, while still maintaining similar or better performance compared to standard RNNs.
arXiv Detail & Related papers (2020-10-08T18:24:12Z)
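A much-simplified illustration of jointly encoding the per-gate weight matrices: below they are stored as one 3-way tensor represented by a shared low-rank (CP-style) factorization. The paper itself uses the tensor-train format; this stand-in only shows how the gates can share a single compressed parameterization.

```python
# Simplified stand-in for joint weight encoding: a shared low-rank factorization
# of the (gate, input+hidden, hidden) weight tensor of a GRU-like cell.
import torch
import torch.nn as nn

class JointlyFactorizedGRUCell(nn.Module):
    def __init__(self, input_size, hidden_size, rank=8):
        super().__init__()
        g = 3  # update, reset, candidate gates
        self.gate_factor = nn.Parameter(torch.randn(g, rank) * 0.1)
        self.in_factor = nn.Parameter(torch.randn(rank, input_size + hidden_size) * 0.1)
        self.hid_factor = nn.Parameter(torch.randn(rank, hidden_size) * 0.1)
        self.bias = nn.Parameter(torch.zeros(g, hidden_size))

    def weights(self):
        # Reconstruct the (3, input+hidden, hidden) weight tensor on the fly.
        return torch.einsum('gr,ri,rh->gih',
                            self.gate_factor, self.in_factor, self.hid_factor)

    def forward(self, x, h):
        W = self.weights()
        xh = torch.cat([x, h], dim=-1)
        z = torch.sigmoid(xh @ W[0] + self.bias[0])   # update gate
        r = torch.sigmoid(xh @ W[1] + self.bias[1])   # reset gate
        xrh = torch.cat([x, r * h], dim=-1)
        n = torch.tanh(xrh @ W[2] + self.bias[2])     # candidate state
        return (1 - z) * n + z * h

cell = JointlyFactorizedGRUCell(16, 32)
h = torch.zeros(4, 32)
for t in range(10):
    h = cell(torch.randn(4, 16), h)
print(h.shape)  # torch.Size([4, 32])
```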
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer and convolutional neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test-clean/test-other.
arXiv Detail & Related papers (2020-05-16T20:56:25Z)
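A schematic sketch of a convolution-augmented block in the spirit of the summary: self-attention provides global context, a depthwise convolution module provides local context, and two half-step feed-forward modules sandwich them. Norm placement, kernel size, and dropout are simplified relative to the actual Conformer block.

```python
# Schematic "macaron" block: half-step FFN, self-attention, conv module, half-step FFN.
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    def __init__(self, d_model, kernel_size=15):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.pointwise2 = nn.Conv1d(d_model, d_model, 1)

    def forward(self, x):                        # x: (B, T, d_model)
        y = self.norm(x).transpose(1, 2)         # conv layers expect (B, C, T)
        y = nn.functional.glu(self.pointwise1(y), dim=1)
        y = nn.functional.silu(self.depthwise(y))
        return self.pointwise2(y).transpose(1, 2)

class ConformerStyleBlock(nn.Module):
    def __init__(self, d_model=144, nhead=4):
        super().__init__()
        self.ff1 = nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.conv = ConvModule(d_model)
        self.ff2 = nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.out_norm = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (B, T, d_model)
        x = x + 0.5 * self.ff1(x)                # first half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]   # global context
        x = x + self.conv(x)                     # local context
        x = x + 0.5 * self.ff2(x)                # second half-step feed-forward
        return self.out_norm(x)

print(ConformerStyleBlock()(torch.randn(2, 100, 144)).shape)  # torch.Size([2, 100, 144])
```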