MIMO Self-attentive RNN Beamformer for Multi-speaker Speech Separation
- URL: http://arxiv.org/abs/2104.08450v1
- Date: Sat, 17 Apr 2021 05:02:04 GMT
- Title: MIMO Self-attentive RNN Beamformer for Multi-speaker Speech Separation
- Authors: Xiyun Li and Yong Xu and Meng Yu and Shi-Xiong Zhang and Jiaming Xu
and Bo Xu and Dong Yu
- Abstract summary: Recently, our proposed recurrent neural network (RNN) based all deep learning minimum variance distortionless response (ADL-MVDR) beamformer method yielded superior performance over the conventional MVDR.
We present a self-attentive RNN beamformer that further improves our previous RNN-based beamformer by leveraging the powerful modeling capability of self-attention.
- Score: 45.90599689005832
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, our proposed recurrent neural network (RNN) based all-deep-learning
minimum variance distortionless response (ADL-MVDR) beamformer yielded
superior performance over the conventional MVDR by replacing the matrix
inversion and eigenvalue decomposition with two RNNs. In this work, we present a
self-attentive RNN beamformer that further improves our previous RNN-based
beamformer by leveraging the powerful modeling capability of self-attention.
A temporal-spatial self-attention module is proposed to better learn the
beamforming weights from the speech and noise spatial covariance matrices. The
temporal self-attention module helps the RNN learn global statistics of the
covariance matrices, while the spatial self-attention module is designed to attend to
the cross-channel correlations in the covariance matrices. Furthermore, a
model with multi-channel input, multi-speaker directional features, and multi-speaker
speech separation outputs (MIMO) is developed to improve inference
efficiency. The evaluations demonstrate that our proposed MIMO self-attentive
RNN beamformer improves both automatic speech recognition (ASR) accuracy
and perceptual evaluation of speech quality (PESQ) scores over prior arts.
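For context, the conventional MVDR solution that the ADL-MVDR line of work replaces with RNNs can be sketched for a single time-frequency bin as below. This is a minimal numpy illustration with toy values, not the paper's model; the noise covariance and steering vector are randomly generated stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mics = 4

# Simulated noise spatial covariance for one T-F bin:
# Hermitian positive definite by construction (A A^H + diagonal loading).
A = rng.standard_normal((n_mics, n_mics)) + 1j * rng.standard_normal((n_mics, n_mics))
R_n = A @ A.conj().T + n_mics * np.eye(n_mics)

# Placeholder steering vector toward the target speaker.
d = rng.standard_normal(n_mics) + 1j * rng.standard_normal(n_mics)

# Conventional MVDR weights: w = R_n^{-1} d / (d^H R_n^{-1} d).
# np.linalg.solve avoids forming the explicit inverse that the RNNs learn to replace.
Rn_inv_d = np.linalg.solve(R_n, d)
w = Rn_inv_d / (d.conj() @ Rn_inv_d)

# The distortionless constraint w^H d = 1 should hold.
print(np.round(np.abs(w.conj() @ d), 6))
```

The beamformed output for that bin would then be `w.conj() @ y` for a multi-channel observation `y`; the cited work learns `w` directly from the covariance statistics instead of computing this closed form.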
Related papers
- Multi-Loss Convolutional Network with Time-Frequency Attention for
Speech Enhancement [16.701596804113553]
We explore self-attention in the DPCRN module and design a model called Multi-Loss Convolutional Network with Time-Frequency Attention (MNTFA) for speech enhancement.
Compared to DPRNN, axial self-attention greatly reduces the need for memory and computation.
We propose a joint training method of a multi-resolution STFT loss and a WavLM loss using a pre-trained WavLM network.
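A multi-resolution STFT loss of the kind mentioned above is commonly computed by averaging spectral-magnitude distances over several FFT sizes. The following is a hypothetical numpy sketch of that idea (the resolutions and the plain L1 magnitude distance are illustrative choices, not the paper's exact formulation):

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    # Magnitude STFT via simple framed FFT with a Hann window.
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multi_resolution_stft_loss(est, ref,
                               resolutions=((256, 64), (512, 128), (1024, 256))):
    # Average the L1 spectral-magnitude distance over several (n_fft, hop) pairs,
    # so errors are penalized at multiple time-frequency resolutions.
    losses = []
    for n_fft, hop in resolutions:
        E = stft_mag(est, n_fft, hop)
        R = stft_mag(ref, n_fft, hop)
        losses.append(np.abs(E - R).mean())
    return float(np.mean(losses))

rng = np.random.default_rng(0)
ref = rng.standard_normal(4096)
noisy = ref + 0.1 * rng.standard_normal(4096)
print(multi_resolution_stft_loss(ref, ref))    # identical signals give zero loss
print(multi_resolution_stft_loss(noisy, ref) > 0.0)
```

In the cited work this spectral term is combined with a WavLM-based loss during joint training; only the STFT part is sketched here.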
arXiv Detail & Related papers (2023-06-15T08:48:19Z)
- Complex-Valued Time-Frequency Self-Attention for Speech Dereverberation [39.64103126881576]
We propose a complex-valued T-F attention (TFA) module that models spectral and temporal dependencies.
We validate the effectiveness of our proposed complex-valued TFA module with the deep complex convolutional recurrent network (DCCRN) using the REVERB challenge corpus.
Experimental findings indicate that integrating our complex-TFA module with DCCRN improves overall speech quality and performance of back-end speech applications.
arXiv Detail & Related papers (2022-11-22T23:38:10Z)
- VQ-T: RNN Transducers using Vector-Quantized Prediction Network States [52.48566999668521]
We propose to use vector-quantized long short-term memory units in the prediction network of RNN transducers.
By training the discrete representation jointly with the ASR network, hypotheses can be actively merged for lattice generation.
Our experiments on the Switchboard corpus show that the proposed VQ RNN transducers improve ASR performance over transducers with regular prediction networks.
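The core mechanism here, vector quantization of hidden states, maps each continuous state to its nearest codebook entry, so hypotheses whose prediction networks land on the same code can be merged. A toy numpy sketch (codebook size and dimensions are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 8))   # 16 codes, 8-dim states (toy sizes)

def quantize(h, codebook):
    # Assign each hidden state to the index of its nearest codebook vector (L2).
    # States sharing an index become indistinguishable, enabling hypothesis merging.
    dists = ((h[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(-1)
    return idx, codebook[idx]

h = rng.standard_normal((5, 8))           # five prediction-network states
idx, h_q = quantize(h, codebook)
print(idx.shape, h_q.shape)               # (5,) (5, 8)
```

In the actual VQ-T model the codebook is trained jointly with the ASR network; this sketch only shows the nearest-neighbor assignment step.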
arXiv Detail & Related papers (2022-08-03T02:45:52Z)
- MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances [94.70787497137854]
We propose a multi-scale frequency-channel attention (MFA) to characterize speakers at different scales through a novel dual-path design which consists of a convolutional neural network and TDNN.
We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and complexity.
arXiv Detail & Related papers (2022-02-03T14:57:05Z)
- Multi-turn RNN-T for streaming recognition of multi-party speech [2.899379040028688]
This work takes real-time applicability as the first priority in model design and addresses several challenges from previous work on the multi-speaker recurrent neural network transducer (MS-RNN-T).
We introduce on-the-fly overlapping speech simulation during training, yielding a 14% relative word error rate (WER) improvement on the LibriSpeechMix test set.
We propose a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary number of speakers without changes in the model architecture.
arXiv Detail & Related papers (2021-12-19T17:22:58Z)
- SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition [49.42625022146008]
We present the advantages of applying SRU++ in ASR tasks by comparing with Conformer across multiple ASR benchmarks.
Specifically, our analysis shows that SRU++ can surpass Conformer on long-form speech input by a large margin.
arXiv Detail & Related papers (2021-10-11T19:23:50Z)
- Self-Attention for Audio Super-Resolution [0.0]
We propose a network architecture for audio super-resolution that combines convolution and self-attention.
Attention-based Feature-Wise Linear Modulation (AFiLM) uses self-attention mechanism instead of recurrent neural networks to modulate the activations of the convolutional model.
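Feature-wise linear modulation (FiLM) scales and shifts activations with parameters computed from a conditioning signal; AFiLM computes that signal with self-attention instead of an RNN. A minimal numpy sketch of the idea, with an unlearned single-head attention and a toy gamma/beta parameterization (both are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head scaled dot-product self-attention over time, without learned
    # Q/K/V projections (illustration only); x has shape (T, C).
    scores = softmax(x @ x.T / np.sqrt(x.shape[-1]))
    return scores @ x

def afilm_modulate(features):
    # FiLM-style modulation: a per-position scale (gamma) and shift (beta),
    # here derived from the attention context by a toy parameterization.
    ctx = self_attention(features)            # (T, C)
    gamma, beta = np.tanh(ctx), 0.1 * ctx
    return gamma * features + beta

T, C = 8, 4
x = np.random.default_rng(1).standard_normal((T, C))
y = afilm_modulate(x)
print(y.shape)                                # (8, 4)
```

In the actual model, gamma and beta would come from learned projections of the attention output and modulate convolutional feature maps.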
arXiv Detail & Related papers (2021-08-26T08:05:07Z)
- Streaming Multi-speaker ASR with RNN-T [8.701566919381223]
This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T).
We show that guiding separation with speaker order labels in the former case enhances the high-level speaker tracking capability of RNN-T.
Our best model achieves a WER of 10.2% on simulated 2-speaker Libri data, which is competitive with the previously reported state-of-the-art nonstreaming model (10.3%).
arXiv Detail & Related papers (2020-11-23T19:10:40Z)
- Distributional Reinforcement Learning for mmWave Communications with Intelligent Reflectors on a UAV [119.97450366894718]
A novel communication framework that uses an unmanned aerial vehicle (UAV)-carried intelligent reflector (IR) is proposed.
In order to maximize the downlink sum-rate, the optimal precoding matrix (at the base station) and reflection coefficient (at the IR) are jointly derived.
arXiv Detail & Related papers (2020-11-03T16:50:37Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.