Improved MVDR Beamforming Using LSTM Speech Models to Clean Spatial Clustering Masks
- URL: http://arxiv.org/abs/2012.02191v1
- Date: Wed, 2 Dec 2020 22:35:00 GMT
- Title: Improved MVDR Beamforming Using LSTM Speech Models to Clean Spatial Clustering Masks
- Authors: Zhaoheng Ni, Felix Grezes, Viet Anh Trinh, Michael I. Mandel
- Abstract summary: Spatial clustering techniques can achieve significant multi-channel noise reduction across relatively arbitrary microphone configurations.
LSTM neural networks have successfully been trained to recognize speech from noise on single-channel inputs, but have difficulty taking full advantage of the information in multi-channel recordings.
This paper integrates these two approaches, training LSTM speech models to clean the masks generated by the Model-based EM Source Separation and Localization (MESSL) spatial clustering method.
- Score: 14.942060304734497
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spatial clustering techniques can achieve significant multi-channel noise
reduction across relatively arbitrary microphone configurations, but have
difficulty incorporating a detailed speech/noise model. In contrast, LSTM
neural networks have successfully been trained to recognize speech from noise
on single-channel inputs, but have difficulty taking full advantage of the
information in multi-channel recordings. This paper integrates these two
approaches, training LSTM speech models to clean the masks generated by the
Model-based EM Source Separation and Localization (MESSL) spatial clustering
method. By doing so, it attains both the spatial separation performance and
generality of multi-channel spatial clustering and the signal modeling
performance of multiple parallel single-channel LSTM speech enhancers. Our
experiments show that when our system is applied to the CHiME-3 dataset of
noisy tablet recordings, it increases speech quality as measured by the
Perceptual Evaluation of Speech Quality (PESQ) algorithm and reduces the word
error rate of the baseline CHiME-3 speech recognizer, as compared to the
default BeamformIt beamformer.
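To make the method concrete, the sketch below shows the standard mask-driven MVDR recipe that a pipeline like this one builds on: a speech time-frequency mask (here standing in for a MESSL mask after LSTM cleaning) weights the spatial covariance estimates that define the beamformer. It is a minimal illustration, not the paper's implementation; the function name, array shapes, and the principal-eigenvector steering estimate are assumptions.

```python
import numpy as np

def mvdr_from_mask(Y, mask, eps=1e-8):
    """Y: multichannel STFT, shape (F, T, C); mask: speech mask in [0, 1], shape (F, T).
    Returns the beamformed single-channel STFT, shape (F, T)."""
    n_freq, n_frames, n_chan = Y.shape
    out = np.zeros((n_freq, n_frames), dtype=Y.dtype)
    for f in range(n_freq):
        Yf = Y[f]                      # (T, C): one channel vector per frame
        m = mask[f][:, None]           # (T, 1): speech mask for this bin
        # Mask-weighted spatial covariances: R[i, j] = sum_t w_t y_t[i] conj(y_t[j]).
        R_s = (m * Yf).T @ Yf.conj() / (m.sum() + eps)              # speech
        R_n = ((1 - m) * Yf).T @ Yf.conj() / ((1 - m).sum() + eps)  # noise
        R_n = R_n + eps * np.eye(n_chan)   # diagonal loading for invertibility
        # Steering vector: principal eigenvector of the speech covariance.
        d = np.linalg.eigh(R_s)[1][:, -1]
        # MVDR weights: w = R_n^{-1} d / (d^H R_n^{-1} d).
        Rn_inv_d = np.linalg.solve(R_n, d)
        w = Rn_inv_d / (d.conj() @ Rn_inv_d + eps)
        out[f] = Yf @ w.conj()         # apply w^H y_t to every frame t
    return out
```

In the paper's pipeline the enhanced STFT would then be inverted to a waveform, scored with PESQ, and decoded by the baseline CHiME-3 recognizer for WER; those evaluation steps are omitted here.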
Related papers
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The cross-speaker encoding (CSE) network addresses the limitations of SIMO models by aggregating cross-speaker representations.
It is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- Exploring Self-Supervised Contrastive Learning of Spatial Sound Event Representation [21.896817015593122]
MC-SimCLR learns joint spectral and spatial representations from unlabeled spatial audio.
We propose a multi-level data augmentation pipeline that augments different levels of audio features.
We find that linear layers on top of the learned representation significantly outperform supervised models in terms of both event classification accuracy and localization error.
arXiv Detail & Related papers (2023-09-27T18:23:03Z)
- Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs), represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers, are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Multi-Channel End-to-End Neural Diarization with Distributed Microphones [53.99406868339701]
We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input.
We also propose a model adaptation method using only single-channel recordings.
arXiv Detail & Related papers (2021-10-10T03:24:03Z)
- Combining Spatial Clustering with LSTM Speech Models for Multichannel Speech Enhancement [3.730592618611028]
Recurrent neural networks using the LSTM architecture can achieve significant single-channel noise reduction.
It is not obvious, however, how to apply them to multi-channel inputs in a way that can generalize to new microphone configurations.
This paper combines the two approaches to attain both the spatial separation performance and generality of multichannel spatial clustering and the signal modeling performance of LSTM speech models (see the mask-cleaning sketch after this list).
arXiv Detail & Related papers (2020-12-02T22:37:50Z)
- Enhancement of Spatial Clustering-Based Time-Frequency Masks using LSTM Neural Networks [3.730592618611028]
We use LSTMs to enhance spatial clustering-based time-frequency masks.
We achieve both the signal modeling performance of multiple single-channel LSTM-DNN speech enhancers and the signal separation performance of multi-channel spatial clustering.
We evaluate the intelligibility of the output of each system using word error rate from a Kaldi automatic speech recognizer.
arXiv Detail & Related papers (2020-12-02T22:29:29Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring a tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reductions on overlapped speech constructed by simulation or replay of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
- Neural Speech Separation Using Spatially Distributed Microphones [19.242927805448154]
This paper proposes a neural network-based speech separation method using spatially distributed microphones.
Unlike in traditional microphone array settings, neither the number of microphones nor their spatial arrangement is known in advance.
Speech recognition experimental results show that the proposed method significantly outperforms baseline multi-channel speech separation systems.
arXiv Detail & Related papers (2020-04-28T17:16:31Z)
- Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the deep clustering (DC) network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z)
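The two companion entries above (Combining Spatial Clustering with LSTM Speech Models, and Enhancement of Spatial Clustering-Based Time-Frequency Masks) describe the mask-cleaning idea behind the main paper; the sketch below illustrates one plausible form of it. The architecture, layer sizes, and the choice to feed the network both the noisy log-magnitude spectrogram and the spatial-clustering mask are assumptions for illustration, not the papers' reported configuration.

```python
import torch
import torch.nn as nn

class MaskCleaner(nn.Module):
    """Hypothetical single-channel BLSTM that refines a spatial-clustering (MESSL) mask."""

    def __init__(self, n_freq=513, hidden=256):
        super().__init__()
        # Per-frame input: noisy log-magnitude and MESSL mask, concatenated.
        self.blstm = nn.LSTM(input_size=2 * n_freq, hidden_size=hidden,
                             num_layers=2, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_freq)

    def forward(self, log_mag, messl_mask):
        # log_mag, messl_mask: (batch, frames, n_freq) for a single channel.
        x, _ = self.blstm(torch.cat([log_mag, messl_mask], dim=-1))
        return torch.sigmoid(self.proj(x))   # cleaned mask in [0, 1]
```

Run independently on each channel, a model like this plays the role of the "multiple parallel single-channel LSTM speech enhancers" in the main abstract; the cleaned masks can then drive a mask-based MVDR beamformer such as the one sketched after the abstract.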
This list is automatically generated from the titles and abstracts of the papers on this site.