Neural Speech Separation Using Spatially Distributed Microphones
- URL: http://arxiv.org/abs/2004.13670v1
- Date: Tue, 28 Apr 2020 17:16:31 GMT
- Title: Neural Speech Separation Using Spatially Distributed Microphones
- Authors: Dongmei Wang, Zhuo Chen and Takuya Yoshioka
- Abstract summary: This paper proposes a neural network based speech separation method using spatially distributed microphones.
Unlike in traditional microphone array settings, neither the number of microphones nor their spatial arrangement is known in advance.
Speech recognition experimental results show that the proposed method significantly outperforms baseline multi-channel speech separation systems.
- Score: 19.242927805448154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a neural network based speech separation method using
spatially distributed microphones. Unlike in traditional microphone array
settings, neither the number of microphones nor their spatial arrangement is
known in advance, which hinders the use of conventional multi-channel speech
separation neural networks based on a fixed-size input. To overcome this, a novel
network architecture is proposed that interleaves inter-channel processing
layers and temporal processing layers. The inter-channel processing layers
apply a self-attention mechanism along the channel dimension to exploit the
information obtained with a varying number of microphones. The temporal
processing layers are based on a bidirectional long short-term memory (BLSTM)
model and applied to each channel independently. The proposed network leverages
information across time and space by stacking these two kinds of layers
alternately. Our network estimates time-frequency (TF) masks for each speaker,
which are then used to generate enhanced speech signals either with TF masking
or beamforming. Speech recognition experimental results show that the proposed
method significantly outperforms baseline multi-channel speech separation
systems.
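To make the interleaving concrete, the following is a minimal PyTorch sketch of such an architecture. The class names (`SeparationBlock`, `MaskNet`), layer counts, dimensions, and the output-side channel pooling are illustrative assumptions, not the paper's actual configuration.
```python
import torch
import torch.nn as nn

class SeparationBlock(nn.Module):
    """One inter-channel self-attention layer followed by one per-channel BLSTM."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.blstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, x):
        # x: (batch, channels, frames, dim)
        b, c, t, d = x.shape
        # Inter-channel processing: self-attention along the channel axis, so each
        # (batch, frame) position attends across however many microphones exist.
        y = x.permute(0, 2, 1, 3).reshape(b * t, c, d)
        y, _ = self.attn(y, y, y)
        x = x + y.reshape(b, t, c, d).permute(0, 2, 1, 3)
        # Temporal processing: a BLSTM applied to each channel independently.
        z = x.reshape(b * c, t, d)
        z, _ = self.blstm(z)
        return z.reshape(b, c, t, d)

class MaskNet(nn.Module):
    def __init__(self, n_freq=257, dim=256, n_blocks=3, n_spk=2):
        super().__init__()
        self.proj = nn.Linear(n_freq, dim)
        self.blocks = nn.ModuleList([SeparationBlock(dim) for _ in range(n_blocks)])
        self.head = nn.Linear(dim, n_freq * n_spk)
        self.n_spk, self.n_freq = n_spk, n_freq

    def forward(self, mag):
        # mag: (batch, channels, frames, n_freq) STFT magnitudes; the channel
        # count can vary between utterances because no layer depends on it.
        x = self.proj(mag)
        for blk in self.blocks:
            x = blk(x)
        # Pooling over channels is one plausible way to get channel-count-invariant
        # per-speaker TF masks (an assumption, not necessarily the paper's choice).
        masks = torch.sigmoid(self.head(x)).mean(dim=1)
        b, t, _ = masks.shape
        return masks.view(b, t, self.n_spk, self.n_freq)
```
Because neither the attention nor the per-channel BLSTM hard-codes the number of channels, the same weights can be applied to any microphone count, which is the property the abstract highlights.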
Related papers
- MIMO-DBnet: Multi-channel Input and Multiple Outputs DOA-aware
Beamforming Network for Speech Separation [55.533789120204055]
We propose an end-to-end beamforming network for direction-guided speech separation given only the mixture signal.
Specifically, we design a multi-channel input, multiple output architecture that predicts direction-of-arrival based embeddings and beamforming weights for each source.
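As a rough illustration of the multiple-output idea (not MIMO-DBnet's actual interface), once a network has predicted complex beamforming weights per source, each source is recovered by a filter-and-sum over channels:
```python
import numpy as np

def filter_and_sum(stft, weights):
    """Apply per-source beamforming weights to a multi-channel mixture.

    stft:    (channels, frames, freqs) complex mixture STFT
    weights: (sources, channels, freqs) complex weights (e.g., network outputs)
    returns: (sources, frames, freqs) separated source STFTs
    """
    # y_s(t, f) = sum_c conj(w_{s, c, f}) * x_c(t, f)
    return np.einsum('scf,ctf->stf', weights.conj(), stft)
```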
arXiv Detail & Related papers (2022-12-07T01:52:40Z)
- MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances [94.70787497137854]
We propose a multi-scale frequency-channel attention (MFA) to characterize speakers at different scales through a novel dual-path design consisting of a convolutional neural network and a TDNN.
We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and complexity.
arXiv Detail & Related papers (2022-02-03T14:57:05Z)
- Multi-Channel End-to-End Neural Diarization with Distributed Microphones [53.99406868339701]
We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input.
We also propose a model adaptation method using only single-channel recordings.
arXiv Detail & Related papers (2021-10-10T03:24:03Z)
- Continuous Speech Separation with Ad Hoc Microphone Arrays [35.87274524040486]
Speech separation has been shown effective for multi-talker speech recognition.
In this paper, we extend this approach to continuous speech separation.
Two methods are proposed to mitigate a speech duplication problem during single-talker segments.
arXiv Detail & Related papers (2021-03-03T13:01:08Z)
- Improved MVDR Beamforming Using LSTM Speech Models to Clean Spatial Clustering Masks [14.942060304734497]
Spatial clustering techniques can achieve significant multi-channel noise reduction across relatively arbitrary microphone configurations.
LSTM neural networks have been trained successfully to separate speech from noise on single-channel inputs, but have difficulty taking full advantage of the information in multi-channel recordings.
This paper integrates these two approaches, training LSTM speech models to clean the masks generated by the Model-based EM Source Separation and Localization (MESSL) spatial clustering method.
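For context, the mask-based MVDR recipe this line of work builds on looks roughly like the sketch below (Souden-style formulation); here the masks would come from MESSL after LSTM cleaning. Function names and the diagonal-loading constant are illustrative.
```python
import numpy as np

def mask_based_mvdr(stft, speech_mask, noise_mask, ref_ch=0):
    """stft: (channels, frames, freqs) complex; masks: (frames, freqs) in [0, 1]."""
    C, T, F = stft.shape
    out = np.zeros((T, F), dtype=complex)
    for f in range(F):
        X = stft[:, :, f]                                # (C, T)
        Rs = (speech_mask[:, f] * X) @ X.conj().T / T    # speech spatial covariance
        Rn = (noise_mask[:, f] * X) @ X.conj().T / T     # noise spatial covariance
        # Diagonal loading guards against ill-conditioned covariances on short data.
        Rn += 1e-6 * np.trace(Rn).real / C * np.eye(C)
        num = np.linalg.solve(Rn, Rs)                    # R_n^{-1} R_s
        w = num[:, ref_ch] / np.trace(num)               # MVDR weights, reference channel
        out[:, f] = w.conj() @ X
    return out
```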
arXiv Detail & Related papers (2020-12-02T22:35:00Z)
- Enhancement of Spatial Clustering-Based Time-Frequency Masks using LSTM Neural Networks [3.730592618611028]
We use LSTMs to enhance spatial clustering-based time-frequency masks.
We achieve both the signal modeling performance of multiple single-channel LSTM-DNN speech enhancers and the signal separation performance and generality of multi-channel spatial clustering.
We evaluate the intelligibility of the output of each system using word error rate from a Kaldi automatic speech recognizer.
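A minimal sketch of the mask-enhancement idea, assuming the LSTM sees the rough spatial-clustering mask alongside the mixture log-magnitude; the bidirectional choice and all dimensions are guesses rather than the paper's configuration:
```python
import torch
import torch.nn as nn

class MaskRefiner(nn.Module):
    """Map a rough spatial-clustering TF mask to a refined mask."""
    def __init__(self, n_freq=257, hidden=300):
        super().__init__()
        # Bidirectional here is an assumption; a causal system would use uni-directional.
        self.lstm = nn.LSTM(2 * n_freq, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_freq)

    def forward(self, rough_mask, log_mag):
        # rough_mask, log_mag: (batch, frames, n_freq)
        h, _ = self.lstm(torch.cat([rough_mask, log_mag], dim=-1))
        return torch.sigmoid(self.out(h))  # refined mask in [0, 1]
```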
arXiv Detail & Related papers (2020-12-02T22:29:29Z)
- Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in noisy real-world environments.
We implement this algorithm in a real-time robotic system with a microphone array.
The experimental results show a mean azimuth error of 13 degrees, which surpasses the accuracy of other biologically plausible neuromorphic approaches to sound source localization.
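The SNN itself is hard to condense here, but as a classical non-spiking point of reference (explicitly not the paper's method), the same ITD-to-azimuth mapping can be computed with GCC-PHAT:
```python
import numpy as np

def gcc_phat_itd(x1, x2, fs):
    """Estimate the relative delay (seconds) between two channels via GCC-PHAT."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    # Phase transform: keep only phase information, discard magnitude.
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
    shift = np.argmax(np.concatenate((cc[-n // 2:], cc[:n // 2]))) - n // 2
    return shift / fs

def itd_to_azimuth(itd, mic_distance, c=343.0):
    """Far-field azimuth (degrees) from ITD for a two-microphone pair."""
    return np.degrees(np.arcsin(np.clip(itd * c / mic_distance, -1.0, 1.0)))
```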
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replay of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
- DNN-Based Distributed Multichannel Mask Estimation for Speech Enhancement in Microphone Arrays [0.0]
We propose to extend the distributed adaptive node-specific signal estimation approach to a neural network framework.
In an array of two nodes, we show that this additional signal can be efficiently taken into account to predict the masks, leading to better speech enhancement performance.
arXiv Detail & Related papers (2020-02-13T11:08:00Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from the multi-speaker mixture.
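Direction information is typically injected as a feature comparing observed inter-channel phase differences with the phase delays a given target direction would produce. Below is a generic sketch of such an "angle feature" (not necessarily the paper's exact formulation); all names and shapes are illustrative.
```python
import numpy as np

def angle_feature(stft, mic_pos, target_dir, fs, n_fft, c=343.0):
    """stft: (channels, frames, freqs) complex; mic_pos: (channels, 3) metres;
    target_dir: unit 3-vector pointing toward the target speaker."""
    C, T, F = stft.shape
    freqs = np.arange(F) * fs / n_fft                             # Hz per STFT bin
    # Theoretical delays of each mic relative to mic 0 for the target direction.
    tau = (mic_pos - mic_pos[0]) @ target_dir / c                 # (C,)
    steer = np.exp(-2j * np.pi * freqs[None, :] * tau[:, None])   # (C, F) steering phases
    ipd = stft * np.conj(stft[0:1])                               # observed IPD vs mic 0
    # Cosine similarity between observed and theoretical phase differences:
    # close to 1 in TF bins dominated by the target direction.
    return np.cos(np.angle(ipd) - np.angle(steer)[:, None, :]).mean(axis=0)  # (T, F)
```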
arXiv Detail & Related papers (2020-01-02T11:12:50Z)