On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments
- URL: http://arxiv.org/abs/2011.05958v1
- Date: Wed, 11 Nov 2020 18:25:07 GMT
- Title: On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments
- Authors: Jisi Zhang, Catalin Zorila, Rama Doddipatla, Jon Barker
- Abstract summary: This paper introduces a new method for multi-channel time domain speech separation in reverberant environments.
A fully-convolutional neural network structure has been used to directly separate speech from multiple microphone recordings.
To reduce the influence of reverberation on spatial feature extraction, a dereverberation pre-processing method has been applied.
- Score: 33.79711018198589
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a new method for multi-channel time domain speech
separation in reverberant environments. A fully-convolutional neural network
structure has been used to directly separate speech from multiple microphone
recordings, with no need for conventional spatial feature extraction. To reduce
the influence of reverberation on spatial feature extraction, a dereverberation
pre-processing method has been applied to further improve the separation
performance. A spatialized version of wsj0-2mix dataset has been simulated to
evaluate the proposed system. Both source separation and speech recognition
performance of the separated signals have been evaluated objectively.
Experiments show that the proposed fully-convolutional network improves the
source separation metric and the word error rate (WER) by more than 13% and 50%
relative, respectively, over a reference system with conventional features.
Applying dereverberation as pre-processing to the proposed system can further
reduce the WER by 29% relative using an acoustic model trained on clean and
reverberated data.
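For readers unfamiliar with this family of models, the following is a minimal PyTorch sketch of a multi-channel, fully-convolutional time-domain separator in the spirit described above: a learned encoder applied to each microphone channel, a dilated-convolution mask estimator over the stacked channel features, and a decoder applied to a reference channel. All layer sizes, the masking scheme, and the class name are illustrative assumptions, not the authors' implementation; the dereverberation pre-processing mentioned in the abstract would be applied to the multi-channel mixture beforehand and is not shown.

```python
# Minimal sketch of a multi-channel, fully-convolutional time-domain separator:
# a learned analysis filterbank shared across microphones, a dilated-conv mask
# estimator over concatenated channel features, and a learned synthesis
# filterbank on the reference channel. Hyperparameters are illustrative only.
import torch
import torch.nn as nn

class MultiChannelTimeDomainSeparator(nn.Module):
    def __init__(self, n_mics=2, n_src=2, n_filters=256, kernel=16, stride=8,
                 hidden=256, layers=4):
        super().__init__()
        # Learned analysis filterbank, shared across microphones.
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
        # Fully-convolutional (dilated) mask estimator over stacked channel features.
        blocks = [nn.Conv1d(n_filters * n_mics, hidden, 1)]
        for d in range(layers):
            blocks += [nn.Conv1d(hidden, hidden, 3, padding=2 ** d, dilation=2 ** d),
                       nn.PReLU()]
        blocks.append(nn.Conv1d(hidden, n_filters * n_src, 1))
        self.separator = nn.Sequential(*blocks)
        # Learned synthesis filterbank back to the waveform domain.
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride, bias=False)
        self.n_src, self.n_filters = n_src, n_filters

    def forward(self, mix):                        # mix: (batch, n_mics, samples)
        B, M, _ = mix.shape
        feats = [torch.relu(self.encoder(mix[:, m:m + 1])) for m in range(M)]
        ref = feats[0]                             # reference-channel representation
        masks = torch.sigmoid(self.separator(torch.cat(feats, dim=1)))
        masks = masks.view(B, self.n_src, self.n_filters, -1)
        # Apply each source mask to the reference channel and decode to waveforms.
        est = [self.decoder(ref * masks[:, s]) for s in range(self.n_src)]
        return torch.stack(est, dim=1).squeeze(2)  # (batch, n_src, samples)

if __name__ == "__main__":
    model = MultiChannelTimeDomainSeparator()
    mixture = torch.randn(1, 2, 16000)             # 1 s of 2-channel audio at 16 kHz
    print(model(mixture).shape)                    # torch.Size([1, 2, 16000])
```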
Related papers
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Stepwise-Refining Speech Separation Network via Fine-Grained Encoding in High-order Latent Domain [34.23260020137834]
We propose the Stepwise-Refining Speech Separation Network (SRSSN), which follows a coarse-to-fine separation framework.
It first learns a 1-order latent domain to define an encoding space and thereby performs a rough separation in the coarse phase.
It then learns a new latent domain along each basis function of the existing latent domain to obtain a high-order latent domain in the refining phase.
arXiv Detail & Related papers (2021-10-10T13:21:16Z)
- Blind Room Parameter Estimation Using Multiple-Multichannel Speech Recordings [37.145413836886455]
Knowing the geometrical and acoustical parameters of a room may benefit applications such as audio augmented reality, speech dereverberation or audio forensics.
We study the problem of jointly estimating the total surface area, the volume, as well as the frequency-dependent reverberation time and mean surface absorption of a room.
A novel convolutional neural network architecture leveraging both single- and inter-channel cues is proposed and trained on a large, realistic simulated dataset.
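As a rough illustration of combining single- and inter-channel cues, the sketch below builds per-channel log-magnitude spectrograms and inter-channel phase differences and feeds them to a small CNN with a regression head over a handful of room parameters. The feature choice, layer sizes, and output dimensionality are assumptions made for illustration, not the architecture proposed in that paper.

```python
# Sketch: single-channel cues (per-channel log-magnitude spectrograms) plus
# inter-channel cues (phase differences to a reference mic) feeding a small
# CNN that regresses room parameters (e.g. area, volume, reverberation time,
# mean absorption). Feature and layer choices are illustrative assumptions.
import torch
import torch.nn as nn

def room_features(multichannel_wave, n_fft=512, hop=128):
    """multichannel_wave: (n_mics, samples) -> (2*n_mics - 1, freq, frames)."""
    spec = torch.stft(multichannel_wave, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    logmag = torch.log1p(spec.abs())               # single-channel cues
    ipd = torch.angle(spec[1:] * spec[:1].conj())  # inter-channel cues
    return torch.cat([logmag, ipd], dim=0)

class RoomParameterCNN(nn.Module):
    def __init__(self, in_ch=3, n_outputs=4):      # 4 outputs as a simplification
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, n_outputs)

    def forward(self, feats):                      # feats: (batch, in_ch, freq, frames)
        return self.head(self.conv(feats).flatten(1))

if __name__ == "__main__":
    wave = torch.randn(2, 32000)                   # 2-channel recording, 2 s at 16 kHz
    feats = room_features(wave).unsqueeze(0)
    print(RoomParameterCNN()(feats).shape)         # torch.Size([1, 4])
```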
arXiv Detail & Related papers (2021-07-29T08:51:49Z)
- PILOT: Introducing Transformers for Probabilistic Sound Event Localization [107.78964411642401]
This paper introduces a novel transformer-based sound event localization framework, where temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms.
The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy.
arXiv Detail & Related papers (2021-06-07T18:29:19Z)
- Time-Domain Speech Extraction with Spatial Information and Multi Speaker Conditioning Mechanism [27.19635746008699]
We present a novel multi-channel speech extraction system to simultaneously extract multiple clean individual sources from a mixture.
The proposed method is built on an improved multi-channel time-domain speech separation network.
Experiments on 2-channel WHAMR! data show that the proposed system improves the source separation performance by 9% relative over a strong multi-channel baseline.
arXiv Detail & Related papers (2021-02-07T10:11:49Z)
- Continuous Speech Separation with Conformer [60.938212082732775]
We use transformer and conformer architectures in lieu of recurrent neural networks in the separation system.
We believe capturing global information with self-attention is crucial for speech separation.
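As an illustration of swapping recurrent layers for a conformer, the sketch below uses torchaudio's Conformer as a stand-in mask estimator over STFT-magnitude frames of a mixture. The masking scheme and all dimensions are assumptions; this is not that paper's continuous speech separation recipe.

```python
# Illustrative sketch only: a conformer-based mask estimator that predicts
# per-speaker masks over STFT frames, showing the "conformer instead of RNN"
# idea. Dimensions and the masking scheme are assumptions.
import torch
import torch.nn as nn
from torchaudio.models import Conformer

class ConformerMaskEstimator(nn.Module):
    def __init__(self, n_freq=257, n_src=2, dim=256):
        super().__init__()
        self.proj_in = nn.Linear(n_freq, dim)
        self.conformer = Conformer(input_dim=dim, num_heads=4, ffn_dim=512,
                                   num_layers=4, depthwise_conv_kernel_size=31)
        self.proj_out = nn.Linear(dim, n_freq * n_src)
        self.n_src, self.n_freq = n_src, n_freq

    def forward(self, mag):                        # mag: (batch, frames, n_freq)
        B, T, _ = mag.shape
        lengths = torch.full((B,), T, dtype=torch.long)
        x, _ = self.conformer(self.proj_in(mag), lengths)
        masks = torch.sigmoid(self.proj_out(x)).view(B, T, self.n_src, self.n_freq)
        return masks.permute(0, 2, 1, 3)           # (batch, n_src, frames, n_freq)

if __name__ == "__main__":
    mag = torch.rand(1, 200, 257)                  # magnitude spectrogram of a mixture
    print(ConformerMaskEstimator()(mag).shape)     # torch.Size([1, 2, 200, 257])
```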
arXiv Detail & Related papers (2020-08-13T09:36:05Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% absolute (26.83% relative) and 22.22% absolute (56.87% relative) word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
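To make the absolute versus relative WER figures concrete, a quick back-of-the-envelope check: an absolute reduction of 6.81% that equals 26.83% relative implies a baseline WER of roughly 6.81 / 0.2683 ≈ 25.4%, and 22.22% absolute at 56.87% relative implies a baseline of roughly 39.1%. These baselines are derived from the quoted numbers, not taken from that paper.

```python
# Derive the implied baseline and improved WERs from the quoted
# absolute and relative reductions (derived figures, not from the paper).
for absolute, relative in [(6.81, 26.83), (22.22, 56.87)]:
    baseline = absolute / (relative / 100.0)
    print(f"baseline WER ~ {baseline:.1f}%, improved WER ~ {baseline - absolute:.1f}%")
```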
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
- Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the deep clustering (DC) network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences of its use.