Multi-microphone Complex Spectral Mapping for Utterance-wise and
Continuous Speech Separation
- URL: http://arxiv.org/abs/2010.01703v2
- Date: Mon, 24 May 2021 15:00:30 GMT
- Title: Multi-microphone Complex Spectral Mapping for Utterance-wise and
Continuous Speech Separation
- Authors: Zhong-Qiu Wang and Peidong Wang and DeLiang Wang
- Abstract summary: We propose multi-microphone complex spectral mapping for speaker separation in reverberant conditions.
Our system is trained on simulated room impulse responses based on a fixed number of microphones arranged in a given geometry.
State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.
- Score: 79.63545132515188
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose multi-microphone complex spectral mapping, a simple way of
applying deep learning for time-varying non-linear beamforming, for speaker
separation in reverberant conditions. We aim at both speaker separation and
dereverberation. Our study first investigates offline utterance-wise speaker
separation and then extends to block-online continuous speech separation (CSS).
Assuming a fixed array geometry between training and testing, we train deep
neural networks (DNN) to predict the real and imaginary (RI) components of
target speech at a reference microphone from the RI components of multiple
microphones. We then integrate multi-microphone complex spectral mapping with
minimum variance distortionless response (MVDR) beamforming and post-filtering
to further improve separation, and combine it with frame-level speaker counting
for block-online CSS. Although our system is trained on simulated room impulse
responses (RIR) based on a fixed number of microphones arranged in a given
geometry, it generalizes well to a real array with the same geometry.
State-of-the-art separation performance is obtained on the simulated two-talker
SMS-WSJ corpus and the real-recorded LibriCSS dataset.
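The abstract describes three components: a DNN that maps stacked real and imaginary (RI) STFT components of all microphones to the RI components of the target speaker at a reference microphone, an MVDR beamformer driven by the DNN estimates, and frame-level speaker counting for block-online CSS. The sketch below illustrates the first two steps only and is not the authors' implementation: the toy network `SpectralMappingNet`, the array/STFT dimensions, and the magnitude-ratio way of deriving spatial covariance matrices from the estimate are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn


class SpectralMappingNet(nn.Module):
    """Toy stand-in for the multi-microphone complex spectral mapping DNN.

    Input:  RI components of all microphones stacked along the channel
            axis, shape (batch, 2 * num_mics, frames, freq_bins).
    Output: RI components of one target speaker at the reference
            microphone, shape (batch, 2, frames, freq_bins).
    """

    def __init__(self, num_mics: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * num_mics, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, 2, kernel_size=3, padding=1),
        )

    def forward(self, ri_mix: torch.Tensor) -> torch.Tensor:
        return self.net(ri_mix)


def mvdr_weights(target_scm, noise_scm, ref_mic=0):
    """Per-frequency MVDR weights, w(f) = (Phi_n^-1 Phi_s / tr(Phi_n^-1 Phi_s)) u."""
    n_freq, n_mics, _ = target_scm.shape
    u = np.zeros(n_mics)
    u[ref_mic] = 1.0
    weights = np.zeros((n_freq, n_mics), dtype=complex)
    for f in range(n_freq):
        num = np.linalg.solve(noise_scm[f], target_scm[f])  # Phi_n^-1 Phi_s
        weights[f] = (num / np.trace(num)) @ u
    return weights


if __name__ == "__main__":
    num_mics, frames, freq_bins = 6, 100, 257  # assumed sizes for illustration
    net = SpectralMappingNet(num_mics)

    # Multichannel mixture STFT, shape (mics, frames, freq_bins); random here.
    mix = (np.random.randn(num_mics, frames, freq_bins)
           + 1j * np.random.randn(num_mics, frames, freq_bins))

    # Complex spectral mapping: stack RI components of all mics as input
    # channels, predict RI components of the target at the reference mic.
    ri_in = torch.from_numpy(
        np.concatenate([mix.real, mix.imag], axis=0)[None]).float()
    with torch.no_grad():
        ri_out = net(ri_in)[0].numpy()
    target_est = ri_out[0] + 1j * ri_out[1]           # (frames, freq_bins)

    # One common way to drive MVDR from the estimate (an assumption here, not
    # necessarily the paper's exact formulation): use a clipped magnitude ratio
    # as a TF weight and form weighted spatial covariance matrices.
    w = np.clip(np.abs(target_est) / (np.abs(mix[0]) + 1e-8), 0.0, 1.0)
    obs = mix.transpose(2, 1, 0)                      # (freq_bins, frames, mics)
    w_ft = w.T                                        # (freq_bins, frames)
    target_scm = np.einsum('ft,ftm,ftn->fmn', w_ft, obs, obs.conj())
    noise_scm = np.einsum('ft,ftm,ftn->fmn', 1.0 - w_ft, obs, obs.conj())
    noise_scm += 1e-6 * np.eye(num_mics)              # diagonal loading

    bf = mvdr_weights(target_scm, noise_scm)          # (freq_bins, mics)
    beamformed = np.einsum('fm,mtf->tf', bf.conj(), mix)
    print(beamformed.shape)                           # (frames, freq_bins)
```
In the paper's block-online CSS extension, this per-block separation would additionally be combined with frame-level speaker counting and stitching across blocks; that logic is omitted from the sketch.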
Related papers
- Online speaker diarization of meetings guided by speech separation [0.0]
Overlapped speech is notoriously problematic for speaker diarization systems.
We introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings.
arXiv Detail & Related papers (2024-01-30T09:09:22Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Multi-microphone Automatic Speech Segmentation in Meetings Based on Circular Harmonics Features [0.0]
We propose a new set of spatial features based on direction-of-arrival estimations in the circular harmonic domain (CH-DOA).
Experiments on the AMI meeting corpus show that CH-DOA can improve the segmentation while being robust in the case of deactivated microphones.
arXiv Detail & Related papers (2023-06-07T09:09:00Z)
- UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures [60.879679764741624]
In reverberant conditions, each microphone acquires a mixture signal of multiple speakers at a different location.
We propose UNSSOR, an algorithm for unsupervised neural speech separation by leveraging over-determined training mixtures.
We show that this loss can promote unsupervised separation of speakers.
arXiv Detail & Related papers (2023-05-31T17:28:02Z)
- Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning [9.84949849886926]
We propose Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation.
The new SE-Conformer network can model audio sequences in multiple dimensions and scales.
arXiv Detail & Related papers (2023-03-07T08:53:20Z)
- Continuous Speech Separation with Ad Hoc Microphone Arrays [35.87274524040486]
Speech separation has been shown effective for multi-talker speech recognition.
In this paper, we extend this approach to continuous speech separation.
Two methods are proposed to mitigate a speech duplication problem during single-talker segments.
arXiv Detail & Related papers (2021-03-03T13:01:08Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replay of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)