Continuous Speech Separation with Ad Hoc Microphone Arrays
- URL: http://arxiv.org/abs/2103.02378v1
- Date: Wed, 3 Mar 2021 13:01:08 GMT
- Title: Continuous Speech Separation with Ad Hoc Microphone Arrays
- Authors: Dongmei Wang, Takuya Yoshioka, Zhuo Chen, Xiaofei Wang, Tianyan Zhou,
Zhong Meng
- Abstract summary: Speech separation has been shown effective for multi-talker speech recognition.
In this paper, we extend this approach to continuous speech separation.
Two methods are proposed to mitigate a speech duplication problem during single talker segments.
- Score: 35.87274524040486
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech separation has been shown effective for multi-talker speech
recognition. Under the ad hoc microphone array setup where the array consists
of spatially distributed asynchronous microphones, additional challenges must
be overcome as the geometry and number of microphones are unknown beforehand.
Prior studies show that, with a spatial-temporal interleaving structure, neural
networks can efficiently utilize the multi-channel signals of the ad hoc array.
In this paper, we further extend this approach to continuous speech separation.
Several techniques are introduced to enable speech separation for real
continuous recordings. First, we apply a transformer-based network for
spatio-temporal modeling of the ad hoc array signals. In addition, two methods
are proposed to mitigate a speech duplication problem during single talker
segments, which seems more severe in the ad hoc array scenarios. One method is
device distortion simulation for reducing the acoustic mismatch between
simulated training data and real recordings. The other is speaker counting to
detect the single speaker segments and merge the output signal channels.
Experimental results for AdHoc-LibriCSS, a new dataset consisting of continuous
recordings of concatenated LibriSpeech utterances obtained by multiple
different devices, show that the proposed separation method can significantly
improve the ASR accuracy for overlapped speech with little performance
degradation for single talker segments.
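The two duplication mitigation methods lend themselves to a short illustration. Below is a minimal sketch, not the authors' implementation: one function roughly mimics device distortion when generating simulated training data, and another merges the separator's two output channels when a single active talker is detected. The function names, the specific distortions, the frame/threshold values, and the energy-based counter are all assumptions for illustration; the paper uses a learned speaker counter.

```python
import numpy as np
from scipy.signal import lfilter

def simulate_device_distortion(x, rng):
    """Apply crude device-like distortions to a clean simulated waveform:
    random gain, soft clipping, and a random one-pole low-pass standing in
    for an unknown device frequency response.  Illustrative only."""
    x = x * rng.uniform(0.3, 1.5)            # random capture gain
    x = np.tanh(3.0 * x) / 3.0               # soft clipping nonlinearity
    alpha = rng.uniform(0.6, 0.95)           # low-pass pole location
    return lfilter([1.0 - alpha], [1.0, -alpha], x)

def merge_if_single_talker(ch0, ch1, frame=4000, thresh=0.1):
    """Merge the two separator outputs when no frame carries comparable
    energy on both channels, i.e. when a single talker appears active.
    Stand-in for the paper's learned speaker-counting module."""
    n = (min(len(ch0), len(ch1)) // frame) * frame
    e0 = np.square(ch0[:n]).reshape(-1, frame).mean(axis=1)
    e1 = np.square(ch1[:n]).reshape(-1, frame).mean(axis=1)
    overlapped = np.any((e0 > thresh * e0.max()) & (e1 > thresh * e1.max()))
    return (ch0, ch1) if overlapped else (ch0 + ch1, np.zeros_like(ch1))
```

In a training pipeline of this shape, `simulate_device_distortion` (with, e.g., `rng = np.random.default_rng(0)`) would be applied per simulated channel before mixing, so the training distribution better matches real consumer-device recordings; the merge step would run on the separator outputs at inference time.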
Related papers
- LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization [31.01716151301142]
We present a large-scale far-field overlapping speech dataset to advance research in speech separation, recognition, and speaker diarization.
This dataset is a critical resource for decoding "Who said What and When" in multi-talker, reverberant environments.
arXiv Detail & Related papers (2024-09-01T19:23:08Z)
- Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition [15.610658840718607]
We propose a mixture encoder to mitigate the effect of artifacts introduced by the speech separation.
We extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps.
Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder.
arXiv Detail & Related papers (2023-09-15T14:57:28Z)
- Multi-microphone Automatic Speech Segmentation in Meetings Based on Circular Harmonics Features [0.0]
We propose a new set of spatial features based on direction-of-arrival estimations in the circular harmonic domain (CH-DOA).
Experiments on the AMI meeting corpus show that CH-DOA can improve the segmentation while being robust in the case of deactivated microphones.
arXiv Detail & Related papers (2023-06-07T09:09:00Z)
- Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning [9.84949849886926]
We propose Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation.
The new SE-Conformer network can model audio sequences in multiple dimensions and scales.
arXiv Detail & Related papers (2023-03-07T08:53:20Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- VarArray: Array-Geometry-Agnostic Continuous Speech Separation [26.938313513582642]
Continuous speech separation using a microphone array was shown to be promising in dealing with the speech overlap problem in natural conversation transcription.
This paper proposes VarArray, an array-geometry-agnostic speech separation neural network model.
arXiv Detail & Related papers (2021-10-12T05:31:46Z)
- Multi-Channel End-to-End Neural Diarization with Distributed Microphones [53.99406868339701]
We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input.
We also propose a model adaptation method using only single-channel recordings.
arXiv Detail & Related papers (2021-10-10T03:24:03Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation [79.63545132515188]
We propose multi-microphone complex spectral mapping for speaker separation in reverberant conditions.
Our system is trained on simulated room impulse responses based on a fixed number of microphones arranged in a given geometry.
State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.
arXiv Detail & Related papers (2020-10-04T22:13:13Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reductions on overlapped speech constructed using either simulation or replaying of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
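The main paper and several entries above (VarArray, the LibriCSS line of work, and the dataset paper just cited) share the continuous-input setting: an unsegmented recording is processed in sliding windows and the per-window outputs are stitched into a fixed number of continuous channels. Below is a hedged sketch of that stitching step, assuming a `separate_fn` that returns two channels per window; the window/hop sizes, the correlation-based permutation alignment, and the cross-fade are illustrative choices, not any specific paper's method.

```python
import numpy as np

def stitch_windows(separate_fn, mixture, win=48000, hop=24000):
    """Continuous separation by sliding-window processing.

    `separate_fn` maps a window of samples to a (2, win) array of
    separated channels.  Adjacent windows overlap by `win - hop`
    samples; each new window's channel order is aligned to the
    previous window by correlation over the overlap, then the
    overlap is cross-faded.  Any final partial window is skipped
    for brevity.  Illustrative sketch only.
    """
    out = np.zeros((2, len(mixture)))
    prev = None
    for start in range(0, len(mixture) - win + 1, hop):
        cur = separate_fn(mixture[start:start + win])   # shape (2, win)
        if prev is None:
            out[:, start:start + win] = cur
        else:
            ov = win - hop
            # keep or swap channel order, whichever matches the
            # previous window's tail better on the overlapped samples
            keep = np.sum(prev[:, -ov:] * cur[:, :ov])
            swap = np.sum(prev[:, -ov:] * cur[::-1, :ov])
            if swap > keep:
                cur = cur[::-1]
            fade = np.linspace(0.0, 1.0, ov)
            out[:, start:start + ov] = (
                out[:, start:start + ov] * (1.0 - fade) + cur[:, :ov] * fade
            )
            out[:, start + ov:start + win] = cur[:, ov:]
        prev = cur
    return out
```

Fixing two output channels mirrors the common LibriCSS-style assumption that at most two speakers overlap at any instant, which is also why the speaker-counting merge in the main paper matters during single talker segments.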