VarArray: Array-Geometry-Agnostic Continuous Speech Separation
- URL: http://arxiv.org/abs/2110.05745v1
- Date: Tue, 12 Oct 2021 05:31:46 GMT
- Title: VarArray: Array-Geometry-Agnostic Continuous Speech Separation
- Authors: Takuya Yoshioka, Xiaofei Wang, Dongmei Wang, Min Tang, Zirun Zhu, Zhuo
Chen, Naoyuki Kanda
- Abstract summary: Continuous speech separation using a microphone array was shown to be promising in dealing with the speech overlap problem in natural conversation transcription.
This paper proposes VarArray, an array-geometry-agnostic speech separation neural network model.
- Score: 26.938313513582642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continuous speech separation using a microphone array was shown to be
promising in dealing with the speech overlap problem in natural conversation
transcription. This paper proposes VarArray, an array-geometry-agnostic speech
separation neural network model. The proposed model is applicable to any number
of microphones without retraining while leveraging the nonlinear correlation
between the input channels. The proposed method adapts different elements that
were proposed before separately, including transform-average-concatenate,
conformer speech separation, and inter-channel phase differences, and combines
them in an efficient and cohesive way. Large-scale evaluation was performed
with two real meeting transcription tasks by using a fully developed
transcription system requiring no prior knowledge such as reference
segmentations, which allowed us to measure the impact that the continuous
speech separation system could have in realistic settings. The proposed model
outperformed a previous approach to array-geometry-agnostic modeling for all of
the geometry configurations considered, achieving asclite-based
speaker-agnostic word error rates of 17.5% and 20.4% for the AMI development
and evaluation sets, respectively, in the end-to-end setting using no
ground-truth segmentations.
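The abstract names transform-average-concatenate (TAC) and inter-channel phase differences (IPDs) as the components that let one model handle any number of microphones. The NumPy sketch below is a generic illustration of those two ideas, not the VarArray architecture itself. The function names, layer sizes, ReLU activations, and random weights are all assumptions made for the example.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def tac_layer(x, w_transform, w_average, w_concat):
    """Generic transform-average-concatenate step (illustrative only).

    x has shape (channels, frames, features). Every weight matrix is shared
    across channels, so no parameter depends on the number of microphones.
    """
    # Transform: apply the same projection to each channel.
    h = relu(x @ w_transform)                        # (C, T, H)
    # Average: pool over the channel axis, then project the pooled summary.
    pooled = relu(h.mean(axis=0) @ w_average)        # (T, H)
    pooled = np.broadcast_to(pooled, h.shape)        # (C, T, H)
    # Concatenate: give each channel access to the cross-channel summary.
    merged = np.concatenate([h, pooled], axis=-1)    # (C, T, 2H)
    return relu(merged @ w_concat)                   # (C, T, D)


def inter_channel_phase_difference(stft, ref=0):
    """Phase of each channel relative to a reference channel, in radians.

    stft is a complex array of shape (channels, frames, bins).
    """
    return np.angle(stft * np.conj(stft[ref]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat, hidden = 8, 16  # hypothetical sizes
    w_t = rng.standard_normal((feat, hidden))
    w_a = rng.standard_normal((hidden, hidden))
    w_c = rng.standard_normal((2 * hidden, feat))

    # The same weights process 2, 3, or 7 microphones without retraining.
    for num_mics in (2, 3, 7):
        x = rng.standard_normal((num_mics, 5, feat))
        print(num_mics, tac_layer(x, w_t, w_a, w_c).shape)
```

Because the only cross-channel operation is a mean, the layer is indifferent to how many channels it receives and to their order, which is the property that makes a geometry-agnostic model possible in this sketch.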
Related papers
- VarArray Meets t-SOT: Advancing the State of the Art of Streaming
Distant Conversational Speech Recognition [36.580955189182404]
This paper presents a novel streaming automatic speech recognition (ASR) framework for multi-talker overlapping speech captured by a distant microphone array with an arbitrary geometry.
Our framework, named t-SOT-VA, capitalizes on two recent, independently developed technologies: array-geometry-agnostic continuous speech separation (VarArray) and streaming multi-talker ASR based on token-level serialized output training (t-SOT).
Our system achieves the state-of-the-art word error rates of 13.7% and 15.5% for the AMI development and evaluation sets, respectively, in the multiple-distant
arXiv Detail & Related papers (2022-09-12T01:22:04Z)
- Bi-LSTM Scoring Based Similarity Measurement with Agglomerative
Hierarchical Clustering (AHC) for Speaker Diarization [0.0]
A typical conversation between two speakers consists of segments where their voices overlap, interrupt each other or halt their speech in between multiple sentences.
Recent advancements in diarization technology leverage neural network-based approaches to improve speaker diarization systems.
We propose a Bi-directional Long Short-term Memory network for estimating the elements present in the similarity matrix.
arXiv Detail & Related papers (2022-05-19T17:20:51Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Single-channel speech separation using Soft-minimum Permutation
Invariant Training [60.99112031408449]
A long-standing problem in supervised speech separation is finding the correct label for each separated speech signal.
Permutation Invariant Training (PIT) has been shown to be a promising solution in handling the label ambiguity problem.
In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment.
arXiv Detail & Related papers (2021-11-16T17:25:05Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global
Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Continuous Speech Separation with Ad Hoc Microphone Arrays [35.87274524040486]
Speech separation has been shown to be effective for multi-talker speech recognition.
In this paper, we extend this approach to continuous speech separation.
Two methods are proposed to mitigate a speech duplication problem during single-talker segments.
arXiv Detail & Related papers (2021-03-03T13:01:08Z)
- End-to-End Speaker Diarization as Post-Processing [64.12519350944572]
Clustering-based diarization methods partition frames into as many clusters as there are speakers.
Some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification.
We propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method.
arXiv Detail & Related papers (2020-12-18T05:31:07Z)
- Continuous Speech Separation with Conformer [60.938212082732775]
We use transformer and conformer in lieu of recurrent neural networks in the separation system.
We believe that capturing global information with a self-attention-based method is crucial for speech separation.
arXiv Detail & Related papers (2020-08-13T09:36:05Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)