Continuous speech separation: dataset and analysis
- URL: http://arxiv.org/abs/2001.11482v3
- Date: Thu, 7 May 2020 09:13:27 GMT
- Title: Continuous speech separation: dataset and analysis
- Authors: Zhuo Chen, Takuya Yoshioka, Liang Lu, Tianyan Zhou, Zhong Meng, Yi
Luo, Jian Wu, Xiong Xiao, Jinyu Li
- Abstract summary: In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
- Score: 52.10378896407332
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes a dataset and protocols for evaluating continuous speech
separation algorithms. Most prior studies on speech separation use
pre-segmented signals of artificially mixed speech utterances which are mostly
fully overlapped, and the algorithms are evaluated based on
signal-to-distortion ratio or similar performance metrics. However, in natural
conversations, a speech signal is continuous, containing both overlapped and
overlap-free components. In addition, the signal-based metrics have very weak
correlations with automatic speech recognition (ASR) accuracy. We think that
not only does this make it hard to assess the practical relevance of the tested
algorithms, it also hinders researchers from developing systems that can be
readily applied to real scenarios. In this paper, we define continuous speech
separation (CSS) as a task of generating a set of non-overlapped speech signals
from a continuous audio stream that contains multiple utterances that
are partially overlapped by a varying degree. A new real recorded
dataset, called LibriCSS, is derived from LibriSpeech by concatenating the
corpus utterances to simulate a conversation and capturing the audio replays
with far-field microphones. A Kaldi-based ASR evaluation protocol is also
established by using a well-trained multi-conditional acoustic model. By using
this dataset, several aspects of a recently proposed speaker-independent CSS
algorithm are investigated. The dataset and evaluation scripts are available to
facilitate the research in this direction.
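To make the CSS setup concrete, here is a minimal sketch (not part of the LibriCSS release) of how a continuous stream with partially overlapped utterances can be simulated from pre-segmented single-speaker recordings; the utterance list, overlap ratio, and 16 kHz sampling rate are illustrative assumptions.

```python
import numpy as np

def simulate_css_stream(utterances, overlap_ratio=0.3):
    """Concatenate single-speaker utterances into one continuous stream,
    overlapping each utterance with the tail of the previous audio by
    `overlap_ratio` of its own length (0.0 reduces to plain concatenation)."""
    stream = np.zeros(0, dtype=np.float32)
    for utt in utterances:
        utt = np.asarray(utt, dtype=np.float32)
        overlap = min(int(len(utt) * overlap_ratio), len(stream))
        start = len(stream) - overlap              # onset of the new utterance
        out = np.zeros(start + len(utt), dtype=np.float32)
        out[: len(stream)] = stream
        out[start : start + len(utt)] += utt       # additive mix in the overlapped region
        stream = out
    return stream

# Toy usage with noise bursts standing in for LibriSpeech utterances.
rng = np.random.default_rng(0)
utts = [0.1 * rng.standard_normal(int(rng.integers(2, 5)) * 16000) for _ in range(4)]
mixture = simulate_css_stream(utts, overlap_ratio=0.3)
print(f"{len(mixture) / 16000:.1f} s of continuous, partially overlapped audio")
```

In LibriCSS itself, the concatenated audio is additionally replayed in a room and re-recorded with far-field microphones rather than being mixed purely digitally.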
Related papers
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z) - Zero-shot text-to-speech synthesis conditioned using self-supervised
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data.
The disentangled embeddings will enable us to achieve better reproduction performance for unseen speakers and rhythm transfer conditioned on different speech samples.
arXiv Detail & Related papers (2023-04-24T10:15:58Z) - Audio-text Retrieval in Context [24.38055340045366]
In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment.
We build our contextual audio-text retrieval system using pre-trained audio features and a descriptor-based aggregation method.
Our proposed system achieves a significant improvement on bidirectional audio-text retrieval across all metrics, including recall, median rank, and mean rank.
arXiv Detail & Related papers (2022-03-25T13:41:17Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - Continuous Speech Separation with Ad Hoc Microphone Arrays [35.87274524040486]
Speech separation has been shown effective for multi-talker speech recognition.
In this paper, we extend this approach to continuous speech separation.
Two methods are proposed to mitigate a speech duplication problem during single-talker segments.
arXiv Detail & Related papers (2021-03-03T13:01:08Z) - Multi-microphone Complex Spectral Mapping for Utterance-wise and
Continuous Speech Separation [79.63545132515188]
We propose multi-microphone complex spectral mapping for speaker separation in reverberant conditions.
Our system is trained on simulated room impulse responses based on a fixed number of microphones arranged in a given geometry.
State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.
arXiv Detail & Related papers (2020-10-04T22:13:13Z) - Evaluating the reliability of acoustic speech embeddings [10.5754802112615]
- Evaluating the reliability of acoustic speech embeddings [10.5754802112615]
Speech embeddings are fixed-size acoustic representations of variable-length speech sequences.
Here, we systematically compare two popular metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages across 17 embedding methods.
We find that overall, ABX and MAP correlate with one another and with frequency estimation.
arXiv Detail & Related papers (2020-07-27T13:24:09Z) - Unsupervised Cross-Modal Audio Representation Learning from Unstructured
- Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning.
Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness.
We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
arXiv Detail & Related papers (2020-03-27T07:37:15Z) - End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice
- End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection [48.80449801938696]
This paper integrates a voice activity detection function with end-to-end automatic speech recognition.
We focus on connectionist temporal classification (CTC) and its extension, the hybrid CTC/attention architecture.
We use the labels as a cue for detecting speech segments with simple thresholding.
arXiv Detail & Related papers (2020-02-03T03:36:34Z)