Directed Speech Separation for Automatic Speech Recognition of Long Form
Conversational Speech
- URL: http://arxiv.org/abs/2112.05863v1
- Date: Fri, 10 Dec 2021 23:07:48 GMT
- Title: Directed Speech Separation for Automatic Speech Recognition of Long Form
Conversational Speech
- Authors: Rohit Paturi, Sundararajan Srinivasan, Katrin Kirchhoff
- Abstract summary: We propose a speaker-conditioned separator trained on speaker embeddings extracted directly from the mixed signal.
We achieve significant improvements in word error rate (WER) on real conversational data without the need for an additional re-stitching step.
- Score: 10.291482850329892
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many of the recent advances in speech separation are primarily aimed at
synthetic mixtures of short audio utterances with high degrees of overlap.
These datasets differ significantly from real conversational data, and hence
models trained and evaluated on them do not generalize to real conversational
scenarios. Another issue with using most of these models for long-form speech
is the nondeterministic ordering of the separated speech segments, caused
either by unsupervised clustering of time-frequency masks or by the
permutation invariant training (PIT) loss. This makes it difficult to
accurately stitch together homogeneous speaker segments for downstream tasks
such as Automatic Speech Recognition (ASR). In this paper, we propose a
speaker-conditioned separator trained on speaker embeddings extracted directly
from the mixed signal. We train this model using a directed loss which
regulates the order of the separated segments. With this model, we achieve
significant improvements in word error rate (WER) on real conversational data
without the need for an additional re-stitching step.
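To make the ordering problem concrete, below is a minimal, hypothetical sketch (not the paper's implementation) contrasting a permutation invariant loss, which leaves the channel order undetermined, with a directed loss that ties each output channel to a fixed reference speaker, e.g. the order implied by the conditioning speaker embeddings; the shapes and toy data are assumptions.

```python
# Hypothetical sketch: permutation-invariant vs. speaker-directed separation loss.
# Not the paper's code; shapes and ordering convention are assumptions.
from itertools import permutations
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def pit_loss(est, ref):
    """Permutation Invariant Training: score every speaker ordering and keep
    the best one, so the output order is not determined in advance."""
    n_spk = est.shape[0]
    return min(
        np.mean([mse(est[i], ref[p[i]]) for i in range(n_spk)])
        for p in permutations(range(n_spk))
    )

def directed_loss(est, ref):
    """Directed loss: output channel i must reconstruct reference speaker i
    (the order implied by the conditioning speaker embeddings), so separated
    streams can be concatenated across segments without re-stitching."""
    return np.mean([mse(est[i], ref[i]) for i in range(est.shape[0])])

# Toy example: 2 speakers, 16k samples each.
rng = np.random.default_rng(0)
ref = rng.standard_normal((2, 16000))
est_swapped = ref[::-1]                 # correct separation, wrong channel order
print(pit_loss(est_swapped, ref))       # ~0: PIT does not penalize the swap
print(directed_loss(est_swapped, ref))  # large: directed loss enforces order
```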
Related papers
- Online speaker diarization of meetings guided by speech separation [0.0]
Overlapped speech is notoriously problematic for speaker diarization systems.
We introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings.
arXiv Detail & Related papers (2024-01-30T09:09:22Z)
- Improved Long-Form Speech Recognition by Jointly Modeling the Primary and Non-primary Speakers [35.32552447347255]
We introduce a novel technique to simultaneously model different groups of speakers in the audio along with the standard transcript tokens.
Speakers are grouped as primary and non-primary, which connects the application domains.
This improved model neither needs any additional training data nor incurs additional training or inference cost.
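As an illustration only, one way such a grouping could be serialized alongside the transcript is with role tags interleaved with the ordinary word tokens; the tag names and format below are assumptions, not taken from the paper.

```python
# Hypothetical serialization of a transcript with primary / non-primary
# speaker-group tags; the tag names and layout are illustrative assumptions.
segments = [
    ("primary", "thanks for calling how can i help"),
    ("non_primary", "hi i would like to change my flight"),
    ("primary", "sure can i have your booking reference"),
]

def serialize(segments):
    tokens = []
    for role, text in segments:
        tokens.append(f"<{role}>")   # speaker-group tag token
        tokens.extend(text.split())  # standard transcript tokens
    return tokens

print(serialize(segments))
```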
arXiv Detail & Related papers (2023-12-18T11:47:39Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated by experiments on the VoxCeleb and SITW datasets, with 9.56% and 8.24% average reductions in EER and minDCF, respectively.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent performance in terms of separation, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
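A rough, hypothetical sketch of how such a token-based, transcript-conditioned sequence-to-sequence task could be laid out; the special symbols and token strings are illustrative assumptions, not TokenSplit's actual vocabulary.

```python
# Hypothetical sequence layout for a transcript-conditioned, token-based
# separation model; token ids and special symbols are illustrative only.
def build_example(mix_tokens, transcript_tokens, spk_tokens, use_transcript=True):
    src = []
    if use_transcript and transcript_tokens is not None:
        src += ["<transcript>"] + transcript_tokens
    src += ["<audio>"] + mix_tokens                # discrete mixture tokens
    tgt = []
    for i, toks in enumerate(spk_tokens):          # one token stream per speaker
        tgt += [f"<spk{i}>"] + toks
    tgt += ["<eos>"]
    return src, tgt

mix = ["a17", "a54", "a54", "a3"]                  # e.g. codec / VQ units
txt = ["hello", "there"]
spk = [["a17", "a3"], ["a54", "a54"]]
print(build_example(mix, txt, spk))
```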
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- Monaural Multi-Speaker Speech Separation Using Efficient Transformer Model [0.0]
"Monaural multi-speaker speech separation" presents a speech-separation model based on the Transformer architecture and its efficient forms.
The model has been trained with the LibriMix dataset containing diverse speakers' utterances.
arXiv Detail & Related papers (2023-07-29T15:10:46Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of a self-supervised learning (SSL) model, trained on a large amount of data, to obtain embedding vectors from speech representations.
The disentangled embeddings enable better reproduction performance for unseen speakers and rhythm transfer conditioned on different utterances.
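A minimal sketch of the conditioning step under the assumption of a generic SSL encoder; `ssl_encode` is a placeholder returning random frame features, not the paper's actual model.

```python
# Hypothetical conditioning on an SSL-derived embedding: pool frame-level
# SSL features of a reference utterance into one vector and feed it to TTS.
import numpy as np

def ssl_encode(waveform):
    """Placeholder for a self-supervised speech encoder; here it just returns
    random frame-level features of dimension 768 for illustration."""
    n_frames = max(1, len(waveform) // 320)
    return np.random.default_rng(0).standard_normal((n_frames, 768))

def reference_embedding(waveform):
    feats = ssl_encode(waveform)        # (frames, dim) SSL representations
    return feats.mean(axis=0)           # mean-pool into a single embedding

ref_wav = np.zeros(16000)               # 1 s of (placeholder) reference audio
cond = reference_embedding(ref_wav)     # vector used to condition synthesis
print(cond.shape)                       # (768,)
```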
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
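A simplified, hypothetical sketch of t-SOT-style serialization for two overlapping speakers; the `<cc>` channel-change token and the time-ordering rule below are simplifications of the published recipe.

```python
# Simplified, hypothetical sketch of t-SOT-style serialization: word tokens
# from overlapping speakers are merged in time order, and a special <cc>
# (channel change) token marks each switch to the other speaker's stream.
words = [  # (start_time_sec, speaker, token) for a 2-speaker mixture
    (0.0, "A", "hello"), (0.4, "A", "there"),
    (0.3, "B", "hi"),    (0.7, "B", "how"), (1.0, "B", "are"), (1.3, "B", "you"),
    (1.1, "A", "good"),  (1.5, "A", "morning"),
]

def t_sot_serialize(words):
    tokens, prev_spk = [], None
    for _, spk, tok in sorted(words, key=lambda w: w[0]):
        if prev_spk is not None and spk != prev_spk:
            tokens.append("<cc>")          # switch virtual output channel
        tokens.append(tok)
        prev_spk = spk
    return tokens

print(t_sot_serialize(words))
# ['hello', '<cc>', 'hi', '<cc>', 'there', '<cc>', 'how', 'are', ...]
```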
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
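A toy, hypothetical sketch of a permutation-invariant cross-entropy of this kind, where a variable number of reference speakers is handled by padding with silent (all-zero) speakers; the shapes and padding convention are assumptions.

```python
# Hypothetical permutation-invariant binary cross-entropy for diarization
# with a variable number of speakers: unused output slots are matched
# against padded all-zero (silent) reference speakers.
from itertools import permutations
import numpy as np

def bce(p, y, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def pit_bce(pred, ref):
    """pred: (S_out, T) speaker activity probabilities.
    ref: (S_ref, T) binary labels with S_ref <= S_out; padded with zeros."""
    s_out, t = pred.shape
    padded = np.zeros((s_out, t))
    padded[: ref.shape[0]] = ref
    return min(
        np.mean([bce(pred[i], padded[p[i]]) for i in range(s_out)])
        for p in permutations(range(s_out))
    )

pred = np.array([[0.9, 0.8, 0.1, 0.1],   # model outputs 3 speaker slots
                 [0.1, 0.2, 0.9, 0.8],
                 [0.1, 0.1, 0.1, 0.2]])
ref = np.array([[0, 0, 1, 1],            # but only 2 speakers are present
                [1, 1, 0, 0]])
print(pit_bce(pred, ref))
```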
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on a multi-talker dataset derived from LibriSpeech and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
- Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds [71.36164750147827]
Clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors.
End-to-end neural diarization (EEND) directly predicts diarization labels using a neural network.
We propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers.
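The clustering half of such a hybrid can be sketched as follows (hypothetical code, with fake embeddings standing in for x-vectors or EEND-derived speaker vectors): local per-chunk speaker embeddings are clustered across the whole recording to obtain consistent global speaker labels.

```python
# Hypothetical sketch of the clustering stage of a hybrid scheme: speaker
# embeddings produced for each local chunk/speaker are clustered across the
# recording so that chunk-local speaker indices get consistent global labels.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Fake chunk-level embeddings: 2 true speakers spread over 3 chunks.
spk_a, spk_b = rng.standard_normal(32), rng.standard_normal(32)
chunk_embeddings = np.stack([
    spk_a + 0.05 * rng.standard_normal(32),   # chunk 1, local speaker 0
    spk_b + 0.05 * rng.standard_normal(32),   # chunk 1, local speaker 1
    spk_b + 0.05 * rng.standard_normal(32),   # chunk 2, local speaker 0
    spk_a + 0.05 * rng.standard_normal(32),   # chunk 3, local speaker 0
])

# Agglomerative clustering with a cosine-distance threshold; the number of
# global speakers falls out of the threshold rather than being fixed upfront.
Z = linkage(chunk_embeddings, method="average", metric="cosine")
global_labels = fcluster(Z, t=0.5, criterion="distance")
print(global_labels)   # e.g. [1 2 2 1]: chunk-local speakers linked globally
```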
arXiv Detail & Related papers (2020-10-26T06:33:02Z)
- End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification [45.38809571153867]
We propose End-to-End Neural Diarization (EEND), in which a neural network directly outputs speaker diarization results.
By feeding multi-speaker recordings with corresponding speaker segment labels, our model can be easily adapted to real conversations.
arXiv Detail & Related papers (2020-02-24T14:53:32Z)
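As a toy illustration of the multi-label formulation above (hypothetical code, not the EEND implementation), each frame gets an independent activity probability per speaker, so overlapping speech simply means several labels are active at once.

```python
# Hypothetical sketch of diarization as frame-level multi-label classification:
# for every frame the network outputs, per speaker, the probability that this
# speaker is active, so overlapping speech is just "multiple labels on".
import numpy as np

def multilabel_bce(probs, labels, eps=1e-7):
    """probs, labels: (T, S) arrays; independent sigmoid outputs per speaker."""
    p = np.clip(probs, eps, 1 - eps)
    return float(np.mean(-(labels * np.log(p) + (1 - labels) * np.log(1 - p))))

labels = np.array([[1, 0],   # frame 1: only speaker 0 active
                   [1, 1],   # frame 2: overlap, both speakers active
                   [0, 1],   # frame 3: only speaker 1 active
                   [0, 0]])  # frame 4: silence
probs = np.array([[0.9, 0.1],
                  [0.8, 0.7],
                  [0.2, 0.9],
                  [0.1, 0.1]])
print(multilabel_bce(probs, labels))   # small value: predictions match labels
```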
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.