Joint Speaker Counting, Speech Recognition, and Speaker Identification
for Overlapped Speech of Any Number of Speakers
- URL: http://arxiv.org/abs/2006.10930v2
- Date: Sat, 8 Aug 2020 20:25:22 GMT
- Title: Joint Speaker Counting, Speech Recognition, and Speaker Identification
for Overlapped Speech of Any Number of Speakers
- Authors: Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen,
Tianyan Zhou, Takuya Yoshioka
- Abstract summary: We propose an end-to-end speaker-attributed automatic speech recognition model.
It unifies speaker counting, speech recognition, and speaker identification on overlapped speech.
- Score: 38.3469744871394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an end-to-end speaker-attributed automatic speech recognition
model that unifies speaker counting, speech recognition, and speaker
identification on monaural overlapped speech. Our model is built on serialized
output training (SOT) with attention-based encoder-decoder, a recently proposed
method for recognizing overlapped speech comprising an arbitrary number of
speakers. We extend SOT by introducing a speaker inventory as an auxiliary
input to produce speaker labels as well as multi-speaker transcriptions. All
model parameters are optimized by speaker-attributed maximum mutual information
criterion, which represents a joint probability for overlapped speech
recognition and speaker identification. Experiments on LibriSpeech corpus show
that our proposed method achieves significantly better speaker-attributed word
error rate than the baseline that separately performs overlapped speech
recognition and speaker identification.
Related papers
- Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition [27.35304346509647]
We introduce speaker labels into an autoregressive transformer-based speech recognition model.
We then propose a novel speaker mask branch to detection the speech segments of individual speakers.
With the proposed model, we can perform both speech recognition and speaker diarization tasks simultaneously.
arXiv Detail & Related papers (2023-12-18T06:29:53Z) - One model to rule them all ? Towards End-to-End Joint Speaker
Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary length inputs and can handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
arXiv Detail & Related papers (2023-10-02T23:03:30Z) - Controllable speech synthesis by learning discrete phoneme-level
prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
arXiv Detail & Related papers (2022-11-29T15:43:36Z) - Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what"
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z) - Streaming Multi-talker Speech Recognition with Joint Speaker
Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on the Librispeech dataset -- a multi-talker dataset derived from Librispeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z) - Content-Aware Speaker Embeddings for Speaker Diarisation [3.6398652091809987]
The content-aware speaker embeddings (CASE) approach is proposed.
Case factorises automatic speech recognition (ASR) from speaker recognition to focus on modelling speaker characteristics.
Case achieved a 17.8% relative speaker error rate reduction over conventional methods.
arXiv Detail & Related papers (2021-02-12T12:02:03Z) - U-vectors: Generating clusterable speaker embedding from unlabeled data [0.0]
This paper introduces a speaker recognition strategy dealing with unlabeled data.
It generates clusterable embedding vectors from small fixed-size speech frames.
We conclude that the proposed approach achieves remarkable performance using pairwise architectures.
arXiv Detail & Related papers (2021-02-07T18:00:09Z) - A Machine of Few Words -- Interactive Speaker Recognition with
Reinforcement Learning [35.36769027019856]
We present a new paradigm for automatic speaker recognition that we call Interactive Speaker Recognition (ISR)
In this paradigm, the recognition system aims to incrementally build a representation of the speakers by requesting personalized utterances.
We show that our method achieves excellent performance while using little speech signal amounts.
arXiv Detail & Related papers (2020-08-07T12:44:08Z) - Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z) - Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.