End-to-End Speaker-Attributed ASR with Transformer
- URL: http://arxiv.org/abs/2104.02128v1
- Date: Mon, 5 Apr 2021 19:54:15 GMT
- Title: End-to-End Speaker-Attributed ASR with Transformer
- Authors: Naoyuki Kanda, Guoli Ye, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo
Chen, Takuya Yoshioka
- Abstract summary: This paper presents an end-to-end speaker-attributed automatic speech recognition system.
It jointly performs speaker counting, speech recognition and speaker identification for monaural multi-talker audio.
- Score: 41.7739129773237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents our recent effort on end-to-end speaker-attributed
automatic speech recognition, which jointly performs speaker counting, speech
recognition and speaker identification for monaural multi-talker audio.
Firstly, we thoroughly update the model architecture that was previously
designed based on a long short-term memory (LSTM)-based attention encoder
decoder by applying transformer architectures. Secondly, we propose a speaker
deduplication mechanism to reduce speaker identification errors in highly
overlapped regions. Experimental results on the LibriSpeechMix dataset show
that the transformer-based architecture is especially good at counting the
speakers and that the proposed model reduces the speaker-attributed word error
rate by 47% over the LSTM-based baseline. Furthermore, for the LibriCSS
dataset, which consists of real recordings of overlapped speech, the proposed
model achieves concatenated minimum-permutation word error rates of 11.9% and
16.3% with and without target speaker profiles, respectively, both of which are
the state-of-the-art results for LibriCSS with the monaural setting.
Related papers
- One model to rule them all? Towards End-to-End Joint Speaker
Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary length inputs and can handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
arXiv Detail & Related papers (2023-10-02T23:03:30Z)
- Lexical Speaker Error Correction: Leveraging Language Models for Speaker
Diarization Error Correction [4.409889336732851]
Speaker diarization (SD) is typically used with an automatic speech recognition (ASR) system to ascribe speaker labels to recognized words.
This approach can lead to speaker errors especially around speaker turns and regions of speaker overlap.
We propose a novel second-pass speaker error correction system using lexical information.
arXiv Detail & Related papers (2023-06-15T17:47:41Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline.
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global
Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Investigation of Speaker-adaptation methods in Transformer based ASR [8.637110868126548]
This paper explores different ways of incorporating speaker information at the encoder input while training a transformer-based model to improve its speech recognition performance.
We present speaker information in the form of speaker embeddings for each of the speakers.
We obtain improvements in the word error rate over the baseline through our approach of integrating speaker embeddings into the model.
arXiv Detail & Related papers (2020-08-07T16:09:03Z)
- Self-attention encoding and pooling for speaker recognition [16.96341561111918]
We propose a tandem Self-Attention and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding given non-fixed length speech utterances.
SAEP encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification.
We have evaluated this approach on both VoxCeleb1 & 2 datasets.
arXiv Detail & Related papers (2020-08-03T09:31:27Z)
- Joint Speaker Counting, Speech Recognition, and Speaker Identification
for Overlapped Speech of Any Number of Speakers [38.3469744871394]
We propose an end-to-end speaker-attributed automatic speech recognition model.
It unifies speaker counting, speech recognition, and speaker identification on overlapped speech.
arXiv Detail & Related papers (2020-06-19T02:05:18Z)