Unified Autoregressive Modeling for Joint End-to-End Multi-Talker
Overlapped Speech Recognition and Speaker Attribute Estimation
- URL: http://arxiv.org/abs/2107.01549v1
- Date: Sun, 4 Jul 2021 05:47:18 GMT
- Title: Unified Autoregressive Modeling for Joint End-to-End Multi-Talker
Overlapped Speech Recognition and Speaker Attribute Estimation
- Authors: Ryo Masumura, Daiki Okamura, Naoki Makishima, Mana Ihori, Akihiko
Takashima, Tomohiro Tanaka, Shota Orihashi
- Abstract summary: We present a novel modeling method for single-channel multi-talker overlapped automatic speech recognition (ASR) systems.
We propose unified autoregressive modeling for joint end-to-end multi-talker overlapped ASR and speaker attribute estimation.
- Score: 26.911867847630187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a novel modeling method for single-channel
multi-talker overlapped automatic speech recognition (ASR) systems. Fully
neural network based end-to-end models have dramatically improved the
performance of multi-talker overlapped ASR tasks. One promising approach for
end-to-end modeling is autoregressive modeling with serialized output training
in which transcriptions of multiple speakers are recursively generated one
after another. This enables us to naturally capture relationships between
speakers. However, the conventional modeling method cannot explicitly take into
account the speaker attributes of individual utterances such as gender and age
information. In fact, the performance deteriorates when the speakers are of the
same gender or are close in age. To address this problem, we propose unified
autoregressive modeling for joint end-to-end multi-talker overlapped ASR and
speaker attribute estimation. Our key idea is to handle gender and age
estimation tasks within the unified autoregressive modeling. In the proposed
method, a transformer-based autoregressive model recursively generates not only
textual tokens but also attribute tokens of each speaker. This enables us to
effectively utilize speaker attributes for improving multi-talker overlapped
ASR. Experiments on Japanese multi-talker overlapped ASR tasks demonstrate the
effectiveness of the proposed method.
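
The serialized target sequence described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration only: the token spellings (`<sc>`, `<eos>`, `<gender:...>`, `<age:...>`), the `Utterance` type, the `serialize` helper, and the choice to emit attribute tokens before each speaker's text tokens. The abstract states only that attribute tokens and textual tokens are generated for each speaker within one autoregressive sequence.

```python
from typing import List, NamedTuple

SC = "<sc>"    # hypothetical speaker-change token separating speakers
EOS = "<eos>"  # hypothetical end-of-sequence token


class Utterance(NamedTuple):
    gender: str        # attribute label, e.g. "female" (label set is assumed)
    age: str           # attribute label, e.g. "adult" (label set is assumed)
    tokens: List[str]  # text tokens of this speaker's transcription


def serialize(utterances: List[Utterance]) -> List[str]:
    """Flatten several speakers' utterances into one target token sequence.

    Assumed layout per speaker: gender token, age token, then text tokens.
    Speakers are separated by SC and the whole sequence ends with EOS.
    """
    target: List[str] = []
    for i, utt in enumerate(utterances):
        if i > 0:
            target.append(SC)
        target.append(f"<gender:{utt.gender}>")
        target.append(f"<age:{utt.age}>")
        target.extend(utt.tokens)
    target.append(EOS)
    return target


if __name__ == "__main__":
    mixture = [
        Utterance("female", "adult", ["hello", "there"]),
        Utterance("male", "senior", ["good", "morning"]),
    ]
    print(serialize(mixture))
```

Because the attribute tokens sit in the same sequence as the text, a decoder trained on such targets conditions each speaker's transcription on the attributes it has already generated, which is the intuition the abstract gives for why joint estimation could help when speakers sound alike.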
Related papers
- From Modular to End-to-End Speaker Diarization [3.079020586262228]
We describe a system based on a Bayesian hidden Markov model used to cluster x-vectors (speaker embeddings obtained with a neural network), known as VBx.
We describe an approach for generating synthetic data which resembles real conversations in terms of speaker turns and overlaps.
We show how this method generating "simulated conversations" allows for better performance than using a previously proposed method for creating "simulated mixtures" when training the popular EEND.
arXiv Detail & Related papers (2024-06-27T15:09:39Z)
- Adapting Multi-Lingual ASR Models for Handling Multiple Talkers [63.151811561972515]
State-of-the-art large-scale universal speech models (USMs) show a decent automatic speech recognition (ASR) performance across multiple domains and languages.
We propose an approach to adapt USMs for multi-talker ASR.
We first develop an enhanced version of serialized output training to jointly perform multi-talker ASR and utterance timestamp prediction.
arXiv Detail & Related papers (2023-05-30T05:05:52Z)
- Unified Modeling of Multi-Talker Overlapped Speech Recognition and
Diarization with a Sidecar Separator [42.8787280791491]
Multi-talker overlapped speech poses a significant challenge for speech recognition and diarization.
We propose a cost-effective method to convert a single-talker automatic speech recognition system into a multi-talker one.
We incorporate a diarization branch into the Sidecar, allowing for unified modeling of both ASR and diarization with a negligible overhead of only 768 parameters.
arXiv Detail & Related papers (2023-05-25T17:18:37Z)
- Continual Learning for On-Device Speech Recognition using Disentangled
Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Simulating realistic speech overlaps improves multi-talker ASR [36.39193360559079]
We propose an improved technique to simulate multi-talker overlapping speech with realistic speech overlaps.
With this representation, speech overlapping patterns can be learned from real conversations based on a statistical language model, such as N-gram.
In our experiments, multi-talker ASR models trained with the proposed method show consistent improvement on the word error rates across multiple datasets.
arXiv Detail & Related papers (2022-10-27T18:29:39Z)
- Automatic Evaluation of Speaker Similarity [0.0]
We introduce a new automatic evaluation method for speaker similarity assessment, consistent with human perceptual scores.
Our experiments show that we can train a model to predict speaker similarity MUSHRA scores from speaker embeddings with 0.96 accuracy and significant correlation up to 0.78 Pearson score at the utterance level.
arXiv Detail & Related papers (2022-07-01T11:23:16Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of less inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global
Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for
Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In the existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) as encoder represent the dialogues coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.