Analyzing And Improving Neural Speaker Embeddings for ASR
- URL: http://arxiv.org/abs/2301.04571v2
- Date: Wed, 20 Sep 2023 07:43:13 GMT
- Title: Analyzing And Improving Neural Speaker Embeddings for ASR
- Authors: Christoph Lüscher, Jingjing Xu, Mohammad Zeineldeen, Ralf Schlüter, Hermann Ney
- Abstract summary: We present our efforts toward integrating neural speaker embeddings into a Conformer-based hybrid HMM ASR system.
Our best Conformer-based hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and Hub5'01 with training on SWB 300h.
- Score: 54.30093015525726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural speaker embeddings encode the speaker's speech characteristics through
a DNN model and are prevalent for speaker verification tasks. However, few
studies have investigated the usage of neural speaker embeddings for an ASR
system. In this work, we present our efforts toward integrating neural speaker
embeddings into a Conformer-based hybrid HMM ASR system. For ASR, our improved
embedding extraction pipeline, combined with the Weighted-Simple-Add
integration method, brings x-vectors and c-vectors to performance on par with
i-vectors. We further compare and analyze different speaker embeddings. We
present acoustic model improvements obtained by switching from the newbob
learning rate schedule to a one-cycle learning rate schedule, resulting in a
~3% relative WER reduction on Switchboard while also reducing the overall
training time by 17%. By further adding neural speaker embeddings, we gain an
additional ~3% relative WER improvement on Hub5'00. Our best Conformer-based
hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and
Hub5'01 with training on SWB 300h.
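The Weighted-Simple-Add integration can be pictured as adding a learned, weighted projection of the utterance-level speaker embedding to the frame-wise input of the multi-head self-attention module in a Conformer block. Below is a minimal NumPy sketch of that idea; the projection matrix, the scalar weight, and all dimensions are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def weighted_simple_add(mhsa_input, speaker_embedding, proj, weight):
    """Add a weighted, projected speaker embedding to every frame.

    mhsa_input:        (T, d_model) frame-wise input to multi-head self-attention
    speaker_embedding: (d_spk,)     utterance-level x-/c-/i-vector
    proj:              (d_spk, d_model) learned projection matrix (assumed)
    weight:            scalar weight (assumed to be learned)
    """
    spk = speaker_embedding @ proj        # project to model dimension: (d_model,)
    return mhsa_input + weight * spk      # broadcast the same vector over all T frames

# Toy usage with made-up dimensions.
T, d_model, d_spk = 100, 512, 256
x = np.random.randn(T, d_model)
e = np.random.randn(d_spk)
W = np.random.randn(d_spk, d_model) * 0.01
y = weighted_simple_add(x, e, W, weight=0.5)
print(y.shape)  # (100, 512)
```

The one-cycle learning rate schedule mentioned above can likewise be sketched as a piecewise-linear curve; the warm-up/decay fractions and the learning rate values below are assumptions, since they are not given in the abstract.

```python
def one_cycle_lr(step, total_steps, peak_lr=8e-4, initial_lr=8e-5, final_lr=1e-6):
    """Piecewise-linear one-cycle schedule (sketch, not the paper's exact recipe):
    linear warm-up to peak_lr over the first ~45% of steps, linear decay back to
    initial_lr over the next ~45%, then a short final decay to final_lr."""
    up_end = max(1, int(0.45 * total_steps))
    down_end = max(up_end + 1, int(0.90 * total_steps))
    if step < up_end:
        return initial_lr + (peak_lr - initial_lr) * step / up_end
    if step < down_end:
        return peak_lr - (peak_lr - initial_lr) * (step - up_end) / (down_end - up_end)
    return initial_lr - (initial_lr - final_lr) * (step - down_end) / max(1, total_steps - down_end)
```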
Related papers
- Improving the Training Recipe for a Robust Conformer-based Hybrid Model [46.78701739177677]
We investigate various methods for speaker adaptive training (SAT) based on feature-space approaches for a Conformer-based acoustic model (AM).
We propose a method, called weighted-Simple-Add, which adds weighted speaker information vectors to the input of the multi-head self-attention module of the Conformer AM.
We extend and improve this recipe, achieving an 11% relative improvement in word error rate (WER) on the Switchboard 300h Hub5'00 dataset.
arXiv Detail & Related papers (2022-06-26T20:01:08Z)
- End-to-End Multi-speaker ASR with Independent Vector Analysis [80.83577165608607]
We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition.
We propose a framework for joint source separation and dereverberation based on the independent vector analysis (IVA) paradigm.
arXiv Detail & Related papers (2022-04-01T05:45:33Z)
- ASR-Aware End-to-end Neural Diarization [15.172086811068962]
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.
Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features.
Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features.
arXiv Detail & Related papers (2022-02-02T21:17:14Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Feature Replacement and Combination for Hybrid ASR Systems [47.74348197215634]
We investigate the usefulness of one such front-end framework, namely wav2vec, for hybrid ASR systems.
In addition to deploying a pre-trained feature extractor, we explore how to make use of an existing acoustic model (AM) trained on the same task with different features.
We obtain relative improvements of 4% and 6% over our previous best model on the LibriSpeech test-clean and test-other sets.
arXiv Detail & Related papers (2021-04-09T11:04:58Z)
- Combination of Deep Speaker Embeddings for Diarisation [9.053645441056256]
This paper proposes a c-vector method by combining multiple sets of complementary d-vectors derived from systems with different NN components.
A neural-based single-pass speaker diarisation pipeline is also proposed in this paper.
Experiments and detailed analyses are conducted on the challenging AMI and NIST RT05 datasets.
arXiv Detail & Related papers (2020-10-22T20:16:36Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)