Single headed attention based sequence-to-sequence model for
state-of-the-art results on Switchboard
- URL: http://arxiv.org/abs/2001.07263v3
- Date: Tue, 20 Oct 2020 03:33:19 GMT
- Title: Single headed attention based sequence-to-sequence model for
state-of-the-art results on Switchboard
- Authors: Zoltán Tüske, George Saon, Kartik Audhkhasi, Brian Kingsbury
- Abstract summary: We show that state-of-the-art recognition performance can be achieved on the Switchboard-300 database.
Using a cross-utterance language model, our single-pass speaker independent system reaches 6.4% and 12.5% word error rate (WER) on the Switchboard and CallHome subsets of Hub5'00.
- Score: 36.06535394840605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is generally believed that direct sequence-to-sequence (seq2seq) speech
recognition models are competitive with hybrid models only when a large amount
of data, at least a thousand hours, is available for training. In this paper,
we show that state-of-the-art recognition performance can be achieved on the
Switchboard-300 database using a single headed attention, LSTM based model.
Using a cross-utterance language model, our single-pass speaker independent
system reaches 6.4% and 12.5% word error rate (WER) on the Switchboard and
CallHome subsets of Hub5'00, without a pronunciation lexicon. While careful
regularization and data augmentation are crucial in achieving this level of
performance, experiments on Switchboard-2000 show that nothing is more useful
than more data. Overall, the combination of various regularizations and a
simple but fairly large model results in a new state of the art, 4.7% and 7.8%
WER on the Switchboard and CallHome sets, using SWB-2000 without any external
data resources.
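The core architecture named in the abstract, a single-headed attention, LSTM-based encoder-decoder, can be illustrated with a minimal sketch. The snippet below is an illustrative assumption rather than the authors' implementation: it shows one scaled dot-product attention query from a decoder state over bidirectional-LSTM encoder outputs, and all module names, dimensions, and the choice of PyTorch are hypothetical.

```python
import torch
import torch.nn as nn


class SingleHeadAttention(nn.Module):
    """A single attention head: one decoder state queries all encoder time steps."""

    def __init__(self, enc_dim: int, dec_dim: int, att_dim: int):
        super().__init__()
        self.query_proj = nn.Linear(dec_dim, att_dim, bias=False)
        self.key_proj = nn.Linear(enc_dim, att_dim, bias=False)
        self.scale = att_dim ** -0.5

    def forward(self, dec_state: torch.Tensor, enc_states: torch.Tensor):
        # dec_state:  (batch, dec_dim)        current decoder LSTM hidden state
        # enc_states: (batch, time, enc_dim)  encoder LSTM outputs
        q = self.query_proj(dec_state).unsqueeze(1)             # (batch, 1, att_dim)
        k = self.key_proj(enc_states)                           # (batch, time, att_dim)
        scores = torch.bmm(q, k.transpose(1, 2)) * self.scale   # (batch, 1, time)
        weights = torch.softmax(scores, dim=-1)                 # normalize over time
        context = torch.bmm(weights, enc_states)                # (batch, 1, enc_dim)
        return context.squeeze(1), weights.squeeze(1)


# Toy usage: a bidirectional LSTM encoder and one attention query per decoder step.
encoder = nn.LSTM(input_size=80, hidden_size=256, num_layers=2,
                  batch_first=True, bidirectional=True)
attention = SingleHeadAttention(enc_dim=512, dec_dim=320, att_dim=256)

features = torch.randn(4, 100, 80)     # (batch, frames, log-mel features) -- dummy input
enc_out, _ = encoder(features)         # (4, 100, 512); 512 = 2 directions * 256
dec_state = torch.randn(4, 320)        # stand-in for the decoder LSTM state
context, weights = attention(dec_state, enc_out)
print(context.shape, weights.shape)    # torch.Size([4, 512]) torch.Size([4, 100])
```

The design point the abstract argues for is that a single attention head of this kind, paired with careful regularization and data augmentation, is sufficient for state-of-the-art results; the sketch assumes no multi-head attention or Transformer layers.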
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding network addresses limitations of SIMO models by aggregating cross-speaker representations.
The network is integrated with SOT (serialized output training) to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- USM-SCD: Multilingual Speaker Change Detection Based on Large Pretrained Foundation Models [17.87796508561949]
We introduce a multilingual speaker change detection model (USM-SCD) that can simultaneously detect speaker turns and perform ASR for 96 languages.
We show that the USM-SCD model can achieve more than 75% average speaker change detection F1 score across a test set that consists of data from 96 languages.
arXiv Detail & Related papers (2023-09-14T20:46:49Z)
- Evaluation of Speech Representations for MOS prediction [0.7329200485567826]
In this paper, we evaluate feature extraction models for predicting speech quality.
We also propose a model architecture to compare embeddings of supervised learning and self-supervised learning models with embeddings of speaker verification models.
arXiv Detail & Related papers (2023-06-16T17:21:42Z)
- Robust Speech Recognition via Large-Scale Weak Supervision [69.63329359286419]
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet.
When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks.
We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
arXiv Detail & Related papers (2022-12-06T18:46:04Z)
- A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer [31.028408352051684]
We present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech.
Our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input.
arXiv Detail & Related papers (2022-07-14T16:21:33Z)
- Raw waveform speaker verification for supervised and self-supervised learning [30.08242210230669]
This paper proposes a new raw waveform speaker verification model that incorporates techniques proven effective for speaker verification.
Under the best performing configuration, the model shows an equal error rate of 0.89%, competitive with state-of-the-art models.
We also explore the proposed model with a self-supervised learning framework and show the state-of-the-art performance in this line of research.
arXiv Detail & Related papers (2022-03-16T09:28:03Z)
- Automatic Learning of Subword Dependent Model Scales [50.105894487730545]
We show that the model scales for a combination of an attention encoder-decoder acoustic model and a language model can be learned as effectively as with manual tuning.
We extend this approach to subword-dependent model scales, which could not be tuned manually, leading to a 7% improvement on LBS and 3% on SWB.
arXiv Detail & Related papers (2021-10-18T13:48:28Z)
- On the limit of English conversational speech recognition [28.395662280898787]
We show that a single headed attention encoder-decoder model is able to reach state-of-the-art results in conversational speech recognition.
We reduce the recognition errors of our LSTM system on Switchboard-300 by 4% relative.
We report 5.9% and 11.5% WER on the SWB and CHM parts of Hub5'00 with very simple LSTM models.
arXiv Detail & Related papers (2021-05-03T16:32:38Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker at each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
arXiv Detail & Related papers (2020-05-14T21:24:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.