Exploring the limits of decoder-only models trained on public speech recognition corpora
- URL: http://arxiv.org/abs/2402.00235v1
- Date: Wed, 31 Jan 2024 23:29:42 GMT
- Title: Exploring the limits of decoder-only models trained on public speech recognition corpora
- Authors: Ankit Gupta, George Saon, Brian Kingsbury
- Abstract summary: Decoder-Only Transformer for ASR (DOTA) model comprehensively outperforms the encoder-decoder open source replication of Whisper (OWSM) on nearly all English ASR benchmarks and outperforms Whisper large-v3 on 7 out of 15 test sets.
- Score: 36.446905777292066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The emergence of industrial-scale speech recognition (ASR) models such as
Whisper and USM, trained on 1M hours of weakly labelled data and 12M hours of
audio-only proprietary data respectively, has led to a stronger need for
large-scale public ASR corpora and competitive open-source pipelines. Unlike
these models, large language models are typically based on Transformer decoders,
and it remains unclear whether decoder-only models trained on public data alone
can deliver competitive performance. In this work, we investigate factors such as
the choice of training datasets and the modeling components necessary for
obtaining the best performance using public English ASR corpora alone. Our
Decoder-Only Transformer for ASR (DOTA) model comprehensively outperforms the
encoder-decoder open-source replication of Whisper (OWSM) on nearly all English
ASR benchmarks and outperforms Whisper large-v3 on 7 out of 15 test sets. We
release our codebase and model checkpoints under a permissive license.
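The abstract does not spell out the DOTA architecture, but a decoder-only ASR model is commonly built by feeding downsampled audio features as a prefix into a causal Transformer stack and predicting text tokens autoregressively. The sketch below illustrates that general pattern only; every module, dimension, and the 80-dim log-mel front end are illustrative assumptions, not the paper's recipe.

```python
import torch
import torch.nn as nn


class DecoderOnlyASR(nn.Module):
    """Causal Transformer over [audio prefix, text tokens]; all sizes are illustrative."""

    def __init__(self, vocab_size=4096, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        # Hypothetical front end: project 80-dim log-mel frames to d_model
        # and downsample 4x in time (positional encodings omitted for brevity).
        self.audio_proj = nn.Conv1d(80, d_model, kernel_size=4, stride=4)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        # Self-attention blocks with a causal mask form the "decoder-only" stack.
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        # mel: (batch, 80, frames); tokens: (batch, text_len)
        audio = self.audio_proj(mel).transpose(1, 2)   # (batch, T, d_model)
        text = self.token_emb(tokens)                  # (batch, L, d_model)
        x = torch.cat([audio, text], dim=1)            # audio prefix, then text
        size = x.size(1)
        causal = torch.triu(torch.full((size, size), float("-inf")), diagonal=1)
        h = self.blocks(x, mask=causal)
        # Next-token logits over the text positions only.
        return self.lm_head(h[:, audio.size(1):, :])


model = DecoderOnlyASR()
logits = model(torch.randn(2, 80, 400), torch.randint(0, 4096, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 4096])
```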
Related papers
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z) - Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations [16.577870835480585]
We present a comprehensive analysis of building ASR systems with discrete codes.
We investigate different training choices, such as quantization schemes and time-domain versus spectral feature encodings.
We introduce a pipeline that outperforms Encodec at a similar bit-rate.
arXiv Detail & Related papers (2024-07-03T20:51:41Z) - Anatomy of Industrial Scale Multilingual ASR [13.491861238522421]
This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system.
Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages.
arXiv Detail & Related papers (2024-04-15T14:48:43Z) - OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification [44.94458898538114]
We propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC).
It is trained on 180k hours of public audio data for multilingual automatic speech recognition (ASR), speech translation (ST), and language identification (LID).
Compared to the encoder-decoder OWSM, our OWSM-CTC achieves competitive results on ASR and up to 24% relative improvement on ST, while being more robust and 3 to 4 times faster at inference (see the generic CTC training sketch after this list).
arXiv Detail & Related papers (2024-02-20T02:04:38Z) - Digits micro-model for accurate and secure transactions [0.5999777817331317]
We highlight the potential of smaller, specialized "micro" speech recognition models.
Unlike larger speech recognition models, micro-models are trained on carefully selected and curated datasets.
Our work contributes to domain-specific ASR models, improving digit recognition accuracy and data privacy.
arXiv Detail & Related papers (2024-02-02T22:01:27Z) - Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding network addresses limitations of SIMO models by aggregating cross-speaker representations.
The network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z) - A Lexical-aware Non-autoregressive Transformer-based ASR Model [9.500518278458905]
We propose a lexical-aware non-autoregressive Transformer-based (LA-NAT) ASR framework, which consists of an acoustic encoder, a speech-text shared encoder, and a speech-text shared decoder.
LA-NAT aims to make the ASR model aware of lexical information, so the resulting model is expected to achieve better results by leveraging the learned linguistic knowledge.
arXiv Detail & Related papers (2023-05-18T09:50:47Z) - Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of lower inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z) - How Phonotactics Affect Multilingual and Zero-shot ASR Performance [74.70048598292583]
A Transformer encoder-decoder model has been shown to leverage multilingual data well when trained on IPA transcriptions of the languages presented during training.
We replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM.
We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a model can hurt zero-shot transfer.
arXiv Detail & Related papers (2020-10-22T23:07:24Z) - You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
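The OWSM-CTC entry above relies on Connectionist Temporal Classification. As a rough, generic illustration of a CTC training step in PyTorch (the LSTM encoder, vocabulary size, and shapes below are placeholders and do not reflect the actual OWSM-CTC configuration):

```python
import torch
import torch.nn as nn

vocab_size = 1000          # hypothetical token inventory (index 0 = CTC blank)
encoder = nn.LSTM(input_size=80, hidden_size=256, num_layers=2, batch_first=True)
classifier = nn.Linear(256, vocab_size)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# Toy batch: 2 utterances of up to 200 log-mel frames, targets of length 12 and 9.
feats = torch.randn(2, 200, 80)
targets = torch.randint(1, vocab_size, (2, 12))
input_lengths = torch.tensor([200, 180])
target_lengths = torch.tensor([12, 9])

hidden, _ = encoder(feats)                       # (batch, time, 256)
log_probs = classifier(hidden).log_softmax(-1)   # (batch, time, vocab)
# nn.CTCLoss expects (time, batch, vocab) log-probabilities.
loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
print(float(loss))
```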