Training Autoregressive Speech Recognition Models with Limited in-domain Supervision
- URL: http://arxiv.org/abs/2210.15135v1
- Date: Thu, 27 Oct 2022 02:49:23 GMT
- Title: Training Autoregressive Speech Recognition Models with Limited in-domain Supervision
- Authors: Chak-Fai Li, Francis Keith, William Hartmann, Matthew Snover
- Abstract summary: We explore limited supervision in the domain of conversational speech.
We augment the XLS-R model with open source read speech data.
We demonstrate that by using the XLS-R model for pseudotranscription, a much smaller autoregressive model can outperform a finetuned XLS-R model.
- Score: 6.519568453645212
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advances in self-supervised learning have significantly reduced the amount of
transcribed audio required for training. However, the majority of work in this
area is focused on read speech. We explore limited supervision in the domain of
conversational speech. While we assume the amount of in-domain data is limited,
we augment the model with open source read speech data. The XLS-R model has
been shown to perform well with limited adaptation data and serves as a strong
baseline. We use untranscribed data for self-supervised learning and
semi-supervised training in an autoregressive encoder-decoder model. We
demonstrate that by using the XLS-R model for pseudotranscription, a much
smaller autoregressive model can outperform a finetuned XLS-R model when
transcribed in-domain data is limited, reducing WER by as much as 8% absolute.
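The recipe above can be approximated with off-the-shelf tooling. The sketch below (Python, using the Hugging Face transformers library) illustrates the two stages: pseudotranscribing untranscribed conversational audio with a finetuned XLS-R CTC model, then training a much smaller autoregressive encoder-decoder on the resulting pseudo-labels. The checkpoint paths, file names, and the choice of the small Speech2Text model are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of the two-stage recipe: (1) pseudotranscribe untranscribed
# conversational audio with a finetuned XLS-R CTC model, (2) train a much
# smaller autoregressive encoder-decoder on the pseudo-labels.
# Checkpoint names and file paths below are illustrative assumptions.
import torch
import soundfile as sf
from transformers import (
    Wav2Vec2Processor, Wav2Vec2ForCTC,
    Speech2TextProcessor, Speech2TextForConditionalGeneration,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Stage 1: pseudotranscription with the (large) finetuned XLS-R model ---
xlsr_proc = Wav2Vec2Processor.from_pretrained("path/to/xlsr-finetuned")  # hypothetical local checkpoint
xlsr = Wav2Vec2ForCTC.from_pretrained("path/to/xlsr-finetuned").to(device).eval()

def pseudotranscribe(wav_path: str) -> str:
    audio, sr = sf.read(wav_path)                      # expects 16 kHz mono audio
    inputs = xlsr_proc(audio, sampling_rate=sr, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = xlsr(**inputs).logits
    ids = torch.argmax(logits, dim=-1)                 # greedy CTC decoding
    return xlsr_proc.batch_decode(ids)[0]

untranscribed = ["convo_0001.wav", "convo_0002.wav"]   # placeholder in-domain audio
pseudo_labels = [(p, pseudotranscribe(p)) for p in untranscribed]

# --- Stage 2: train a smaller autoregressive encoder-decoder on pseudo-labels ---
s2t_proc = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")
s2t = Speech2TextForConditionalGeneration.from_pretrained(
    "facebook/s2t-small-librispeech-asr").to(device).train()
optim = torch.optim.AdamW(s2t.parameters(), lr=1e-5)

for wav_path, text in pseudo_labels:                   # single-example "batches" for brevity
    audio, sr = sf.read(wav_path)
    feats = s2t_proc(audio, sampling_rate=sr, return_tensors="pt").to(device)
    labels = s2t_proc.tokenizer(text, return_tensors="pt").input_ids.to(device)
    loss = s2t(input_features=feats.input_features,
               attention_mask=feats.attention_mask,
               labels=labels).loss
    loss.backward()
    optim.step()
    optim.zero_grad()
```

In practice the pseudo-labels would be filtered (e.g., by model confidence) and properly batched, but the skeleton shows where the large model's output becomes supervision for the small autoregressive model.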
Related papers
- Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Large Language Model (LLM) pretraining traditionally relies on autoregressive language modeling on randomly sampled data blocks from web-scale datasets.
Taking inspiration from human learning techniques such as spaced repetition, we hypothesize that random data sampling for LLMs leads to high training costs and low-quality models that tend to forget data.
In order to effectively commit web-scale information to long-term memory, we propose the LFR (Learn, Focus, and Review) pedagogy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
- Semi-Autoregressive Streaming ASR With Label Context [70.76222767090638]
We propose a streaming "semi-autoregressive" ASR model that incorporates the labels emitted in previous blocks as additional context.
Experiments show that our method outperforms the existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on Librispeech-100 clean/other test sets, and 19%/8% on the Switchboard(SWB)/Callhome(CH) test sets.
arXiv Detail & Related papers (2023-09-19T20:55:58Z)
- Exploring Representation Learning for Small-Footprint Keyword Spotting [11.586285744728068]
The main challenges of keyword spotting (KWS) are limited labeled data and limited available device resources.
To address those challenges, we explore representation learning for KWS by self-supervised contrastive learning and self-training with pretrained model.
Experiments on speech commands dataset show that the self-training WVC module and the self-supervised LGCSiam module significantly improve accuracy.
arXiv Detail & Related papers (2023-03-20T07:09:26Z)
- Speculative Decoding with Big Little Decoder [108.95187338417541]
Big Little Decoder (BiLD) is a framework that can improve inference efficiency and latency for a wide range of text generation applications.
On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation.
Our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture.
arXiv Detail & Related papers (2023-02-15T18:55:29Z)
- Model Extraction Attack against Self-supervised Speech Models [52.81330435990717]
Self-supervised learning (SSL) speech models generate meaningful representations of input speech clips.
Model extraction attack (MEA) often refers to an adversary stealing the functionality of the victim model with only query access.
We study the MEA problem against SSL speech models with a small number of queries.
arXiv Detail & Related papers (2022-11-29T09:28:05Z)
- LT-LM: a novel non-autoregressive language model for single-shot lattice rescoring [55.16665077221941]
We propose a novel rescoring approach, which processes the entire lattice in a single call to the model.
The key feature of our rescoring policy is a novel non-autoregressive Lattice Transformer Language Model (LT-LM).
arXiv Detail & Related papers (2021-04-06T14:06:07Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model; a rough sketch of this fusion idea is given after this list.
Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
- Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data [44.48235209327319]
Streaming end-to-end automatic speech recognition models are widely used on smart speakers and on-device applications.
We propose a novel and effective learning method by leveraging a non-streaming ASR model as a teacher.
We scale the training of streaming models to up to 3 million hours of YouTube audio.
arXiv Detail & Related papers (2020-10-22T22:41:33Z)
- Improving Unsupervised Sparsespeech Acoustic Models with Categorical Reparameterization [31.977418525076626]
We extend the Sparsespeech model to allow for sampling over a random variable, yielding pseudo-posteriorgrams.
The new and improved model is trained and evaluated on the Libri-Light corpus, a benchmark for ASR with limited or no supervision.
We observe a relative improvement of up to 31.4% on ABX error rates across speakers on the test set with the improved model.
arXiv Detail & Related papers (2020-05-29T13:58:36Z)
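As a companion to the "Efficiently Fusing Pretrained Acoustic and Linguistic Encoders" entry above, the sketch below shows one generic way to wire a pretrained wav2vec 2.0 encoder to a pretrained BERT decoder using Hugging Face's SpeechEncoderDecoderModel. This is an approximation of the fusion idea, not that paper's exact architecture or training recipe; the waveform and transcript are dummy placeholders.

```python
# Minimal sketch of fusing a pretrained acoustic encoder (wav2vec 2.0) with a
# pretrained linguistic decoder (BERT) into a single end-to-end ASR model.
# This uses the generic SpeechEncoderDecoderModel, not that paper's exact method.
import torch
from transformers import AutoFeatureExtractor, AutoTokenizer, SpeechEncoderDecoderModel

encoder_id = "facebook/wav2vec2-base-960h"   # pretrained acoustic encoder
decoder_id = "bert-base-uncased"             # pretrained linguistic decoder

feature_extractor = AutoFeatureExtractor.from_pretrained(encoder_id)
tokenizer = AutoTokenizer.from_pretrained(decoder_id)
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(encoder_id, decoder_id)

# Required decoding/config glue for the fused model.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# One training step on a single (audio, transcript) pair; both are dummy
# placeholders for real low-resource paired in-domain data.
waveform = torch.randn(16000 * 5)            # 5 s of 16 kHz audio (dummy)
transcript = "hello world"                   # dummy transcript

input_values = feature_extractor(waveform.numpy(), sampling_rate=16000,
                                 return_tensors="pt").input_values
labels = tokenizer(transcript, return_tensors="pt").input_ids

loss = model(input_values=input_values, labels=labels).loss
loss.backward()
```

Only the newly added cross-attention between the two pretrained components starts from random initialization, which is why such fused models can be finetuned with relatively little paired low-resource data.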