Semi-Supervised Speech Recognition via Local Prior Matching
- URL: http://arxiv.org/abs/2002.10336v1
- Date: Mon, 24 Feb 2020 16:07:11 GMT
- Title: Semi-Supervised Speech Recognition via Local Prior Matching
- Authors: Wei-Ning Hsu, Ann Lee, Gabriel Synnaeve, Awni Hannun
- Abstract summary: Local prior matching is a semi-supervised objective that distills knowledge from a strong prior.
We demonstrate that LPM is theoretically well-motivated, simple to implement, and superior to existing knowledge distillation techniques.
- Score: 42.311823406287864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For sequence transduction tasks like speech recognition, a strong structured
prior model encodes rich information about the target space, implicitly ruling
out invalid sequences by assigning them low probability. In this work, we
propose local prior matching (LPM), a semi-supervised objective that distills
knowledge from a strong prior (e.g. a language model) to provide learning
signal to a discriminative model trained on unlabeled speech. We demonstrate
that LPM is theoretically well-motivated, simple to implement, and superior to
existing knowledge distillation techniques under comparable settings. Starting
from a baseline trained on 100 hours of labeled speech, with an additional 360
hours of unlabeled data, LPM recovers 54% and 73% of the word error rate on
clean and noisy test sets relative to a fully supervised model on the same
data.
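The core idea of the abstract can be sketched as follows: for an unlabeled utterance, hypotheses are sampled from the acoustic model, re-weighted by the language-model prior, and the acoustic model is trained to match that re-weighted distribution. The helper below is a toy illustration of this objective, not the authors' code; the function name and the use of pre-computed log-probabilities are assumptions for the sketch.

```python
import math

def local_prior_matching_loss(hyp_logprobs_am, hyp_logprobs_lm):
    """Toy sketch of a local-prior-matching-style objective (hypothetical
    helper): hypotheses sampled from the acoustic model on one unlabeled
    utterance are re-weighted by the language-model prior, and the loss is
    the prior-weighted negative log-likelihood under the acoustic model."""
    # Normalize the LM scores over the sampled hypotheses (a softmax over
    # log-probabilities) to form target weights q_i proportional to p_LM(y_i).
    m = max(hyp_logprobs_lm)
    exps = [math.exp(l - m) for l in hyp_logprobs_lm]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Prior-weighted negative log-likelihood of the hypotheses under the
    # acoustic model: hypotheses the LM prefers contribute more signal.
    return -sum(w * lp for w, lp in zip(weights, hyp_logprobs_am))
```

With a flat prior the loss reduces to the mean negative log-likelihood of the sampled hypotheses; a prior that concentrates on one hypothesis pushes the loss toward that hypothesis's own negative log-likelihood.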
Related papers
- SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at rates as low as 5Hz and 60bps and achieves state-of-the-art (SotA) segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
- Augmenting Automatic Speech Recognition Models with Disfluency Detection [12.45703869323415]
Speech disfluency commonly occurs in conversational and spontaneous speech.
Current research mainly focuses on detecting disfluencies within transcripts, overlooking their exact location and duration in the speech.
We present an inference-only approach to augment any ASR model with the ability to detect open-set disfluencies.
arXiv Detail & Related papers (2024-09-16T11:13:14Z)
- Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights based on the training dynamics of the classifiers to the distantly supervised labels.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
- Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models [84.8919069953397]
Self-TAught Recognizer (STAR) is an unsupervised adaptation framework for speech recognition systems.
We show that STAR achieves an average of 13.5% relative reduction in word error rate across 14 target domains.
STAR exhibits high data efficiency, requiring less than one hour of unlabeled data.
arXiv Detail & Related papers (2024-05-23T04:27:11Z)
- Detecting Speech Abnormalities with a Perceiver-based Sequence Classifier that Leverages a Universal Speech Model [4.503292461488901]
We propose a Perceiver-based sequence classifier to detect abnormalities in speech reflective of several neurological disorders.
We combine this classifier with a Universal Speech Model (USM) that is trained (unsupervised) on 12 million hours of diverse audio recordings.
Our model outperforms standard transformer (80.9%) and perceiver (81.8%) models and achieves an average accuracy of 83.1%.
arXiv Detail & Related papers (2023-10-16T21:07:12Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, the generative capability of LLMs can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Wake Word Detection with Alignment-Free Lattice-Free MMI [66.12175350462263]
Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input.
We present novel methods to train a hybrid DNN/HMM wake word detection system from partially labeled training data.
We evaluate our methods on two real data sets, showing 50%--90% reduction in false rejection rates at pre-specified false alarm rates over the best previously published figures.
arXiv Detail & Related papers (2020-05-17T19:22:25Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.