A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at
Scale
- URL: http://arxiv.org/abs/2304.11053v1
- Date: Wed, 19 Apr 2023 18:09:27 GMT
- Title: A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at
Scale
- Authors: Cal Peyser, Michael Picheny, Kyunghyun Cho, Rohit Prabhavalkar, Ronny
Huang, Tara Sainath
- Abstract summary: Unpaired text and audio injection have emerged as dominant methods for improving ASR performance in the absence of a large labeled corpus.
In this work, we compare three state-of-the-art semi-supervised methods encompassing both unpaired text and audio as well as several of their combinations in a controlled setting.
We find that in our setting these methods offer many improvements beyond raw WER, including substantial gains in tail-word WER, decoder computation during inference, and lattice density.
- Score: 64.10124092250126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unpaired text and audio injection have emerged as dominant methods for
improving ASR performance in the absence of a large labeled corpus. However,
little guidance exists on deploying these methods to improve production ASR
systems that are trained on very large supervised corpora and with realistic
requirements like a constrained model size and CPU budget, streaming
capability, and a rich lattice for rescoring and for downstream NLU tasks. In
this work, we compare three state-of-the-art semi-supervised methods
encompassing both unpaired text and audio as well as several of their
combinations in a controlled setting using joint training. We find that in our
setting these methods offer many improvements beyond raw WER, including
substantial gains in tail-word WER, decoder computation during inference, and
lattice density.
Related papers
- Communication-Efficient Wireless Federated Fine-Tuning for Large-Scale AI Models [13.742950928229078]
Low-Rank Adaptation (LoRA) addresses these issues by training compact, low-rank matrices instead of fully fine-tuning large models.
This paper introduces a wireless federated LoRA fine-tuning framework that optimize both learning performance and communication efficiency.
arXiv Detail & Related papers (2025-05-01T06:15:38Z) - Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z) - Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations [16.577870835480585]
We present a comprehensive analysis on building ASR systems with discrete codes.
We investigate different methods for training such as quantization schemes and time-domain vs spectral feature encodings.
We introduce a pipeline that outperforms Encodec at similar bit-rate.
arXiv Detail & Related papers (2024-07-03T20:51:41Z) - Efficient infusion of self-supervised representations in Automatic Speech Recognition [1.2972104025246092]
Self-supervised learned (SSL) models such as Wav2vec and HuBERT yield state-of-the-art results on speech-related tasks.
We propose two simple approaches that use framewise addition and (2) cross-attention mechanisms to efficiently incorporate the representations from the SSL model into the ASR architecture.
Our approach results in faster training and yields significant performance gains on the Librispeech and Tedlium datasets.
arXiv Detail & Related papers (2024-04-19T05:01:12Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Can SAM Boost Video Super-Resolution? [78.29033914169025]
We propose a simple yet effective module -- SAM-guidEd refinEment Module (SEEM)
This light-weight plug-in module is specifically designed to leverage the attention mechanism for the generation of semantic-aware feature.
We apply our SEEM to two representative methods, EDVR and BasicVSR, resulting in consistently improved performance with minimal implementation effort.
arXiv Detail & Related papers (2023-05-11T02:02:53Z) - Unifying Synergies between Self-supervised Learning and Dynamic
Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and gated sub-network from scratch in a SSL setting.
The co-evolution during pre-training of both dense and gated encoder offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z) - Supervision-Guided Codebooks for Masked Prediction in Speech
Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z) - Fine-grained Multi-Modal Self-Supervised Learning [4.850800439026724]
Multi-Modal Self-Supervised Learning from videos has been shown to improve model's performance on various downstream tasks.
Such pre-training requires large batch sizes and a large amount of computation resources due to the noise present in uncurated data.
We propose a fine-grained multi-modal self-supervised training scheme that computes the similarity between embeddings at finer-scale.
arXiv Detail & Related papers (2021-12-22T19:17:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.