A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale
 - URL: http://arxiv.org/abs/2304.11053v1
 - Date: Wed, 19 Apr 2023 18:09:27 GMT
 - Authors: Cal Peyser, Michael Picheny, Kyunghyun Cho, Rohit Prabhavalkar, Ronny Huang, Tara Sainath
 - Abstract summary: Unpaired text and audio injection have emerged as dominant methods for improving ASR performance in the absence of a large labeled corpus.
In this work, we compare three state-of-the-art semi-supervised methods encompassing both unpaired text and audio as well as several of their combinations in a controlled setting.
We find that in our setting these methods offer many improvements beyond raw WER, including substantial gains in tail-word WER, decoder computation during inference, and lattice density.
 - License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
 - Abstract:   Unpaired text and audio injection have emerged as dominant methods for
improving ASR performance in the absence of a large labeled corpus. However,
little guidance exists on deploying these methods to improve production ASR
systems that are trained on very large supervised corpora and with realistic
requirements like a constrained model size and CPU budget, streaming
capability, and a rich lattice for rescoring and for downstream NLU tasks. In
this work, we compare three state-of-the-art semi-supervised methods
encompassing both unpaired text and audio as well as several of their
combinations in a controlled setting using joint training. We find that in our
setting these methods offer many improvements beyond raw WER, including
substantial gains in tail-word WER, decoder computation during inference, and
lattice density.
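The joint-training setup described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration of how a supervised ASR loss on paired data might be combined with auxiliary losses from unpaired text and unpaired audio in one objective; the function names, weighting scheme, and default weights are assumptions for illustration, not the paper's actual recipe.

```python
# Hypothetical sketch of joint training over paired and unpaired data sources.
# Each loss term stands in for a real forward pass through a shared
# encoder/decoder; the weights are illustrative, not tuned values.

def joint_training_loss(paired_loss, unpaired_text_loss, unpaired_audio_loss,
                        text_weight=0.3, audio_weight=0.3):
    """Combine supervised and semi-supervised loss terms into one objective."""
    return (paired_loss
            + text_weight * unpaired_text_loss
            + audio_weight * unpaired_audio_loss)

def training_step(batch_losses):
    """One joint step: each available data source contributes its own term.

    `batch_losses` maps a source name ("paired", "text", "audio") to the
    scalar loss computed for that mini-batch; missing sources contribute 0.
    """
    return joint_training_loss(
        paired_loss=batch_losses["paired"],
        unpaired_text_loss=batch_losses.get("text", 0.0),
        unpaired_audio_loss=batch_losses.get("audio", 0.0),
    )
```

For example, a step seeing all three sources with losses 2.0, 1.0, and 0.5 would, under these assumed weights, yield a combined loss of 2.0 + 0.3·1.0 + 0.3·0.5 = 2.45.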

Related papers
- Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation [53.16213723669751]
Large-scale models (LSMs) can be an effective framework for semantic representation and understanding. However, their direct deployment is often hindered by high computational complexity and resource requirements. This paper proposes a novel knowledge distillation-based semantic communication framework.
arXiv  Detail & Related papers  (2025-08-04T07:47:18Z)
- Communication-Efficient Wireless Federated Fine-Tuning for Large-Scale AI Models [13.742950928229078]
Low-Rank Adaptation (LoRA) addresses these issues by training compact, low-rank matrices instead of fully fine-tuning large models.
This paper introduces a wireless federated LoRA fine-tuning framework that optimizes both learning performance and communication efficiency.
arXiv  Detail & Related papers  (2025-05-01T06:15:38Z)
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv  Detail & Related papers  (2024-11-04T16:46:53Z)
- Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations [16.577870835480585]
We present a comprehensive analysis on building ASR systems with discrete codes.
We investigate different methods for training such as quantization schemes and time-domain vs spectral feature encodings.
We introduce a pipeline that outperforms Encodec at a similar bit-rate.
arXiv  Detail & Related papers  (2024-07-03T20:51:41Z)
- Efficient infusion of self-supervised representations in Automatic Speech Recognition [1.2972104025246092]
Self-supervised learning (SSL) models such as Wav2vec and HuBERT yield state-of-the-art results on speech-related tasks.
We propose two simple approaches that use (1) framewise addition and (2) cross-attention mechanisms to efficiently incorporate the representations from the SSL model into the ASR architecture.
Our approach results in faster training and yields significant performance gains on the Librispeech and Tedlium datasets.
arXiv  Detail & Related papers  (2024-04-19T05:01:12Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv  Detail & Related papers  (2023-08-14T08:19:24Z)
- Can SAM Boost Video Super-Resolution? [78.29033914169025]
We propose a simple yet effective module -- the SAM-guidEd refinEment Module (SEEM).
This lightweight plug-in module is specifically designed to leverage the attention mechanism for the generation of semantic-aware features.
We apply our SEEM to two representative methods, EDVR and BasicVSR, resulting in consistently improved performance with minimal implementation effort.
arXiv  Detail & Related papers  (2023-05-11T02:02:53Z)
- Unifying Synergies between Self-supervised Learning and Dynamic Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and gated sub-network from scratch in an SSL setting.
The co-evolution of both the dense and gated encoders during pre-training offers a good accuracy-efficiency trade-off.
arXiv  Detail & Related papers  (2023-01-22T17:12:58Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv  Detail & Related papers  (2022-06-21T06:08:30Z)
- Fine-grained Multi-Modal Self-Supervised Learning [4.850800439026724]
Multi-Modal Self-Supervised Learning from videos has been shown to improve a model's performance on various downstream tasks.
Such pre-training requires large batch sizes and a large amount of computation resources due to the noise present in uncurated data.
We propose a fine-grained multi-modal self-supervised training scheme that computes the similarity between embeddings at a finer scale.
arXiv  Detail & Related papers  (2021-12-22T19:17:45Z) 
This list is automatically generated from the titles and abstracts of the papers in this site.