Related papers: Large scale weakly and semi-supervised learning for low-resource video ASR

Large scale weakly and semi-supervised learning for low-resource video ASR

URL: http://arxiv.org/abs/2005.07850v2
Date: Fri, 7 Aug 2020 01:17:55 GMT
Title: Large scale weakly and semi-supervised learning for low-resource video ASR
Authors: Kritika Singh, Vimal Manohar, Alex Xiao, Sergey Edunov, Ross Girshick, Vitaliy Liptchinsky, Christian Fuegen, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed
Abstract summary: We compare self-labeling and weakly-supervised pretraining approaches for transcribing social media videos. We find that sequence-level distillation for encoder-decoder models provides the largest relative WER reduction of 20% compared to the strongest data-augmented supervised baseline.
Score: 32.33625853364696
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on the other. We investigate distillation methods at the frame level and the sequence level for hybrid, encoder-only CTC-based, and encoder-decoder speech recognition systems on Dutch and Romanian languages using 27,000 and 58,000 hours of unlabeled audio respectively. Although all approaches improved upon their respective baseline WERs by more than 8%, sequence-level distillation for encoder-decoder models provided the largest relative WER reduction of 20% compared to the strongest data-augmented supervised baseline.

Related papers

Open Source State-Of-the-Art Solution for Romanian Speech Recognition [47.27624927463166]
We present a new Romanian Automatic Speech Recognition (ASR) system based on NVIDIA's FastConformer architecture.<n>We train our model on a large corpus of weakly supervised transcriptions, totaling over 2,600 hours of speech.<n>Our system achieves state-of-the-art performance across all Romanian evaluation benchmarks.
arXiv Detail & Related papers (2025-11-05T11:02:16Z)
BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation [9.90081460759926]
We propose BEARD, a novel framework designed to adapt Whisper's encoder using unlabeled data.<n>Unlike traditional self-supervised learning methods, BEARD uniquely combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder, ensuring the encoder's complementarity with the pre-trained decoder.<n>Our experiments focus on the ATCO2 corpus from the challenging Air Traffic Control (ATC) communications domain, characterized by non-native speech, noise, and specialized phraseology.
arXiv Detail & Related papers (2025-10-28T16:01:24Z)
Towards Unsupervised Speech Recognition at the Syllable-Level [95.54031547995874]
We introduce a syllable-level UASR framework based on masked language modeling.<n>We generalize effectively to Mandarin, a language that has remained particularly difficult for prior methods.
arXiv Detail & Related papers (2025-10-04T02:56:33Z)
High-Fidelity Speech Enhancement via Discrete Audio Tokens [35.61634772862795]
DAC-SE1 is a language model-based SE framework leveraging discrete high-resolution audio representations.<n>Our experiments show that DAC-SE1 surpasses state-of-the-art autoregressive SE methods on both objective perceptual metrics and in a MUSHRA human evaluation.
arXiv Detail & Related papers (2025-10-02T16:38:05Z)
Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges [58.80034860169605]
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech.<n>This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions.
arXiv Detail & Related papers (2025-07-24T07:56:24Z)
BRIDLE: Generalized Self-supervised Learning with Quantization [15.121857164574704]
Self-supervised learning has been a powerful approach for learning meaningful representations from unlabeled data across various domains. Inspired by BERT's success in capturing deep bidirectional contexts in natural language processing, similar frameworks have been adapted to other modalities such as audio. We introduce BRIDLE, a self-supervised pretraining framework that incorporates residual quantization into the bidirectional training process.
arXiv Detail & Related papers (2025-02-04T08:54:06Z)
Robust Persian Digit Recognition in Noisy Environments Using Hybrid CNN-BiGRU Model [1.5566524830295307]
This study addresses isolated spoken Persian digit recognition (zero to nine) under noisy conditions. A hybrid model combining residual convolutional neural networks and bidirectional gated units (BiGRU) is proposed. Experimental results demonstrate the model's effectiveness, achieving 98.53%, 96.10%, and 95.92% accuracy on training, validation, and test sets.
arXiv Detail & Related papers (2024-12-14T15:11:42Z)
Tailored Design of Audio-Visual Speech Recognition Models using Branchformers [0.0]
We propose a novel framework for the design of parameter-efficient Audio-Visual Speech Recognition systems. To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio-visual unified encoder. Our models achieve competitive word error rates (WER) of approximately 2.5% for English and surpass existing approaches for Spanish.
arXiv Detail & Related papers (2024-07-09T07:15:56Z)
Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations [16.577870835480585]
We present a comprehensive analysis on building ASR systems with discrete codes. We investigate different methods for training such as quantization schemes and time-domain vs spectral feature encodings. We introduce a pipeline that outperforms Encodec at similar bit-rate.
arXiv Detail & Related papers (2024-07-03T20:51:41Z)
Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition [1.0690007351232649]
We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent. Experiment results demonstrate that our proposed methods outperform the baseline with relative reductions of 22.1$%$ and 17.2$%$ in character error rate (CER) across multi accent test datasets.
arXiv Detail & Related papers (2024-07-03T11:35:52Z)
MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders. Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale [64.10124092250126]
Unpaired text and audio injection have emerged as dominant methods for improving ASR performance in the absence of a large labeled corpus. In this work, we compare three state-of-the-art semi-supervised methods encompassing both unpaired text and audio as well as several of their combinations in a controlled setting. We find that in our setting these methods offer many improvements beyond raw WER, including substantial gains in tail-word WER, decoder computation during inference, and lattice density.
arXiv Detail & Related papers (2023-04-19T18:09:27Z)
Introducing Semantics into Speech Encoders [91.37001512418111]
We propose an unsupervised way of incorporating semantic information from large language models into self-supervised speech encoders without labeled audio transcriptions. Our approach achieves similar performance as supervised methods trained on over 100 hours of labeled audio transcripts.
arXiv Detail & Related papers (2022-11-15T18:44:28Z)
Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition. We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem. We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers [33.725831884078744]
The proposed CTC-CRF framework inherits the data-efficiency of the hybrid approach and the simplicity of the end-to-end approach. We investigate techniques to enable the recently developed wordpiece modeling units and Conformer neural networks to be succesfully applied in CTC-CRFs.
arXiv Detail & Related papers (2021-07-07T04:12:06Z)
Capturing scattered discriminative information using a deep architecture in acoustic scene classification [49.86640645460706]
In this study, we investigate various methods to capture discriminative information and simultaneously mitigate the overfitting problem. We adopt a max feature map method to replace conventional non-linear activations in a deep neural network. Two data augment methods and two deep architecture modules are further explored to reduce overfitting and sustain the system's discriminative power.
arXiv Detail & Related papers (2020-07-09T08:32:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.