BiRQ: Bi-Level Self-Labeling Random Quantization for Self-Supervised Speech Recognition
- URL: http://arxiv.org/abs/2509.15430v1
- Date: Thu, 18 Sep 2025 21:09:29 GMT
- Title: BiRQ: Bi-Level Self-Labeling Random Quantization for Self-Supervised Speech Recognition
- Authors: Liuyuan Jiang, Xiaodong Cui, Brian Kingsbury, Tianyi Chen, Lisha Chen
- Abstract summary: BiRQ is a bilevel SSL framework that combines the efficiency of BEST-RQ with the refinement benefits of HuBERT-style label enhancement. We validate our method on various datasets, including 960-hour LibriSpeech, 150-hour AMI meetings and 5,000-hour YODAS.
- Score: 63.45645200463539
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech is a rich signal, and labeled audio-text pairs are costly, making self-supervised learning essential for scalable representation learning. A core challenge in speech SSL is generating pseudo-labels that are both informative and efficient: strong labels, such as those used in HuBERT, improve downstream performance but rely on external encoders and multi-stage pipelines, while efficient methods like BEST-RQ achieve simplicity at the cost of weaker labels. We propose BiRQ, a bilevel SSL framework that combines the efficiency of BEST-RQ with the refinement benefits of HuBERT-style label enhancement. The key idea is to reuse part of the model itself as a pseudo-label generator: intermediate representations are discretized by a random-projection quantizer to produce enhanced labels, while anchoring labels derived directly from the raw input stabilize training and prevent collapse. Training is formulated as an efficient first-order bilevel optimization problem, solved end-to-end with differentiable Gumbel-softmax selection. This design eliminates the need for external label encoders, reduces memory cost, and enables iterative label refinement in an end-to-end fashion. BiRQ consistently improves over BEST-RQ while maintaining low complexity and computational efficiency. We validate our method on various datasets, including 960-hour LibriSpeech, 150-hour AMI meetings and 5,000-hour YODAS, demonstrating consistent gains over BEST-RQ.
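To make the mechanism concrete, here is a minimal sketch of the two ingredients the abstract names: a BEST-RQ-style random-projection quantizer (a frozen projection plus a frozen codebook that turns continuous features into discrete pseudo-labels) and a differentiable Gumbel-softmax selection over candidate intermediate layers. All class names, shapes, and hyperparameters below are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

class RandomProjectionQuantizer(torch.nn.Module):
    """BEST-RQ-style quantizer: a frozen random projection followed by a
    nearest-neighbor lookup in a frozen random codebook (no gradients)."""

    def __init__(self, feat_dim: int, code_dim: int = 16, num_codes: int = 8192):
        super().__init__()
        proj = torch.empty(feat_dim, code_dim)
        torch.nn.init.xavier_uniform_(proj)
        codebook = F.normalize(torch.randn(num_codes, code_dim), dim=-1)
        # Buffers, not parameters: the quantizer is never trained.
        self.register_buffer("proj", proj)
        self.register_buffer("codebook", codebook)

    @torch.no_grad()
    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) -> discrete label ids: (batch, time)
        z = F.normalize(feats @ self.proj, dim=-1)
        # With unit-norm vectors, max cosine similarity = nearest neighbor.
        return (z @ self.codebook.t()).argmax(dim=-1)

def select_label_source(layer_feats, logits, tau: float = 1.0):
    """Differentiable Gumbel-softmax mixture over candidate intermediate
    layers whose quantized output serves as the enhanced pseudo-label."""
    weights = F.gumbel_softmax(logits, tau=tau)     # (num_layers,)
    stacked = torch.stack(layer_feats, dim=0)       # (L, B, T, D)
    return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)
```

In a BiRQ-like setup, quantizing the raw input features would produce the anchoring labels that stabilize training, while quantizing the Gumbel-softmax mixture of intermediate representations would produce the enhanced labels; both then drive masked-prediction losses.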
Related papers
- Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement [8.804897656598051]
We propose SiDyP: Simplex Label Diffusion with Dynamic Prior to calibrate the classifier's prediction. Our framework can increase the performance of the BERT classifier fine-tuned on both zero-shot and few-shot LLM-generated noisy label datasets by an average of 7.21% and 7.30%, respectively.
arXiv Detail & Related papers (2025-05-26T08:31:55Z) - Efficient Adaptive Label Refinement for Label Noise Learning [14.617885790129336]
We propose Adaptive Label Refinement (ALR) to avoid incorrect labels and thoroughly learn clean samples. ALR is simple and efficient, requiring no prior knowledge of noise or auxiliary datasets. We validate ALR's effectiveness through experiments on benchmark datasets with artificial label noise (CIFAR-10/100) and real-world datasets with inherent noise (ANIMAL-10N, Clothing1M, WebVision).
arXiv Detail & Related papers (2025-02-01T09:58:08Z) - Improved Adaptive Algorithm for Scalable Active Learning with Weak Labeler [89.27610526884496]
Weak Labeler Active Cover (WL-AC) is able to robustly leverage the lower quality weak labelers to reduce the query complexity while retaining the desired level of accuracy.
We show its effectiveness on the corrupted-MNIST dataset by significantly reducing the number of labels while keeping the same accuracy as in passive learning.
arXiv Detail & Related papers (2022-11-04T02:52:54Z) - Filter and evolve: progressive pseudo label refining for semi-supervised automatic speech recognition [5.735000563764309]
Low-quality pseudo labels can misguide decision boundaries and degrade performance.
We propose a simple yet effective strategy to filter low quality pseudo labels.
Experiments on LibriSpeech show that these filtered samples enable the refined model to yield more correct predictions.
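The summary names the strategy but not its exact criterion; as a purely illustrative stand-in, the sketch below keeps a pseudo-labeled utterance only when the average per-token log-probability of its hypothesized transcript clears a threshold. The data layout and threshold are assumptions, not the paper's method.

```python
import math

def filter_pseudo_labels(hypotheses, threshold=-0.3):
    """Keep pseudo-labeled utterances whose average per-token log-probability
    exceeds a confidence threshold (illustrative criterion only).

    hypotheses: iterable of (utterance_id, tokens, token_logprobs) triples
    """
    kept = []
    for utt_id, tokens, logprobs in hypotheses:
        if not tokens:
            continue  # skip empty hypotheses outright
        avg_logprob = sum(logprobs) / len(tokens)
        if avg_logprob >= threshold:
            kept.append((utt_id, tokens, math.exp(avg_logprob)))
    return kept

# Toy usage: the low-confidence second hypothesis is filtered out.
hyps = [
    ("utt1", ["hello", "world"], [-0.05, -0.10]),
    ("utt2", ["noisy", "guess"], [-1.20, -2.00]),
]
print(filter_pseudo_labels(hyps))  # only utt1 survives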
arXiv Detail & Related papers (2022-10-28T16:15:58Z) - SELC: Self-Ensemble Label Correction Improves Learning with Noisy Labels [4.876988315151037]
Deep neural networks are prone to overfitting noisy labels, resulting in poor generalization performance.
We present a method self-ensemble label correction (SELC) to progressively correct noisy labels and refine the model.
SELC obtains more promising and stable results in the presence of class-conditional, instance-dependent, and real-world label noise.
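A minimal sketch of the self-ensembling idea, under the assumption that SELC-style correction blends the current soft targets with the model's averaged predictions; the initialization, blending weight, and schedule here are illustrative, not the paper's exact recipe.

```python
import numpy as np

def selc_update_targets(targets, probs, alpha=0.9):
    """One self-ensemble correction step: exponentially average the model's
    predictions into the soft targets (alpha is an illustrative choice).

    targets: (N, C) soft targets, initialized from the given one-hot labels
    probs:   (N, C) current softmax predictions of the model
    """
    return alpha * targets + (1.0 - alpha) * probs

# Toy usage: a noisy label (class 0) drifts toward the consistently
# predicted class 1 over successive epochs.
targets = np.array([[1.0, 0.0]])
probs = np.array([[0.1, 0.9]])
for _ in range(20):
    targets = selc_update_targets(targets, probs)
print(targets.round(3))  # most target mass has moved to class 1
```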
arXiv Detail & Related papers (2022-05-02T18:42:47Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - A Label Dependence-aware Sequence Generation Model for Multi-level
Implicit Discourse Relation Recognition [31.179555215952306]
Implicit discourse relation recognition is a challenging but crucial task in discourse analysis.
We propose a Label Dependence-aware Sequence Generation Model (LDSGM) for it.
We develop a mutual learning enhanced training method to exploit the label dependence in a bottom-up direction.
arXiv Detail & Related papers (2021-12-22T09:14:03Z) - S3: Supervised Self-supervised Learning under Label Noise [53.02249460567745]
In this paper we address the problem of classification in the presence of label noise.
In the heart of our method is a sample selection mechanism that relies on the consistency between the annotated label of a sample and the distribution of the labels in its neighborhood in the feature space.
Our method significantly surpasses previous methods on both CIFAR10/CIFAR100 with artificial noise and real-world noisy datasets such as WebVision and ANIMAL-10N.
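The neighborhood-consistency mechanism described above is easy to sketch: a sample is treated as clean when its annotated label agrees with most labels among its nearest neighbors in feature space. The value of k, the agreement ratio, and the brute-force distance computation are all illustrative choices.

```python
import numpy as np

def select_clean_samples(features, labels, k=10, agreement=0.6):
    """Keep samples whose annotated label matches at least `agreement` of
    their k nearest neighbors' labels (sketch of the idea, not the paper's
    exact rule).

    features: (N, D) embeddings; labels: (N,) integer class labels
    """
    # Brute-force pairwise squared Euclidean distances.
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)             # a sample is not its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :k]
    match = (labels[neighbors] == labels[:, None]).mean(axis=1)
    return np.where(match >= agreement)[0]   # indices treated as clean

# Toy usage: two well-separated clusters, one flipped label.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(size=(20, 2)), rng.normal(size=(20, 2)) + 5])
labs = np.array([0] * 20 + [1] * 20)
labs[3] = 1  # inject label noise
print(3 in select_clean_samples(feats, labs, k=5))  # False: sample 3 is dropped
```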
arXiv Detail & Related papers (2021-11-22T15:49:20Z) - Dash: Semi-Supervised Learning with Dynamic Thresholding [72.74339790209531]
We propose Dash, a semi-supervised learning (SSL) approach that trains on unlabeled examples selected by a dynamic threshold.
Dash adapts its selection of unlabeled data as training progresses.
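As a rough illustration of dynamic thresholding (the paper derives its own schedule; the geometric decay below is an assumption), unlabeled examples are admitted for training only while their loss against the pseudo-label stays under a threshold that tightens over time.

```python
def dynamic_threshold(initial, epoch, decay=0.9, floor=0.05):
    """Threshold that tightens as training progresses (a simple geometric
    schedule for illustration; Dash's actual schedule is derived in the paper)."""
    return max(initial * decay ** epoch, floor)

def select_unlabeled(losses, epoch, initial=2.0):
    """Keep indices of unlabeled examples whose current pseudo-label loss
    falls below this epoch's threshold."""
    thr = dynamic_threshold(initial, epoch)
    return [i for i, loss in enumerate(losses) if loss <= thr]

# Toy usage: early epochs admit most examples, later epochs only confident ones.
losses = [0.1, 0.8, 1.5, 3.2]
for epoch in (0, 5, 10):
    print(epoch, select_unlabeled(losses, epoch))
```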
arXiv Detail & Related papers (2021-09-01T23:52:29Z) - In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning [53.1047775185362]
Pseudo-labeling (PL) is a general SSL approach that does not rely on domain-specific data augmentations, but it performs relatively poorly in its original formulation.
We argue that PL underperforms due to the erroneous high confidence predictions from poorly calibrated models.
We propose an uncertainty-aware pseudo-label selection (UPS) framework which improves pseudo labeling accuracy by drastically reducing the amount of noise encountered in the training process.
arXiv Detail & Related papers (2021-01-15T23:29:57Z)
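In the spirit of the UPS framework just listed, a pseudo-label can be accepted only when the prediction is both confident and stable across stochastic forward passes (e.g., Monte-Carlo dropout). The thresholds and uncertainty estimate below are illustrative assumptions, not the paper's exact criteria.

```python
import numpy as np

def ups_style_select(mc_probs, conf_min=0.9, unc_max=0.05):
    """Keep a pseudo-label only if its mean probability is high and its
    probability varies little across T stochastic passes (sketch only).

    mc_probs: (T, N, C) softmax outputs from T Monte-Carlo dropout passes
    """
    mean_probs = mc_probs.mean(axis=0)                 # (N, C)
    pseudo = mean_probs.argmax(axis=1)                 # hard pseudo-labels
    conf = mean_probs.max(axis=1)                      # confidence per sample
    # Std. dev. of the chosen class's probability across the T passes.
    unc = mc_probs.std(axis=0)[np.arange(len(pseudo)), pseudo]
    keep = (conf >= conf_min) & (unc <= unc_max)
    return pseudo, keep

# Toy usage: the stable prediction is kept; the unstable one is typically rejected.
rng = np.random.default_rng(0)
stable = np.tile([[0.95, 0.05]], (8, 1, 1))            # (T=8, N=1, C=2)
unstable = rng.dirichlet([1.0, 1.0], size=(8, 1))      # random simplex points
mc = np.concatenate([stable, unstable], axis=1)        # (8, 2, 2)
print(ups_style_select(mc))
```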