Related papers: Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core Clipping

Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core Clipping

URL: http://arxiv.org/abs/2406.02004v2
Date: Wed, 5 Jun 2024 21:44:10 GMT
Title: Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core Clipping
Authors: Lun Wang, Om Thakkar, Zhong Meng, Nicole Rafidi, Rohit Prabhavalkar, Arun Narayanan,
Abstract summary: Per-core clip-ping (PCC) can effectively mitigate unintended memorization in ASR models. PCC positively influences ASR performance metrics, leading to improved convergence rates and reduced word error rates.
Score: 27.547461769425855
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Gradient clipping plays a vital role in training large-scale automatic speech recognition (ASR) models. It is typically applied to minibatch gradients to prevent gradient explosion, and to the individual sample gradients to mitigate unintended memorization. This work systematically investigates the impact of a specific granularity of gradient clipping, namely per-core clip-ping (PCC), across training a wide range of ASR models. We empirically demonstrate that PCC can effectively mitigate unintended memorization in ASR models. Surprisingly, we find that PCC positively influences ASR performance metrics, leading to improved convergence rates and reduced word error rates. To avoid tuning the additional hyperparameter introduced by PCC, we further propose a novel variant, adaptive per-core clipping (APCC), for streamlined optimization. Our findings highlight the multifaceted benefits of PCC as a strategy for robust, privacy-forward ASR model training.

Related papers

Adapting Multimodal Foundation Models for Few-Shot Learning: A Comprehensive Study on Contrastive Captioners [1.2461503242570642]
This paper presents a study on adapting the Contrastive Captioners (CoCa) visual backbone for few-shot image classification.<n>We identify an "augmentation divergence": while strong data augmentation degrades the performance of linear probing in low-shot settings, it is essential for stabilizing LoRA fine-tuning.<n>We also demonstrate that hybrid objectives incorporating Supervised Contrastive (SupCon) loss yield consistent performance improvements over standard Cross-Entropy.
arXiv Detail & Related papers (2025-12-14T20:13:21Z)
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices [61.361819972410046]
We show why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE.<n>This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training.
arXiv Detail & Related papers (2025-12-01T07:45:39Z)
Teacher-Guided One-Shot Pruning via Context-Aware Knowledge Distillation [7.870062030206608]
Unstructured pruning remains a powerful strategy for compressing deep neural networks.<n>We introduce a novel teacher-guided pruning framework that tightly integrates Knowledge Distillation (KD) with importance score estimation.<n>Our method facilitates a one-shot global pruning strategy that efficiently eliminates redundant weights while preserving essential representations.
arXiv Detail & Related papers (2025-11-20T18:56:05Z)
Tractable Sharpness-Aware Learning of Probabilistic Circuits [9.353446248109599]
Probabilistic Circuits (PCs) are a class of generative models that allow exact and tractable inference for a wide range of queries.<n>Recent developments have enabled the learning of deep and expressive PCs, but this increased capacity can often lead to overfitting.<n>Inspired by sharpness aware minimization in neural networks, we propose a Hessian-based regularizer for training PCs.
arXiv Detail & Related papers (2025-08-07T16:13:24Z)
Mitigating Disparate Impact of Differentially Private Learning through Bounded Adaptive Clipping [4.817614848684669]
Differential privacy (DP) has become an essential framework for privacy-preserving machine learning.<n> Gradient clipping, which is often used in DP learning, can suppress larger gradients from challenging samples.<n>We show that this problem is amplified by adaptive clipping, which will often shrink the clipping bound to tiny values to match a well-fitting majority.<n>We propose bounded adaptive clipping, which introduces a tunable lower bound to prevent excessive gradient suppression.
arXiv Detail & Related papers (2025-06-02T07:44:17Z)
ROCM: RLHF on consistency models [8.905375742101707]
We propose a reward optimization framework for applying RLHF to consistency models. We investigate various $f$-divergences as regularization strategies, striking a balance between reward and model consistency.
arXiv Detail & Related papers (2025-03-08T11:19:48Z)
Fine Tuning without Catastrophic Forgetting via Selective Low Rank Adaptation [13.084333776247743]
Fine-tuning can reduce robustness to distribution shifts, impacting out-of-distribution (OOD) performance. We propose a parameter-efficient fine-tuning (PEFT) method, using an indicator function to selectively activate Low-Rank Adaptation (LoRA) blocks. We demonstrate that effective fine-tuning can be achieved with as few as 5% of active blocks, substantially improving efficiency.
arXiv Detail & Related papers (2025-01-26T03:22:22Z)
Enhancing DP-SGD through Non-monotonous Adaptive Scaling Gradient Weight [15.139854970044075]
We introduce Differentially Private Per-sample Adaptive Scaling Clipping (DP-PSASC) This approach replaces traditional clipping with non-monotonous adaptive gradient scaling. Our theoretical and empirical analyses confirm that DP-PSASC preserves gradient privacy and delivers superior performance across diverse datasets.
arXiv Detail & Related papers (2024-11-05T12:47:30Z)
DiSK: Differentially Private Optimizer with Simplified Kalman Filter for Noise Reduction [57.83978915843095]
This paper introduces DiSK, a novel framework designed to significantly enhance the performance of differentially private gradients. To ensure practicality for large-scale training, we simplify the Kalman filtering process, minimizing its memory and computational demands.
arXiv Detail & Related papers (2024-10-04T19:30:39Z)
Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization. A self-regularization strategy is further exploited to maintain the stability in terms of zero-shot generalization of VLMs, dubbed OrthSR. For the first time, we revisit the CLIP and CoOp with our method to effectively improve the model on few-shot image classficiation scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
Reactive Model Correction: Mitigating Harm to Task-Relevant Features via Conditional Bias Suppression [12.44857030152608]
Deep Neural Networks are prone to learning and relying on spurious correlations in the training data, which, for high-risk applications, can have fatal consequences. Various approaches to suppress model reliance on harmful features have been proposed that can be applied post-hoc without additional training. We propose a reactive approach conditioned on model-derived knowledge and eXplainable Artificial Intelligence (XAI) insights.
arXiv Detail & Related papers (2024-04-15T09:16:49Z)
Prior Constraints-based Reward Model Training for Aligning Large Language Models [58.33118716810208]
This paper proposes a Prior Constraints-based Reward Model (namely PCRM) training method to mitigate this problem. PCRM incorporates prior constraints, specifically, length ratio and cosine similarity between outputs of each comparison pair, during reward model training to regulate optimization magnitude and control score margins. Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling.
arXiv Detail & Related papers (2024-04-01T07:49:11Z)
Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re- parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution. Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x. We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
Layer Pruning on Demand with Intermediate CTC [50.509073206630994]
We present a training and pruning method for ASR based on the connectionist temporal classification (CTC) We show that a Transformer-CTC model can be pruned in various depth on demand, improving real-time factor from 0.005 to 0.002 on GPU.
arXiv Detail & Related papers (2021-06-17T02:40:18Z)
Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.