Constrained Parameter Regularization
- URL: http://arxiv.org/abs/2311.09058v2
- Date: Wed, 6 Dec 2023 14:20:53 GMT
- Title: Constrained Parameter Regularization
- Authors: J\"org K.H. Franke, Michael Hefenbrock, Gregor Koehler, Frank Hutter
- Abstract summary: Regularization is a critical component in deep learning training.
We present constrained parameter regularization (CPR) as an alternative to traditional weight decay.
CPR counteracts the effects of grokking and consistently matches or outperforms traditional weight decay.
- Score: 41.055148686036176
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Regularization is a critical component in deep learning training, with weight
decay being a commonly used approach. It applies a constant penalty coefficient
uniformly across all parameters. This may be unnecessarily restrictive for some
parameters, while insufficiently restricting others. To dynamically adjust
penalty coefficients for different parameter groups, we present constrained
parameter regularization (CPR) as an alternative to traditional weight decay.
Instead of applying a single constant penalty to all parameters, we enforce an
upper bound on a statistical measure (e.g., the L$_2$-norm) of parameter
groups. Consequently, learning becomes a constrained optimization problem, which
we address by an adaptation of the augmented Lagrangian method. CPR only
requires two hyperparameters and incurs no measurable runtime overhead.
Additionally, we propose a simple but efficient mechanism to adapt the upper
bounds during the optimization. We provide empirical evidence of CPR's efficacy
in experiments on the "grokking" phenomenon, computer vision, and language
modeling tasks. Our results demonstrate that CPR counteracts the effects of
grokking and consistently matches or outperforms traditional weight decay.
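For intuition, here is a minimal sketch of the constrained-regularization idea described in the abstract, assuming a squared-L2 measure per parameter group and a simple Lagrange-multiplier update. The class name CPRRegularizer and the hyperparameter names mu and kappa are illustrative, not the authors' reference implementation.

```python
# Minimal sketch of the CPR idea (illustrative, not the authors' reference code):
# each parameter group j keeps a Lagrange multiplier lambda_j that replaces the
# fixed weight-decay coefficient; lambda_j grows only while the group's
# regularization measure exceeds its upper bound kappa_j.
import torch


class CPRRegularizer:
    def __init__(self, param_groups, kappa, mu=1.0):
        # param_groups: list of lists of tensors (e.g., one list per layer)
        # kappa: per-group upper bound on the measure; mu: multiplier update rate
        self.param_groups = param_groups
        self.kappa = kappa
        self.mu = mu
        self.lmbda = [0.0] * len(param_groups)

    @staticmethod
    def _measure(group):
        # statistical measure s(theta_j): here the mean squared L2-norm
        n = sum(p.numel() for p in group)
        return sum(p.pow(2).sum() for p in group) / n

    @torch.no_grad()
    def step(self):
        # call after loss.backward() and before optimizer.step()
        for j, group in enumerate(self.param_groups):
            c = float(self._measure(group) - self.kappa[j])        # constraint value c_j
            self.lmbda[j] = max(0.0, self.lmbda[j] + self.mu * c)  # multiplier update
            if self.lmbda[j] > 0.0:
                n = sum(p.numel() for p in group)
                for p in group:
                    if p.grad is not None:
                        # add lambda_j * d s(theta_j)/d p = lambda_j * 2 p / n
                        p.grad.add_(p, alpha=2.0 * self.lmbda[j] / n)
```

In the spirit of the abstract's bound-adaptation mechanism, one simple choice would be to set each kappa from the group's measured norm after a short warm-up phase; the exact rule used in the paper may differ.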
Related papers
- Regularized Low-Rank Adaptation for Few-Shot Organ Segmentation [17.875098424936542]
Low-Rank Adaptation (LoRA) is a notable approach based on the assumption that the adaptation inherently occurs in a low-dimensional subspace.
We introduce a novel approach for medical image segmentation that dynamically adjusts the intrinsic rank during adaptation.
Our method is evaluated in a realistic few-shot fine-tuning setting, where we compare it first to the standard LoRA and then to several other PEFT methods.
arXiv Detail & Related papers (2025-07-21T16:51:53Z) - On the Role of Weight Decay in Collaborative Filtering: A Popularity Perspective [38.87580457343038]
Collaborative filtering (CF) enables large-scale recommendation systems by encoding information from historical user-item interactions into dense ID-embedding tables.
We argue that one core component of these pipelines is heavily overlooked: weight decay.
We propose PRISM (Popularity-awaRe Initialization Strategy for embedding Magnitudes) to simplify the training of high-performing CF models.
arXiv Detail & Related papers (2025-05-16T14:41:57Z) - Training Deep Learning Models with Norm-Constrained LMOs [56.00317694850397]
We propose a new family of algorithms that uses the linear minimization oracle (LMO) to adapt to the geometry of the problem.
We demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam.
arXiv Detail & Related papers (2025-02-11T13:10:34Z) - ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts.
Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z) - Rethinking Weight Decay for Robust Fine-Tuning of Foundation Models [27.847140934456288]
This paper proposes a new weight decay technique, Selective Projection Decay (SPD).
SPD selectively imposes a strong penalty on certain layers while allowing others to change freely.
When equipped with SPD, Adam consistently provides better in-distribution robustness and out-of-distribution performance on benchmarks.
arXiv Detail & Related papers (2024-11-03T23:36:53Z) - LoRTA: Low Rank Tensor Adaptation of Large Language Models [70.32218116940393]
Low Rank Adaptation (LoRA) is a popular Parameter-Efficient Fine-Tuning (PEFT) method.
We propose a higher-order Candecomp/Parafac (CP) decomposition, enabling a more compact and flexible representation.
Our method can achieve a reduction in the number of parameters while maintaining comparable performance.
arXiv Detail & Related papers (2024-10-05T06:59:50Z) - SVFit: Parameter-Efficient Fine-Tuning of Large Pre-Trained Models Using Singular Values [12.137869917556415]
Large pre-trained models (LPMs) have demonstrated exceptional performance in diverse natural language processing and computer vision tasks.
However, fully fine-tuning these models poses substantial memory challenges, particularly in resource-constrained environments.
We propose SVFit, a novel PEFT approach that leverages singular value decomposition (SVD) to initialize low-rank matrices, using critical singular values as trainable parameters.
arXiv Detail & Related papers (2024-09-09T08:44:53Z) - Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z) - AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning [143.23123791557245]
Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP.
We propose AdaLoRA, which adaptively allocates the parameter budget among weight matrices according to their importance score.
We conduct extensive experiments with several pre-trained models on natural language processing, question answering, and natural language generation to validate the effectiveness of AdaLoRA.
arXiv Detail & Related papers (2023-03-18T22:36:25Z) - Differentially Private Learning with Per-Sample Adaptive Clipping [8.401653565794353]
We propose a Differentially Private Per-Sample Adaptive Clipping (DP-PSAC) algorithm based on a non-monotonic adaptive weight function.
We show that DP-PSAC outperforms or matches the state-of-the-art methods on multiple main-stream vision and language tasks.
arXiv Detail & Related papers (2022-12-01T07:26:49Z) - META-STORM: Generalized Fully-Adaptive Variance Reduced SGD for
Unbounded Functions [23.746620619512573]
Recent work overcomes the effect of having to compute gradients of "megabatches"
Work is widely used after update with competitive deep learning tasks.
arXiv Detail & Related papers (2022-09-29T15:12:54Z) - Parameter-Efficient Sparsity for Large Language Models Fine-Tuning [63.321205487234074]
We propose a Parameter-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparse-aware training.
Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) demonstrate that PST performs on par with or better than previous sparsity methods.
arXiv Detail & Related papers (2022-05-23T02:43:45Z) - Efficient and Differentiable Conformal Prediction with General Function
Classes [96.74055810115456]
We propose a generalization of conformal prediction to multiple learnable parameters.
We show that it achieves approximately valid population coverage and near-optimal efficiency within the considered function class.
Experiments show that our algorithm is able to learn valid prediction sets and improve the efficiency significantly.
arXiv Detail & Related papers (2022-02-22T18:37:23Z) - Constrained Optimization for Training Deep Neural Networks Under Class
Imbalance [9.557146081524008]
We introduce a novel constraint that can be used with existing loss functions to enforce maximal area under the ROC curve.
We present experimental results for image-based classification applications using CIFAR10 and an in-house medical imaging dataset.
arXiv Detail & Related papers (2021-02-21T09:49:36Z) - Rethinking the Hyperparameters for Fine-tuning [78.15505286781293]
Fine-tuning from pre-trained ImageNet models has become the de-facto standard for various computer vision tasks.
Current practices for fine-tuning typically involve selecting an ad-hoc choice of hyperparameters.
This paper re-examines several common practices of setting hyperparameters for fine-tuning.
arXiv Detail & Related papers (2020-02-19T18:59:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.