Related papers: Scaling Exponents Across Parameterizations and Optimizers

Scaling Exponents Across Parameterizations and Optimizers

URL: http://arxiv.org/abs/2407.05872v2
Date: Tue, 16 Jul 2024 17:40:09 GMT
Title: Scaling Exponents Across Parameterizations and Optimizers
Authors: Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, Peter J. Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, Jeffrey Pennington,
Abstract summary: We propose a new perspective on parameterization by investigating a key assumption in prior work. Our empirical investigation includes tens of thousands of models trained with all combinations of threes. We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
Score: 94.54718325264218
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Robust and effective scaling of models from small to large width typically requires the precise adjustment of many algorithmic and architectural details, such as parameterization and optimizer choices. In this work, we propose a new perspective on parameterization by investigating a key assumption in prior work about the alignment between parameters and data and derive new theoretical results under weaker assumptions and a broader set of optimizers. Our extensive empirical investigation includes tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 26.8B parameters. We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work. Our results show that all parameterizations, not just maximal update parameterization (muP), can achieve hyperparameter transfer; moreover, our novel per-layer learning rate prescription for standard parameterization outperforms muP. Finally, we demonstrate that an overlooked aspect of parameterization, the epsilon parameter in Adam, must be scaled correctly to avoid gradient underflow and propose Adam-atan2, a new numerically stable, scale-invariant version of Adam that eliminates the epsilon hyperparameter entirely.

Related papers

Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining [56.58170370127227]
We show that optimal learning rate follows a power-law relationship with both model parameters and data sizes, while optimal batch size scales primarily with data sizes. This work is the first work that unifies different model shapes and structures, such as Mixture-of-Experts models and dense transformers.
arXiv Detail & Related papers (2025-03-06T18:58:29Z)
Sparsity May Be All You Need: Sparse Random Parameter Adaptation [7.269130161558109]
Full fine-tuning of large language models for alignment and task adaptation has become prohibitively expensive as models have grown in size. We propose reducing the number of trainable parameters by randomly selecting a small proportion of the model parameters to train on.
arXiv Detail & Related papers (2025-02-21T22:23:16Z)
ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts. Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
Compact Model Parameter Extraction via Derivative-Free Optimization [0.0]
Traditionally, parameter extraction is performed manually by dividing the complete set of parameters into smaller subsets. We employ derivative-free optimization to identify a good parameter set that best fits the compact model without performing an exhaustive number of simulations. We demonstrate the effectiveness of our approach by successfully modeling a diamond Schottky diode with the SPICE diode model and a GaN-on-SiC HEMT with the ASM-HEMT model.
arXiv Detail & Related papers (2024-06-24T06:52:50Z)
ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane Reflections [59.839926875976225]
We propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections. In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters.
arXiv Detail & Related papers (2024-05-30T17:26:02Z)
A Unified Gaussian Process for Branching and Nested Hyperparameter Optimization [19.351804144005744]
In deep learning, tuning parameters with conditional dependence are common in practice. New GP model accounts for the dependent structure among input variables through a new kernel function. High prediction accuracy and better optimization efficiency are observed in a series of synthetic simulations and real data applications of neural networks.
arXiv Detail & Related papers (2024-01-19T21:11:32Z)
Should We Learn Most Likely Functions or Parameters? [51.133793272222874]
We investigate the benefits and drawbacks of directly estimating the most likely function implied by the model and the data. We find that function-space MAP estimation can lead to flatter minima, better generalization, and improved to overfitting.
arXiv Detail & Related papers (2023-11-27T16:39:55Z)
Sensitivity-Aware Visual Parameter-Efficient Fine-Tuning [91.5113227694443]
We propose a novel visual. sensuous-aware fine-Tuning (SPT) scheme. SPT allocates trainable parameters to task-specific important positions. Experiments on a wide range of downstream recognition tasks show that our SPT is complementary to the existing PEFT methods.
arXiv Detail & Related papers (2023-03-15T12:34:24Z)
On the Effectiveness of Parameter-Efficient Fine-Tuning [79.6302606855302]
Currently, many research works propose to only fine-tune a small portion of the parameters while keeping most of the parameters shared across different tasks. We show that all of the methods are actually sparse fine-tuned models and conduct a novel theoretical analysis of them. Despite the effectiveness of sparsity grounded by our theory, it still remains an open problem of how to choose the tunable parameters.
arXiv Detail & Related papers (2022-11-28T17:41:48Z)
Know Where You're Going: Meta-Learning for Parameter-Efficient Fine-tuning [34.66092282348687]
We show that taking the ultimate choice of fine-tuning method into consideration boosts the performance of parameter-efficient fine-tuning. We prime the pretrained model specifically for parameter-efficient fine-tuning, resulting in gains of up to 1.7 points on cross-lingual NER fine-tuning.
arXiv Detail & Related papers (2022-05-25T02:51:57Z)
No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models [132.90062129639705]
We propose a novel training strategy that encourages all parameters to be trained sufficiently. A parameter with low sensitivity is redundant, and we improve its fitting by increasing its learning rate. In contrast, a parameter with high sensitivity is well-trained and we regularize it by decreasing its learning rate to prevent further overfitting.
arXiv Detail & Related papers (2022-02-06T00:22:28Z)
Hyperparameter Selection for Subsampling Bootstraps [0.0]
A subsampling method like BLB serves as a powerful tool for assessing the quality of estimators for massive data. The performance of the subsampling methods are highly influenced by the selection of tuning parameters. We develop a hyperparameter selection methodology, which can be used to select tuning parameters for subsampling methods. Both simulation studies and real data analysis demonstrate the superior advantage of our method.
arXiv Detail & Related papers (2020-06-02T17:10:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.