A Large-Scale Exploration of $μ$-Transfer
- URL: http://arxiv.org/abs/2404.05728v5
- Date: Wed, 26 Jun 2024 04:07:08 GMT
- Title: A Large-Scale Exploration of $μ$-Transfer
- Authors: Lucas Lingle
- Abstract summary: $\mu$-Transfer yields scaling rules for model initialization and learning rates.
$\mu$-Transfer is not yet widely adopted, perhaps due to higher implementation complexity, many variations, or complex theoretical background.
We study models of up to 10B parameters and training budgets of up to 190B tokens, and find $\mu$-Transfer works as intended for the majority of important cases, yet also identify a few cases where it may not.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large artificial neural networks have become a mainstay of language, vision, and audio processing and synthesis, yet their initializations and learning rates are often set in an unsophisticated fashion, due to the high cost of hyperparameter sweeps at scale. The $\mu$-Parameterization ($\mu$P) offers a potential solution to this challenge, yielding scaling rules for model initialization and learning rates while reportedly enabling zero-shot hyperparameter transfer from small to large models. Despite its evident promise, the $\mu$P method is not yet widely adopted, perhaps due to higher implementation complexity, many variations, or complex theoretical background. This work investigates $\mu$P empirically, focusing on the ubiquitous transformer architecture, and aims to answer a simple question: does $\mu$-Transfer yield optimal learning rates in practice? Studying models of up to 10B parameters and training budgets of up to 190B tokens, we find $\mu$-Transfer works as intended for the majority of important cases, yet also identify a few cases where it may not.
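As a rough illustration of what such scaling rules look like in practice, the sketch below groups parameters the way $\mu$P-style setups are commonly implemented with Adam: matrix-like hidden weights get a learning rate scaled by base_width / width and a 1/sqrt(fan_in) initialization, while embeddings and vector parameters keep the base settings. This is a simplified, assumption-laden sketch, not the paper's exact recipe; the helper `mup_param_groups` and its defaults are hypothetical.

```python
import math
import numpy as np

def mup_param_groups(named_params, base_lr=1e-3, base_width=256, width=1024):
    """Illustrative µP-style grouping for Adam: matrix-like hidden weights
    get lr scaled by base_width / width and 1/sqrt(fan_in) init std;
    embeddings, biases, and norm vectors keep the base values."""
    groups = []
    for name, p in named_params:
        hidden_matrix = p.ndim >= 2 and "embed" not in name
        groups.append({
            "name": name,
            "lr": base_lr * base_width / width if hidden_matrix else base_lr,
            "init_std": 1.0 / math.sqrt(p.shape[-1]) if hidden_matrix else 0.02,
        })
    return groups

params = [("embed.weight", np.zeros((50_000, 1024))),
          ("block0.mlp.w_in", np.zeros((4096, 1024)))]
for g in mup_param_groups(params):
    print(g["name"], round(g["lr"], 6), round(g["init_std"], 4))
```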
Related papers
- Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework [10.317740844867913]
We build a simulator based on 472 language model pre-training runs with varying data compositions from the SlimPajama dataset.
We observe that even simple acquisition functions can enable principled training decisions across training models from 20M to 1B parameters.
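As a hedged illustration of how a simple acquisition function can rank candidate data mixtures (not the paper's actual framework), the sketch below computes expected improvement from a surrogate's predicted validation loss; the candidate names and numbers are invented.

```python
import math

def expected_improvement(mu, sigma, best_so_far):
    """Expected improvement for a minimization objective (e.g. validation
    loss); mu/sigma are a surrogate's predictive mean and std for one
    candidate data mixture."""
    if sigma <= 0.0:
        return max(best_so_far - mu, 0.0)
    z = (best_so_far - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best_so_far - mu) * cdf + sigma * pdf

# Rank hypothetical candidate mixtures by EI and pick the next one to train.
candidates = {"web-heavy": (2.95, 0.10), "code-heavy": (3.05, 0.30)}
best = 3.00
print(max(candidates, key=lambda k: expected_improvement(*candidates[k], best)))
```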
arXiv Detail & Related papers (2025-03-26T22:19:47Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
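For context, a minimal non-learnable baseline for depth scaling-up simply repeats pretrained layers until the target depth is reached; LESA's learnable approach is meant to improve on heuristics of this kind. The sketch below is that naive baseline, not LESA itself, and `scale_up_depth` is a hypothetical helper.

```python
def scale_up_depth(small_layers, target_depth):
    """Naive depth scale-up: build a deeper stack by repeating each
    pretrained layer's weights in order. Only an illustrative baseline."""
    assert target_depth >= len(small_layers)
    n = len(small_layers)
    new_layers = []
    for i in range(target_depth):
        # Map position i in the deep model back to a source layer index.
        src = min(i * n // target_depth, n - 1)
        new_layers.append(dict(small_layers[src]))  # copy the layer's weights
    return new_layers

deep = scale_up_depth([{"id": 0}, {"id": 1}], target_depth=5)
print([layer["id"] for layer in deep])  # [0, 0, 0, 1, 1]
```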
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Parameter-Efficient Transfer Learning for Music Foundation Models [51.61531917413708]
We investigate the use of parameter-efficient transfer learning (PETL) for music foundation models.
PETL methods outperform both probing and fine-tuning on music auto-tagging.
PETL methods achieve similar results as fine-tuning with significantly less training cost.
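As one concrete example of a PETL technique (not necessarily the exact methods evaluated in the paper), a LoRA-style adapter freezes the pretrained weight and trains only a low-rank update; a minimal sketch:

```python
import numpy as np

class LoRALinear:
    """LoRA-style adapter: freeze the pretrained weight W and learn only a
    low-rank update B @ A, so trainable parameters drop from d_out*d_in to
    r*(d_in + d_out). Shown purely for illustration."""
    def __init__(self, W, r=8, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                      # frozen pretrained weight
        self.A = rng.normal(0, 0.01, (r, W.shape[1]))   # trainable
        self.B = np.zeros((W.shape[0], r))              # trainable, init at zero
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(W=np.random.randn(64, 32), r=4)
print(layer(np.random.randn(5, 32)).shape)  # (5, 64)
```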
arXiv Detail & Related papers (2024-11-28T20:50:40Z) - Warmstarting for Scaling Language Models [47.691182347349894]
Scaling up model size to improve performance has worked remarkably well in the current large language model paradigm.
High training costs for contemporary scales of data and models result in a lack of thorough understanding of how to tune and arrive at such training setups.
One direction to ameliorate the cost of pretraining large models is to warmstart the large-scale training from smaller models that are cheaper to tune.
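A minimal sketch of the naive warmstart alluded to above: copy each small-model tensor into the corresponding slice of the larger model and leave the rest freshly initialized. The paper studies how to make such transfers behave well, so this is only the simplest possible baseline, and `warmstart` is a hypothetical helper.

```python
import numpy as np

def warmstart(large_params, small_params):
    """Copy each small-model tensor into the top-left corner of the matching
    large-model tensor; untouched entries keep their fresh initialization."""
    out = {}
    for name, big in large_params.items():
        big = big.copy()
        small = small_params.get(name)
        if small is not None and small.ndim == big.ndim:
            slices = tuple(slice(0, min(s, b)) for s, b in zip(small.shape, big.shape))
            big[slices] = small[slices]
        out[name] = big
    return out

large = {"ffn.w": np.zeros((8, 8))}
small = {"ffn.w": np.ones((4, 4))}
print(warmstart(large, small)["ffn.w"].sum())  # 16.0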
arXiv Detail & Related papers (2024-11-11T20:02:29Z) - FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models [35.40065954148091]
FINE is a method based on the Learngene framework for initializing downstream networks by leveraging pre-trained models.
It decomposes pre-trained knowledge into the product of matrices (i.e., $U$, $\Sigma$, and $V$), where $U$ and $V$ are shared across network blocks as "learngenes".
It consistently outperforms direct pre-training, particularly for smaller models, achieving state-of-the-art results across variable model sizes.
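A hedged sketch of the factorization described above, assuming a shared orthonormal basis ($U$, $V$) fitted from the layers' mean weight via SVD and a per-layer diagonal $\Sigma_i$ obtained by projection onto that basis; the exact procedure in FINE differs, and `factorize_layers` is a hypothetical helper.

```python
import numpy as np

def factorize_layers(layer_weights, rank):
    """Approximate each layer weight W_i ~= U @ diag(s_i) @ Vt with U, Vt
    shared across layers ("learngenes") and a per-layer vector s_i."""
    mean_w = np.mean(layer_weights, axis=0)
    U, _, Vt = np.linalg.svd(mean_w, full_matrices=False)
    U, Vt = U[:, :rank], Vt[:rank, :]
    sigmas = []
    for W in layer_weights:
        # Diagonal entries that best reproduce W in the shared basis.
        sigmas.append(np.diag(U.T @ W @ Vt.T))
    return U, sigmas, Vt

layers = [np.random.randn(16, 16) for _ in range(4)]
U, sigmas, Vt = factorize_layers(layers, rank=8)
print(U.shape, sigmas[0].shape, Vt.shape)  # (16, 8) (8,) (8, 16)
```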
arXiv Detail & Related papers (2024-09-28T08:57:17Z) - Efficient Verification-Based Face Identification [50.616875565173274]
We study the problem of performing face verification with an efficient neural model $f$.
Our model leads to a substantially smaller $f$, requiring only 23k parameters and 5M floating-point operations (FLOPs).
We use six face verification datasets to demonstrate that our method is on par or better than state-of-the-art models.
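A minimal sketch of the verification protocol itself: embed two faces and threshold their cosine similarity. The 23k-parameter model $f$ from the paper is not reproduced here; the random projection below stands in for it purely for illustration.

```python
import numpy as np

def verify(embed, face_a, face_b, threshold=0.5):
    """Same/different decision: embed both faces, then threshold cosine
    similarity between the embeddings."""
    ea, eb = embed(face_a), embed(face_b)
    cos = float(ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb) + 1e-8))
    return cos >= threshold, cos

# Stand-in embedding: a random linear projection to 64 dimensions.
rng = np.random.default_rng(0)
proj = rng.normal(size=(64, 112 * 112))
embed = lambda img: proj @ img.reshape(-1)
same, score = verify(embed, rng.random((112, 112)), rng.random((112, 112)))
print(same, round(score, 3))
```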
arXiv Detail & Related papers (2023-12-20T18:08:02Z) - Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge [0.704590071265998]
We study the sample complexity of online Q-learning methods when some prior knowledge about the dynamics is available or can be learned efficiently.
We present an optimistic Q-learning algorithm that achieves $\tilde{\mathcal{O}}(\mathrm{Poly}(H)\sqrt{SAT})$ regret under perfect knowledge of $f$.
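As a generic illustration of optimism in Q-learning (not the paper's algorithm, which additionally exploits partial knowledge of the dynamics $f$), the sketch below adds a count-based exploration bonus to a tabular Q-learning update so rarely visited state-action pairs stay optimistically valued.

```python
import math
from collections import defaultdict

def optimistic_q_update(Q, counts, s, a, r, s_next, actions,
                        gamma=0.99, H=10, c=1.0):
    """One optimistic (UCB-style) tabular Q-learning step with a bonus
    that shrinks as the visit count grows. Generic illustration only."""
    counts[(s, a)] += 1
    n = counts[(s, a)]
    alpha = (H + 1) / (H + n)                 # step size
    bonus = c * math.sqrt(H / n)              # optimism bonus
    target = r + gamma * max(Q[(s_next, b)] for b in actions) + bonus
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

Q, counts = defaultdict(float), defaultdict(int)
optimistic_q_update(Q, counts, s=0, a=1, r=1.0, s_next=2, actions=[0, 1])
print(Q[(0, 1)])
```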
arXiv Detail & Related papers (2023-12-19T19:53:58Z) - A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning
with General Function Approximation [66.26739783789387]
We propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB), for reinforcement learning.
MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost.
Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
arXiv Detail & Related papers (2023-11-26T08:31:57Z) - Integrated Variational Fourier Features for Fast Spatial Modelling with Gaussian Processes [7.5991638205413325]
For $N$ training points, exact inference has $O(N^3)$ cost; with $M \ll N$ features, state-of-the-art sparse variational methods have $O(NM^2)$ cost.
Recently, methods have been proposed using more sophisticated features; these promise $O(M^3)$ cost, with good performance in low-dimensional tasks such as spatial modelling, but they only work with a very limited class of kernels, excluding some of the most commonly used.
In this work, we propose integrated Fourier features, which extend these performance benefits to a very broad class of stationary covariance functions.
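To make the gap concrete, here is the arithmetic for an assumed $N = 10^5$ training points and $M = 10^3$ features, using the complexities quoted above (the specific N and M are illustrative choices, not from the paper):

```python
# Illustrative operation counts for the costs quoted above.
N, M = 100_000, 1_000
print(f"exact GP            O(N^3)  ~ {N**3:.1e} ops")      # ~1.0e15
print(f"sparse variational  O(NM^2) ~ {N * M**2:.1e} ops")   # ~1.0e11
print(f"Fourier features    O(M^3)  ~ {M**3:.1e} ops")       # ~1.0e9
```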
arXiv Detail & Related papers (2023-08-27T15:44:28Z) - An Experimental Design Perspective on Model-Based Reinforcement Learning [73.37942845983417]
In practical applications of RL, it is expensive to observe state transitions from the environment.
We propose an acquisition function that quantifies how much information a state-action pair would provide about the optimal solution to a Markov decision process.
arXiv Detail & Related papers (2021-12-09T23:13:57Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing.
This strategy improves the model quality but maintains constant computational costs, and our further exploration on extremely large-scale models reflects that it is more effective in training larger models.
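A hedged sketch of the routing pattern described above: experts are split into $k$ prototype groups, top-1 routing is applied inside each group, and the selected outputs are summed, so compute stays at $k$ expert calls per token. The random gating scores below are a stand-in for a learned gate.

```python
import numpy as np

def prototype_route(x, experts, k):
    """Expert prototyping sketch: top-1 routing within each of k prototype
    groups, summing the k selected experts' outputs."""
    groups = np.array_split(np.arange(len(experts)), k)
    scores = np.random.rand(len(experts))        # stand-in for a learned gate
    out = np.zeros_like(x)
    for group in groups:
        best = group[np.argmax(scores[group])]   # top-1 within the prototype
        out += experts[best](x)
    return out

experts = [lambda x, w=i: (w + 1.0) * x for i in range(8)]
print(prototype_route(np.ones(4), experts, k=2))
```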
arXiv Detail & Related papers (2021-05-31T16:12:44Z) - On Function Approximation in Reinforcement Learning: Optimism in the
Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning.
In particular, we prove that the complexity of the function class $\mathcal{F}$ characterizes the complexity of the learning problem.
Our regret bounds are independent of the number of episodes.
arXiv Detail & Related papers (2020-11-09T18:32:22Z)