How to Scale Your EMA
- URL: http://arxiv.org/abs/2307.13813v3
- Date: Tue, 7 Nov 2023 17:57:42 GMT
- Title: How to Scale Your EMA
- Authors: Dan Busbridge, Jason Ramapuram, Pierre Ablin, Tatiana Likhomanenko,
Eeshan Gunesh Dhekane, Xavier Suau, Russ Webb
- Abstract summary: We provide a scaling rule for optimization in the presence of a model EMA.
We show the rule's validity where the model EMA contributes to the optimization of the target model.
For Self-Supervised Learning, we enable training of BYOL up to batch size 24,576 without sacrificing performance.
- Score: 20.94711634514331
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Preserving training dynamics across batch sizes is an important tool for
practical machine learning as it enables the trade-off between batch size and
wall-clock time. This trade-off is typically enabled by a scaling rule; for
example, in stochastic gradient descent, one should scale the learning rate
linearly with the batch size. Another important machine learning tool is the
model EMA, a functional copy of a target model, whose parameters move towards
those of its target model according to an Exponential Moving Average (EMA) at a
rate parameterized by a momentum hyperparameter. This model EMA can improve the
robustness and generalization of supervised learning, stabilize
pseudo-labeling, and provide a learning signal for Self-Supervised Learning
(SSL). Prior works have not considered the optimization of the model EMA when
performing scaling, leading to different training dynamics across batch sizes
and lower model performance. In this work, we provide a scaling rule for
optimization in the presence of a model EMA and demonstrate the rule's validity
across a range of architectures, optimizers, and data modalities. We also show
the rule's validity where the model EMA contributes to the optimization of the
target model, enabling us to train EMA-based pseudo-labeling and SSL methods at
small and large batch sizes. For SSL, we enable training of BYOL up to batch
size 24,576 without sacrificing performance, a 6$\times$ wall-clock time
reduction under idealized hardware settings.
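The abstract states the linear learning-rate rule for SGD explicitly; the form of the EMA rule comes from the paper body, where the proposed EMA Scaling Rule exponentiates the momentum (rho_hat = rho ** kappa) alongside the linear learning-rate rule. Below is a minimal PyTorch sketch of both pieces; the base recipe values (batch size 256, lr 0.1, momentum 0.9999) are illustrative, not from the paper.

```python
import copy
import torch

def scale_hyperparameters(lr, momentum, kappa):
    """When the batch size is scaled by kappa, scale the SGD learning rate
    linearly (lr -> kappa * lr, as the abstract states) and exponentiate the
    EMA momentum (rho -> rho ** kappa, the paper's EMA Scaling Rule)."""
    return kappa * lr, momentum ** kappa

@torch.no_grad()
def ema_update(ema_model, target_model, momentum):
    """Move the EMA parameters toward the target model's parameters:
    zeta <- momentum * zeta + (1 - momentum) * theta."""
    for p_ema, p in zip(ema_model.parameters(), target_model.parameters()):
        p_ema.mul_(momentum).add_(p, alpha=1.0 - momentum)

# Illustrative base recipe scaled to batch size 1024, i.e. kappa = 4.
base_lr, base_momentum = 0.1, 0.9999
kappa = 1024 / 256
lr, momentum = scale_hyperparameters(base_lr, base_momentum, kappa)

model = torch.nn.Linear(8, 2)           # stand-in target model
ema_model = copy.deepcopy(model)        # functional copy tracked by the EMA
opt = torch.optim.SGD(model.parameters(), lr=lr)

x, y = torch.randn(32, 8), torch.randn(32, 2)
torch.nn.functional.mse_loss(model(x), y).backward()
opt.step()                              # one target-model update ...
ema_update(ema_model, model, momentum)  # ... followed by one EMA update
```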
Related papers
- Structuring a Training Strategy to Robustify Perception Models with Realistic Image Augmentations [1.5723316845301678] (arXiv 2024-08-30)
This report introduces a novel methodology for training with augmentations to enhance model robustness and performance under realistic, challenging conditions.
We present a comprehensive framework that includes identifying weak spots in Machine Learning models, selecting suitable augmentations, and devising effective training strategies.
Experimental results demonstrate improvements in model performance, as measured by commonly used metrics such as mean Average Precision (mAP) and mean Intersection over Union (mIoU) on open-source object detection and semantic segmentation models and datasets.
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies [85.57899012821211] (arXiv 2024-04-09)
Small Language Models (SLMs) are a resource-efficient alternative to Large Language Models (LLMs).
We introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants.
We also introduce the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K.
- Learning Semantic Proxies from Visual Prompts for Parameter-Efficient Fine-Tuning in Deep Metric Learning [13.964106147449051] (arXiv 2024-02-04)
Existing solutions concentrate on fine-tuning the pre-trained models on conventional image datasets.
We propose a novel and effective framework based on learning Visual Prompts (VPT) in pre-trained Vision Transformers (ViT).
We demonstrate that our new approximations with semantic information offer superior representational capability.
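The summary names the mechanism (learnable visual prompts inside a pre-trained ViT) without showing it. Below is a generic sketch of visual prompt tuning, assuming the standard recipe of prepending learnable prompt tokens to a frozen backbone; the class name, toy encoder, and dimensions are hypothetical, and this is not the paper's semantic-proxy construction.

```python
import torch
import torch.nn as nn

class VisualPromptTuning(nn.Module):
    """Prepend learnable prompt tokens to a frozen transformer's input
    sequence; only the prompts and a small head receive gradients."""
    def __init__(self, encoder, embed_dim=768, num_prompts=8, num_classes=100):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # freeze the pre-trained backbone
            p.requires_grad_(False)
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, embed_dim) * 0.02)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, seq_len, embed_dim), e.g. ViT patch embeddings
        b = patch_tokens.shape[0]
        tokens = torch.cat([self.prompts.expand(b, -1, -1), patch_tokens], dim=1)
        encoded = self.encoder(tokens)
        return self.head(encoded.mean(dim=1))  # pooled representation -> logits

# Toy stand-in for a pre-trained ViT encoder: any module mapping
# (batch, seq, dim) -> (batch, seq, dim) works here.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=2,
)
model = VisualPromptTuning(encoder)
logits = model(torch.randn(4, 16, 768))   # 4 images, 16 patch tokens each
```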
- Asynchronous Multi-Model Dynamic Federated Learning over Wireless Networks: Theory, Modeling, and Optimization [20.741776617129208] (arXiv 2023-05-22)
Federated learning (FL) has emerged as a key technique for distributed machine learning (ML).
We first formulate rectangular scheduling steps and functions to capture the impact of system parameters on learning performance.
Our analysis sheds light on the joint impact of device training variables and asynchronous scheduling decisions.
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872] (arXiv 2023-03-27)
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
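To make "factorize a parameter matrix into a central tensor plus auxiliary tensors" concrete, here is a tensor-train/MPO-style factorization via truncated SVDs. It is a sketch of the general technique under assumed dimension groupings, not the paper's exact algorithm.

```python
import numpy as np

def mpo_decompose(w, in_dims=(4, 16, 4), out_dims=(4, 16, 4), max_rank=64):
    """Factor a weight matrix into a train of three cores via truncated SVDs.
    The middle core holds most parameters (the 'central tensor'); the outer
    cores are small (the 'auxiliary tensors')."""
    assert w.shape == (np.prod(in_dims), np.prod(out_dims))
    # Reshape to (i1, i2, i3, o1, o2, o3) and pair input/output dims per core.
    t = w.reshape(*in_dims, *out_dims).transpose(0, 3, 1, 4, 2, 5)
    dims = [in_dims[k] * out_dims[k] for k in range(3)]
    cores, rank = [], 1
    rest = t.reshape(rank * dims[0], -1)
    for k in range(2):
        u, s, vt = np.linalg.svd(rest, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(u[:, :r].reshape(rank, dims[k], r))
        rest = (s[:r, None] * vt[:r]).reshape(r * dims[k + 1], -1)
        rank = r
    cores.append(rest.reshape(rank, dims[2], 1))
    return cores  # cores[1] is the large central core one could share

w = np.random.randn(256, 256)
aux1, central, aux2 = mpo_decompose(w)
print(aux1.shape, central.shape, aux2.shape)  # (1,16,16) (16,256,16) (16,16,1)
```

Sharing `central` across layers while keeping per-layer `aux1`/`aux2` is the parameter-reduction idea the summary describes.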
- METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877] (arXiv 2022-04-13)
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely the "Model generated dEnoising TRaining Objective" (METRO).
The resulting METRO-LM models, with up to 5.4 billion parameters, achieve a new state of the art on the GLUE, SuperGLUE, and SQuAD benchmarks.
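The summary describes the shape of the objective (training signals generated by an auxiliary model) without detail. The sketch below shows the generic ELECTRA-style pattern this family of methods builds on, with toy stand-in models; it is not METRO's exact recipe, and in practice the auxiliary generator is trained jointly rather than frozen.

```python
import torch
import torch.nn as nn

# Toy stand-ins; real systems use transformer language models.
vocab, dim = 1000, 64
generator = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
main_encoder = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, 1))

def denoising_training_step(tokens, mask_prob=0.15):
    """One model-generated denoising step: the auxiliary generator fills
    masked positions, and the main model learns to flag replaced tokens."""
    mask = torch.rand(tokens.shape) < mask_prob
    with torch.no_grad():  # training signals come from the auxiliary model
        logits = generator(tokens)
        sampled = torch.distributions.Categorical(logits=logits).sample()
    corrupted = torch.where(mask, sampled, tokens)
    is_replaced = (corrupted != tokens).float()
    pred = main_encoder(corrupted).squeeze(-1)  # per-token replaced/original
    return nn.functional.binary_cross_entropy_with_logits(pred, is_replaced)

loss = denoising_training_step(torch.randint(0, vocab, (8, 32)))
loss.backward()
```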
- Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning [59.38343286807997] (arXiv 2022-03-09)
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks.
Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients.
We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
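A first-order multitask update is easy to show concretely. The sketch below is a generic FOMAML-style step (adapt a copy of the model per task, then apply the gradient measured at the adapted parameters back to the shared model, skipping the second-order terms a bi-level MAML update would require); the function name and hyperparameters are illustrative, not MAMF's exact procedure.

```python
import copy
import torch

def first_order_multitask_step(model, task_batches, inner_lr=0.01,
                               outer_lr=0.001, inner_steps=3):
    """FOMAML-style first-order update over a set of tasks (a sketch)."""
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in task_batches:
        fast = copy.deepcopy(model)                   # task-specific copy
        inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                  # inner adaptation
            inner_opt.zero_grad()
            torch.nn.functional.mse_loss(fast(x), y).backward()
            inner_opt.step()
        inner_opt.zero_grad()                         # first-order outer grad
        torch.nn.functional.mse_loss(fast(x), y).backward()
        for g, p in zip(meta_grads, fast.parameters()):
            g += p.grad / len(task_batches)
    with torch.no_grad():                             # averaged outer update
        for p, g in zip(model.parameters(), meta_grads):
            p -= outer_lr * g

model = torch.nn.Linear(4, 1)
tasks = [(torch.randn(16, 4), torch.randn(16, 1)) for _ in range(5)]
first_order_multitask_step(model, tasks)
```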
- Automatic Learning of Subword Dependent Model Scales [50.105894487730545] (arXiv 2021-10-18)
We show that the model scales for a combination of attention encoder-decoder acoustic model and language model can be learned as effectively as with manual tuning.
We extend this approach to subword-dependent model scales, which could not be tuned manually, yielding a 7% improvement on LBS and 3% on SWB.
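As one concrete reading of "subword-dependent model scales": a log-linear combination of acoustic-model and language-model scores in which each subword carries its own learnable scale, trained by gradient descent. This is a toy sketch of the idea; the paper's actual training setup and loss differ.

```python
import torch

vocab = 500                                   # toy subword inventory
log_scales = torch.zeros(vocab, requires_grad=True)  # one scale per subword

def combined_score(am_logprobs, lm_logprobs, tokens):
    """Each token contributes am + exp(scale[token]) * lm to its
    hypothesis score; exp keeps the learned scales positive."""
    scales = log_scales.exp()[tokens]
    return (am_logprobs + scales * lm_logprobs).sum(dim=-1)

# Toy training signal: push the correct hypothesis (index 0) above the rest.
opt = torch.optim.Adam([log_scales], lr=0.1)
tokens = torch.randint(0, vocab, (8, 20))     # 8 hypotheses, 20 subwords each
am, lm = -torch.rand(8, 20), -torch.rand(8, 20)
loss = torch.nn.functional.cross_entropy(
    combined_score(am, lm, tokens).unsqueeze(0), torch.tensor([0]))
loss.backward()
opt.step()
```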
- Robust MAML: Prioritization task buffer with adaptive learning process for model-agnostic meta-learning [15.894925018423665] (arXiv 2021-03-15)
Model-agnostic meta-learning (MAML) is a popular state-of-the-art meta-learning algorithm.
This paper proposes a more robust MAML based on an adaptive learning scheme and a prioritization task buffer.
Experimental results on meta reinforcement learning environments demonstrate a substantial performance gain.
- Transfer Learning without Knowing: Reprogramming Black-box Machine Learning Models with Scarce Data and Limited Resources [78.72922528736011] (arXiv 2020-07-17)
We propose a novel approach, black-box adversarial reprogramming (BAR), that repurposes a well-trained black-box machine learning model.
Using zeroth order optimization and multi-label mapping techniques, BAR can reprogram a black-box ML model solely based on its input-output responses.
BAR outperforms state-of-the-art methods and yields comparable performance to the vanilla adversarial reprogramming method.
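The two ingredients named in the summary, zeroth-order optimization and multi-label mapping, fit in a few lines: learn an additive input "program" from query access alone, estimating gradients with finite differences, and aggregate source-class probabilities into target classes. The black-box model, shapes, and step sizes below are toy assumptions, not BAR's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 10))              # hidden inside the "black box"

def black_box(x):
    """Stand-in for a well-trained model we can only query for output
    probabilities; BAR assumes exactly this input-output access."""
    z = x @ W
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Many-to-one label mapping: 10 source classes -> 2 target classes.
label_map = np.arange(10) % 2

def target_loss(delta, x, y):
    """Add the learned program delta to the inputs, query the black box,
    and score the mapped target-class probabilities."""
    probs = black_box(x + delta)
    mapped = np.stack([probs[:, label_map == c].sum(axis=1)
                       for c in range(2)], axis=1)
    return -np.log(mapped[np.arange(len(y)), y] + 1e-9).mean()

def zeroth_order_grad(delta, x, y, q=20, beta=0.01):
    """Query-only gradient estimate: average one-sided finite differences
    along q random directions."""
    base, grad = target_loss(delta, x, y), np.zeros_like(delta)
    for _ in range(q):
        u = rng.standard_normal(delta.shape)
        grad += (target_loss(delta + beta * u, x, y) - base) / beta * u
    return grad / q

delta = np.zeros(64)                           # the adversarial program
x, y = rng.standard_normal((32, 64)), rng.integers(0, 2, 32)
for step in range(100):                        # plain gradient descent
    delta -= 0.05 * zeroth_order_grad(delta, x, y)
```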
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.