Trainable Weight Averaging: Accelerating Training and Improving Generalization
- URL: http://arxiv.org/abs/2205.13104v4
- Date: Mon, 10 Feb 2025 06:21:42 GMT
- Title: Trainable Weight Averaging: Accelerating Training and Improving Generalization
- Authors: Tao Li, Zhehao Huang, Yingwen Wu, Zhengbao He, Qinghua Tao, Xiaolin Huang, Chih-Jen Lin
- Abstract summary: We introduce Trainable Weight Averaging (TWA), a novel optimization method that operates within a reduced subspace spanned by candidate weights. TWA offers greater flexibility and can be applied to different training scenarios. For large-scale applications, we develop a distributed training framework that combines parallel computation with low-bit compression.
- Score: 35.17919937007783
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weight averaging is a widely used technique for accelerating training and improving the generalization of deep neural networks (DNNs). While existing approaches like stochastic weight averaging (SWA) rely on pre-set weighting schemes, they can be suboptimal when handling diverse weights. We introduce Trainable Weight Averaging (TWA), a novel optimization method that operates within a reduced subspace spanned by candidate weights and learns optimal weighting coefficients through optimization. TWA offers greater flexibility and can be applied to different training scenarios. For large-scale applications, we develop a distributed training framework that combines parallel computation with low-bit compression for the projection matrix, effectively managing memory and computational demands. TWA can be implemented using either training data (TWA-t) or validation data (TWA-v), with the latter providing more effective averaging. Extensive experiments showcase TWA's advantages: (i) it consistently outperforms SWA in generalization performance and flexibility, (ii) when applied during early training, it reduces training time by over 40% on CIFAR datasets and 30% on ImageNet while maintaining comparable performance, and (iii) during fine-tuning, it significantly enhances generalization by weighted averaging of model checkpoints. In summary, we present an efficient and effective framework for trainable weight averaging. The code is available at https://github.com/nblt/TWA.
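The core idea in the abstract, learning the averaging coefficients instead of fixing them in advance, can be illustrated on a toy problem. The sketch below is not the authors' implementation (see the linked repository for that): it assumes a linear least-squares model, treats a handful of synthetic weight vectors as hypothetical "checkpoints", and runs plain gradient descent on the coefficients, starting from the uniform average. The function names, the data, and the step-size rule are all illustrative assumptions, and the distributed and low-bit components of the paper are not modelled.

```python
import numpy as np


def loss_and_grad(w, X, y):
    """Half mean squared error of a linear model and its gradient w.r.t. w."""
    residual = X @ w - y
    loss = 0.5 * np.mean(residual ** 2)
    grad = X.T @ residual / len(y)
    return loss, grad


def learn_averaging_coefficients(candidates, X, y, steps=200):
    """Gradient descent on the coefficients alpha of w = alpha @ candidates.

    `candidates` has shape (k, d): one row per candidate weight vector.
    """
    k = candidates.shape[0]
    alpha = np.full(k, 1.0 / k)  # start from the uniform average
    outputs = X @ candidates.T   # predictions of each candidate, shape (n, k)
    # Step size 1/L, where L is the largest Hessian eigenvalue of this
    # quadratic toy loss in alpha; this choice never increases the loss.
    lr = len(y) / (np.linalg.norm(outputs, 2) ** 2)
    for _ in range(steps):
        w = alpha @ candidates
        _, grad_w = loss_and_grad(w, X, y)
        # Chain rule: dL/dalpha = candidates @ dL/dw.
        alpha -= lr * (candidates @ grad_w)
    return alpha


rng = np.random.default_rng(0)
n, d, k = 256, 20, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Hypothetical checkpoints of varying quality: noisy copies of w_true.
noise_scales = np.linspace(0.2, 2.0, k)
candidates = np.stack([w_true + s * rng.normal(size=d) for s in noise_scales])

uniform_loss, _ = loss_and_grad(candidates.mean(axis=0), X, y)
alpha = learn_averaging_coefficients(candidates, X, y)
learned_loss, _ = loss_and_grad(alpha @ candidates, X, y)

print("uniform-average loss:", uniform_loss)
print("learned-average loss:", learned_loss)
print("learned coefficients:", alpha)
```

Here the same data is used to fit the coefficients and to report the loss, which loosely corresponds to the training-data variant (TWA-t); fitting `alpha` on a held-out split instead would mirror the validation-data variant (TWA-v). Only `k` numbers are optimized, which is what makes searching the subspace spanned by the candidates cheap compared with updating all `d` weights.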
Related papers
- SeWA: Selective Weight Average via Probabilistic Masking [51.015724517293236]
We show that only a few points are needed to achieve better and faster convergence.
We transform the discrete selection problem into a continuous subset optimization framework.
We derive the SeWA's stability bounds, which are sharper than that under both convex image checkpoints.
arXiv Detail & Related papers (2025-02-14T12:35:21Z)
- When, Where and Why to Average Weights? [36.106114687828395]
Averaging checkpoints along the training trajectory is a powerful approach to improve the generalization performance of Machine Learning models.
We show that averaging significantly accelerates training and yields considerable efficiency gains, at the price of a minimal implementation and memory cost.
arXiv Detail & Related papers (2025-02-10T18:40:48Z)
- Adaptive Data Exploitation in Deep Reinforcement Learning [50.53705050673944]
We introduce ADEPT, a powerful framework to enhance **data efficiency** and **generalization** in deep reinforcement learning (RL).
Specifically, ADEPT adaptively manages the use of sampled data across different learning stages via multi-armed bandit (MAB) algorithms.
We test ADEPT on benchmarks including Procgen, MiniGrid, and PyBullet.
arXiv Detail & Related papers (2025-01-22T04:01:17Z)
- AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs [61.13296177652599]
We show that data mixtures that perform well at smaller scales may not retain their advantage at larger scales.
We propose AutoScale, a two-stage, scale-aware data composition framework.
arXiv Detail & Related papers (2024-07-29T17:06:30Z)
- Training Artificial Neural Networks by Coordinate Search Algorithm [0.20971479389679332]
We propose an efficient version of the gradient-free Coordinate Search (CS) algorithm for training neural networks.
The proposed algorithm can be used with non-differentiable activation functions and tailored to multi-objective/multi-loss problems.
Finding the optimal values for weights of ANNs is a large-scale optimization problem.
arXiv Detail & Related papers (2024-02-20T01:47:25Z)
- Provable Multi-Task Representation Learning by Two-Layer ReLU Neural Networks [69.38572074372392]
We present the first results proving that feature learning occurs during training with a nonlinear model on multiple tasks.
Our key insight is that multi-task pretraining induces a pseudo-contrastive loss that favors representations that align points that typically have the same label across tasks.
arXiv Detail & Related papers (2023-07-13T16:39:08Z)
- Lookaround Optimizer: $k$ steps around, 1 step average [36.207388029666625]
Weight Average (WA) is an active research topic due to its simplicity in ensembling deep networks and the effectiveness in promoting generalization.
Existing weight average approaches, however, are often carried out along only one training trajectory in a post-hoc manner.
We propose Lookaround, a straightforward yet effective SGD-based optimizer leading to flatter minima with better generalization.
arXiv Detail & Related papers (2023-06-13T10:55:20Z)
- Diffused Redundancy in Pre-trained Representations [98.55546694886819]
We take a closer look at how features are encoded in pre-trained representations.
We find that learned representations in a given layer exhibit a degree of diffuse redundancy.
Our findings shed light on the nature of representations learned by pre-trained deep neural networks.
arXiv Detail & Related papers (2023-05-31T21:00:50Z)
- Efficient Implementation of a Multi-Layer Gradient-Free Online-Trainable Spiking Neural Network on FPGA [0.31498833540989407]
ODESA is the first network to have end-to-end multi-layer online local supervised training without using gradients.
This research shows that the network architecture and the online training of weights and thresholds can be implemented efficiently on a large scale in hardware.
arXiv Detail & Related papers (2023-05-31T00:34:15Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Distributed Adversarial Training to Robustify Deep Neural Networks at Scale [100.19539096465101]
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate classification.
To defend against such attacks, an effective approach, known as adversarial training (AT), has been shown to mitigate robust training.
We propose a large-batch adversarial training framework implemented over multiple machines.
arXiv Detail & Related papers (2022-06-13T15:39:43Z)
- Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
- Stochastic Weight Averaging Revisited [5.68481425260348]
We show that SWA's performance depends heavily on the extent to which the SGD process that runs before SWA has converged.
We show that following an SGD process with insufficient convergence, running SWA more times leads to continual incremental benefits in terms of generalization.
arXiv Detail & Related papers (2022-01-03T08:29:01Z)
- Efficient and Robust Mixed-Integer Optimization Methods for Training Binarized Deep Neural Networks [0.07614628596146598]
We study deep neural networks with binary activation functions and continuous or integer weights (BDNN).
We show that the BDNN can be reformulated as a mixed-integer linear program with bounded weight space which can be solved to global optimality by classical mixed-integer programming solvers.
For the first time a robust model is presented which enforces robustness of the BDNN during training.
arXiv Detail & Related papers (2021-10-21T18:02:58Z)
- Exploring the Common Principal Subspace of Deep Features in Neural Networks [50.37178960258464]
We find that different Deep Neural Networks (DNNs) trained with the same dataset share a common principal subspace in latent spaces.
Specifically, we design a new metric $\mathcal{P}$-vector to represent the principal subspace of deep features learned in a DNN.
Small angles (with cosine close to $1.0$) have been found in the comparisons between any two DNNs trained with different algorithms/architectures.
arXiv Detail & Related papers (2021-10-06T15:48:32Z)
- Embedded Knowledge Distillation in Depth-level Dynamic Neural Network [8.207403859762044]
We propose an elegant Depth-level Dynamic Neural Network (DDNN) that integrates different-depth sub-nets of similar architectures.
In this article, we design the Embedded-Knowledge-Distillation (EKD) training mechanism for the DDNN to implement semantic knowledge transfer from the teacher (full) net to multiple sub-nets.
Experiments on CIFAR-10, CIFAR-100, and ImageNet datasets demonstrate that sub-nets in DDNN with EKD training achieve better performance than depth-level pruning or individual training.
arXiv Detail & Related papers (2021-03-01T06:35:31Z)
- Learning Neural Network Subspaces [74.44457651546728]
Recent observations have advanced our understanding of the neural network optimization landscape.
With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks.
arXiv Detail & Related papers (2021-02-20T23:26:58Z)
- Training Sparse Neural Networks using Compressed Sensing [13.84396596420605]
We develop and test a novel method based on compressed sensing which combines the pruning and training into a single step.
Specifically, we utilize an adaptively weighted $\ell_1$ penalty on the weights during training, which we combine with a generalization of the regularized dual averaging (RDA) algorithm in order to train sparse neural networks.
arXiv Detail & Related papers (2020-08-21T19:35:54Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality Regularization and Singular Value Sparsification [53.50708351813565]
We propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD on every step.
We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve higher reduction on computation load under the same accuracy.
arXiv Detail & Related papers (2020-04-20T02:40:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.