Masks, Signs, And Learning Rate Rewinding
- URL: http://arxiv.org/abs/2402.19262v1
- Date: Thu, 29 Feb 2024 15:32:02 GMT
- Title: Masks, Signs, And Learning Rate Rewinding
- Authors: Advait Gadhikar and Rebekka Burkholz
- Abstract summary: Learning Rate Rewinding (LRR) has been established as a strong variant of Iterative Magnitude Pruning (IMP).
We conduct experiments that disentangle the effects of mask learning and parameter optimization.
The results suggest that LRR's ability to flip parameter signs early and stay robust to sign perturbations drives its advantage; in support of this hypothesis, we prove in a simplified single hidden neuron setting that LRR succeeds in more cases than IMP.
- Score: 21.245849787139655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning Rate Rewinding (LRR) has been established as a strong variant of
Iterative Magnitude Pruning (IMP) to find lottery tickets in deep
overparameterized neural networks. While both iterative pruning schemes couple
structure and parameter learning, understanding how LRR excels in both aspects
can bring us closer to the design of more flexible deep learning algorithms
that can optimize diverse sets of sparse architectures. To this end, we conduct
experiments that disentangle the effect of mask learning and parameter
optimization and how both benefit from overparameterization. The ability of LRR
to flip parameter signs early and stay robust to sign perturbations seems to
make it not only more effective in mask identification but also in optimizing
diverse sets of masks, including random ones. In support of this hypothesis, we
prove in a simplified single hidden neuron setting that LRR succeeds in more
cases than IMP, as it can escape initially problematic sign configurations.
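To make the contrast between the two pruning schedules concrete, here is a minimal single-tensor sketch (not the authors' code): `train_fn` is a hypothetical routine that retrains the masked tensor with a freshly restarted learning-rate schedule, and `magnitude_prune` performs one round of magnitude pruning.

```python
import torch

def magnitude_prune(mask: torch.Tensor, weights: torch.Tensor, frac: float) -> torch.Tensor:
    """Remove the `frac` smallest-magnitude weights among those still unpruned."""
    scores = weights.abs() * mask              # already-pruned entries score 0
    surviving = scores[mask.bool()]
    k = int(frac * surviving.numel())
    if k == 0:
        return mask
    thresh = torch.kthvalue(surviving, k).values
    return mask * (scores > thresh).float()

def iterative_pruning(weights, train_fn, rounds=5, frac=0.2, scheme="lrr"):
    """Toy single-tensor view of IMP vs. LRR.

    scheme="imp": surviving weights are rewound to their initial values after
                  each pruning round before retraining (weight rewinding).
    scheme="lrr": trained weights are kept; only the learning-rate schedule is
                  restarted inside `train_fn` (learning rate rewinding).
    `train_fn(weights, mask)` is a hypothetical routine that retrains the
    masked tensor with a fresh learning-rate schedule and returns it.
    """
    init_weights = weights.clone()
    mask = torch.ones_like(weights)
    weights = train_fn(weights, mask)          # dense pre-training
    for _ in range(rounds):
        mask = magnitude_prune(mask, weights, frac)
        if scheme == "imp":
            weights = init_weights.clone()     # IMP: rewind parameters (and signs)
        weights = train_fn(weights * mask, mask)  # both: restart LR schedule
    return mask, weights
```

The only structural difference is the weight reset: IMP restarts each round from the initial parameters and their initial signs, whereas LRR continues from the trained parameters, so sign flips learned in earlier rounds are preserved.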
Related papers
- Single-Layer Learnable Activation for Implicit Neural Representation (SL$^{2}$A-INR) [6.572456394600755]
Implicit Neural Representation (INR), leveraging a neural network to transform coordinate input into corresponding attributes, has driven significant advances in vision-related domains.
We propose SL$^{2}$A-INR, which uses a single-layer learnable activation function to improve on traditional ReLU-based networks.
Our method achieves superior performance across diverse tasks, including image representation, 3D shape reconstruction, single-image super-resolution, CT reconstruction, and novel view synthesis.
arXiv Detail & Related papers (2024-09-17T02:02:15Z) - MLAE: Masked LoRA Experts for Visual Parameter-Efficient Fine-Tuning [45.93128932828256]
Masked LoRA Experts (MLAE) is an innovative approach that applies the concept of masking to visual PEFT.
Our method incorporates a cellular decomposition strategy that transforms a low-rank matrix into independent rank-1 submatrices.
We show that MLAE achieves new state-of-the-art (SOTA) performance with an average accuracy score of 78.8% on the VTAB-1k benchmark and 90.9% on the FGVC benchmark.
arXiv Detail & Related papers (2024-05-29T08:57:23Z) - Random Masking Finds Winning Tickets for Parameter Efficient Fine-tuning [17.638387297838936]
Fine-tuning large language models (LLMs) can be costly.
Parameter-efficient fine-tuning (PEFT) addresses this by training only a fraction of the parameters; its success reveals the expressiveness and flexibility of pretrained models.
This paper studies the limits of PEFT by further simplifying its design and reducing the number of trainable parameters beyond standard setups.
We show that Random Masking is surprisingly effective: with a larger-than-expected learning rate, Random Masking can match the performance of standard PEFT algorithms on various tasks, using fewer trainable parameters (a minimal sketch of this setup appears after this list).
arXiv Detail & Related papers (2024-05-04T07:44:18Z) - Meta-Learning Adversarial Bandit Algorithms [55.72892209124227]
We study online meta-learning with bandit feedback.
We learn to tune a generalization of online mirror descent (OMD) with self-concordant barrier regularizers.
arXiv Detail & Related papers (2023-07-05T13:52:10Z) - Unsupervised Learning of Initialization in Deep Neural Networks via
Maximum Mean Discrepancy [74.34895342081407]
We propose an unsupervised algorithm to find good initialization for input data.
We first notice that each parameter configuration in the parameter space corresponds to one particular downstream task of d-way classification.
We then conjecture that the success of learning is directly related to how diverse downstream tasks are in the vicinity of the initial parameters.
arXiv Detail & Related papers (2023-02-08T23:23:28Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Adaptive neighborhood Metric learning [184.95321334661898]
We propose a novel distance metric learning algorithm, named adaptive neighborhood metric learning (ANML).
ANML can be used to learn both the linear and deep embeddings.
The log-exp mean function proposed in our method offers a new perspective from which to review deep metric learning methods.
arXiv Detail & Related papers (2022-01-20T17:26:37Z) - Understanding Self-supervised Learning with Dual Deep Networks [74.92916579635336]
We propose a novel framework to understand contrastive self-supervised learning (SSL) methods that employ dual pairs of deep ReLU networks.
We prove that in each SGD update of SimCLR with various loss functions, the weights at each layer are updated by a covariance operator.
To further study what role the covariance operator plays and which features are learned in such a process, we model the data generation and augmentation processes through a hierarchical latent tree model (HLTM).
arXiv Detail & Related papers (2020-10-01T17:51:49Z) - Towards Minimax Optimal Reinforcement Learning in Factored Markov
Decision Processes [53.72166325215299]
We study minimax optimal reinforcement learning in episodic factored Markov decision processes (FMDPs).
We present two algorithms: the first achieves minimax optimal regret guarantees for a rich class of factored structures.
The second enjoys better computational complexity with a slightly worse regret.
arXiv Detail & Related papers (2020-06-24T00:50:17Z)
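For the Random Masking entry above (flagged there with a forward reference), the following is a minimal, hypothetical sketch of fine-tuning through a fixed random binary mask over parameters; the gradient-hook mechanism, the `trainable_frac` value, and the enlarged learning rate are illustrative assumptions rather than the paper's exact setup.

```python
import torch

def apply_random_masking(model: torch.nn.Module, trainable_frac: float = 0.001, seed: int = 0):
    """Keep only a random subset of each parameter tensor trainable.

    A gradient hook zeroes gradients outside a fixed random binary mask, so
    optimizer updates touch only the masked entries. `trainable_frac` is an
    assumed hyperparameter, not a value taken from the paper.
    """
    gen = torch.Generator().manual_seed(seed)
    masks = {}
    for name, param in model.named_parameters():
        mask = (torch.rand(param.shape, generator=gen) < trainable_frac).float()
        masks[name] = mask
        # bind the mask to this hook via a default argument
        param.register_hook(lambda grad, m=mask: grad * m.to(grad.device))
    return masks

# Usage sketch (illustrative values): the paper reports that a larger-than-usual
# learning rate is what lets such sparse random updates match standard PEFT.
# masks = apply_random_masking(pretrained_model, trainable_frac=0.001)
# optimizer = torch.optim.SGD(pretrained_model.parameters(), lr=1e-1)
```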
This list is automatically generated from the titles and abstracts of the papers in this site.