Re-parameterizing Your Optimizers rather than Architectures
- URL: http://arxiv.org/abs/2205.15242v1
- Date: Mon, 30 May 2022 16:55:59 GMT
- Title: Re-parameterizing Your Optimizers rather than Architectures
- Authors: Xiaohan Ding, Honghao Chen, Xiangyu Zhang, Kaiqi Huang, Jungong Han,
Guiguang Ding
- Abstract summary: We propose a novel paradigm of incorporating model-specific prior knowledge into optimizers and using them to train generic (simple) models.
As an implementation, we propose a novel methodology to add prior knowledge by modifying the gradients according to a set of model-specific hyper-parameters, referred to as Gradient Re-parameterization; the resulting optimizers are named RepOptimizers.
Focusing on an extremely simple VGG-style plain model, we showcase that such a model trained with a RepOptimizer, referred to as RepOpt-VGG, performs on par with recent well-designed models.
- Score: 119.08740698936633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The well-designed structures in neural networks reflect the prior knowledge
incorporated into the models. However, though different models have various
priors, we are used to training them with model-agnostic optimizers (e.g.,
SGD). In this paper, we propose a novel paradigm of incorporating
model-specific prior knowledge into optimizers and using them to train generic
(simple) models. As an implementation, we propose a novel methodology to add
prior knowledge by modifying the gradients according to a set of model-specific
hyper-parameters, which is referred to as Gradient Re-parameterization, and the
optimizers are named RepOptimizers. For the extreme simplicity of model
structure, we focus on a VGG-style plain model and showcase that such a simple
model trained with a RepOptimizer, which is referred to as RepOpt-VGG, performs
on par with the recent well-designed models. From a practical perspective,
RepOpt-VGG is a favorable base model because of its simple structure, high
inference speed and training efficiency. Compared to Structural
Re-parameterization, which adds priors into models via constructing extra
training-time structures, RepOptimizers require no extra forward/backward
computations and solve the problem of quantization. The code and models are
publicly available at https://github.com/DingXiaoH/RepOptimizers.
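For intuition, below is a minimal PyTorch sketch of the Gradient Re-parameterization idea: an SGD variant that multiplies each parameter's gradient by a model-specific scale before the ordinary update. The class name, the grad_scales argument, and the constant scale are illustrative placeholders, not the repository's actual API or the paper's derived hyper-parameters.

```python
import torch
from torch.optim import SGD

class RepSGD(SGD):
    """Sketch of a RepOptimizer-style SGD: the structural prior lives in
    per-parameter gradient scales rather than in extra training-time branches."""

    def __init__(self, params, grad_scales=None, **kwargs):
        super().__init__(params, **kwargs)
        self.grad_scales = grad_scales or {}  # parameter -> scale tensor

    @torch.no_grad()
    def step(self, closure=None):
        # Re-parameterize the gradients, then take an ordinary SGD step.
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None and p in self.grad_scales:
                    p.grad.mul_(self.grad_scales[p])
        return super().step(closure)

model = torch.nn.Linear(16, 16)
scales = {model.weight: torch.full_like(model.weight, 2.0)}  # placeholder prior
opt = RepSGD(model.parameters(), grad_scales=scales, lr=0.1, momentum=0.9)
```

Because the prior is applied to gradients rather than to the architecture, the forward and backward passes are those of the plain model, which is what removes the extra training-time computation of Structural Re-parameterization.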
Related papers
- SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation [52.6922833948127]
In this work, we investigate the importance of parameters in pre-trained diffusion models.
We propose a novel model fine-tuning method to make full use of these ineffective parameters.
Our method enhances the generative capabilities of pre-trained models in downstream applications.
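A hedged sketch of the idea as summarized above, treating the smallest-magnitude weights as the "ineffective" parameters and updating only those through a sparse mask; magnitude as the importance proxy is an assumption here, and SaRA's progressive low-rank machinery is not reproduced.

```python
import torch

def ineffective_mask(weight: torch.Tensor, ratio: float = 0.05) -> torch.Tensor:
    """True where a weight is among the `ratio` smallest in magnitude
    (assumption: magnitude as the importance proxy)."""
    k = max(1, int(weight.numel() * ratio))
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight.abs() <= threshold

# Usage: only the masked ("ineffective") entries receive updates.
w = torch.randn(64, 64, requires_grad=True)
mask = ineffective_mask(w.detach())
loss = (w ** 2).sum()
loss.backward()
with torch.no_grad():
    w -= 1e-3 * w.grad * mask  # gradients outside the mask are discarded
```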
arXiv Detail & Related papers (2024-09-10T16:44:47Z)
- Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Experts (SMoE) models have emerged as a scalable alternative to dense models in language modeling.
Our research explores task-specific model pruning to inform decisions about designing SMoE architectures.
We introduce an adaptive, task-aware pruning technique, UNCURL, which reduces the number of experts per MoE layer in an offline manner post-training.
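A hedged sketch of offline, task-aware expert pruning in the spirit of the summary above; the scoring rule (top-1 routing frequency on task data) is an assumption, not necessarily UNCURL's criterion.

```python
import torch
import torch.nn as nn

def prune_experts(router_logits: torch.Tensor, experts: nn.ModuleList,
                  keep: int) -> nn.ModuleList:
    """Keep the `keep` experts the task's router selects most often, offline.
    router_logits: gate outputs collected on task data, (num_tokens, num_experts)."""
    counts = torch.bincount(router_logits.argmax(dim=-1), minlength=len(experts))
    keep_ids = sorted(counts.topk(keep).indices.tolist())
    return nn.ModuleList(experts[i] for i in keep_ids)
```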
arXiv Detail & Related papers (2024-09-02T22:35:03Z)
- SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
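A hedged sketch of the construction step: compress a fine-tuned model's weight delta into a low-rank expert by truncated SVD, with no extra data or training; SMILE's routing and exact factorization details are not reproduced here.

```python
import torch

def lowrank_expert(w_pre: torch.Tensor, w_ft: torch.Tensor, rank: int):
    """Distill the fine-tuning update (w_ft - w_pre) into rank-`rank` factors."""
    U, S, Vh = torch.linalg.svd(w_ft - w_pre, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # (out_features, rank)
    B = Vh[:rank, :]            # (rank, in_features)
    return A, B  # expert contribution for input x: x @ B.T @ A.T
```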
arXiv Detail & Related papers (2024-08-19T17:32:15Z)
- Simulated Overparameterization [35.12611686956487]
We introduce a novel paradigm called Simulated Overparameterization (SOP).
SOP proposes a unique approach to model training and inference, where a model with a significantly larger number of parameters is trained in such a way that a smaller, efficient subset of these parameters is used for the actual computation during inference.
We present a novel, architecture-agnostic algorithm called "majority kernels", which seamlessly integrates with predominant architectures, including Transformer models.
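An illustrative sketch of the pattern under stated assumptions: the layer trains k parallel kernels but computes with their mean, so inference touches only the smaller, averaged weight. The actual "majority kernels" algorithm may combine kernels differently.

```python
import torch
import torch.nn as nn

class SimulatedOverparamLinear(nn.Module):
    """Trains k parallel weight kernels but computes with their mean, so the
    deployable layer is k times smaller than the trained parameterization."""

    def __init__(self, in_f: int, out_f: int, k: int = 4):
        super().__init__()
        self.kernels = nn.Parameter(torch.randn(k, out_f, in_f) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_f))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.kernels.mean(dim=0)  # the efficient subset used for computation
        return nn.functional.linear(x, w, self.bias)
```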
arXiv Detail & Related papers (2024-02-07T17:07:41Z)
- ZhiJian: A Unifying and Rapidly Deployable Toolbox for Pre-trained Model Reuse [59.500060790983994]
This paper introduces ZhiJian, a comprehensive and user-friendly toolbox for model reuse, utilizing the PyTorch backend.
ZhiJian presents a novel paradigm that unifies diverse perspectives on model reuse, encompassing target architecture construction with a PTM, tuning of the target model with a PTM, and PTM-based inference.
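As a generic illustration of one of these reuse perspectives, plain PyTorch (not ZhiJian's actual API) can initialize a target architecture from a pre-trained model's compatible weights:

```python
import torch.nn as nn

def reuse_pretrained(target: nn.Module, ptm_state: dict) -> nn.Module:
    """Copy every pre-trained tensor whose name and shape the target
    architecture can accept; everything else keeps its fresh initialization."""
    own = target.state_dict()
    compatible = {k: v for k, v in ptm_state.items()
                  if k in own and v.shape == own[k].shape}
    target.load_state_dict(compatible, strict=False)
    return target
```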
arXiv Detail & Related papers (2023-08-17T19:12:13Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
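A simplified sketch of the sharing idea, reduced from a full matrix product operator to a three-factor decomposition: each layer's weight is A_l @ C @ B_l, and the central factor C is shared across all layers.

```python
import torch
import torch.nn as nn

class SharedCoreLinear(nn.Module):
    """Weight = A @ core @ B; only A and B are layer-specific."""

    def __init__(self, dim: int, rank: int, shared_core: nn.Parameter):
        super().__init__()
        self.A = nn.Parameter(torch.randn(dim, rank) * 0.02)  # layer-specific
        self.B = nn.Parameter(torch.randn(rank, dim) * 0.02)  # layer-specific
        self.core = shared_core                               # shared across layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.A @ self.core @ self.B).T

core = nn.Parameter(torch.eye(64))  # one central tensor for the whole stack
layers = [SharedCoreLinear(512, 64, core) for _ in range(12)]
```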
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
- Exploring and Evaluating Personalized Models for Code Generation [9.25440316608194]
We evaluate transformer model fine-tuning for personalization.
We consider three key approaches, including custom fine-tuning, which allows all the model parameters to be tuned.
We compare these fine-tuning strategies for code generation and discuss the potential generalization and cost benefits of each in various deployment scenarios.
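A minimal sketch contrasting custom (full) fine-tuning with a lightweight variant that freezes everything outside a hypothetical "head" submodule; the subset actually frozen in the paper's lightweight strategies may differ.

```python
import torch.nn as nn

def configure_finetuning(model: nn.Module, strategy: str = "custom"):
    """'custom' tunes every parameter; 'lightweight' tunes only the head."""
    for name, p in model.named_parameters():
        p.requires_grad = (strategy == "custom") or name.startswith("head")
    return [p for p in model.parameters() if p.requires_grad]
```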
arXiv Detail & Related papers (2022-08-29T23:28:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.