Re-parameterizing Your Optimizers rather than Architectures
- URL: http://arxiv.org/abs/2205.15242v1
- Date: Mon, 30 May 2022 16:55:59 GMT
- Title: Re-parameterizing Your Optimizers rather than Architectures
- Authors: Xiaohan Ding, Honghao Chen, Xiangyu Zhang, Kaiqi Huang, Jungong Han,
Guiguang Ding
- Abstract summary: We propose a novel paradigm of incorporating model-specific prior knowledge into optimizers and using them to train generic (simple) models.
As an implementation, we propose a novel methodology to add prior knowledge by modifying the gradients according to a set of model-specific hyper-parameters, referred to as Gradient Re-parameterization; the resulting optimizers are named RepOptimizers.
Focusing on an extremely simple VGG-style plain model, we showcase that such a model trained with a RepOptimizer, referred to as RepOpt-VGG, performs on par with recent well-designed models.
- Score: 119.08740698936633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The well-designed structures in neural networks reflect the prior knowledge
incorporated into the models. However, though different models have various
priors, we are used to training them with model-agnostic optimizers (e.g.,
SGD). In this paper, we propose a novel paradigm of incorporating
model-specific prior knowledge into optimizers and using them to train generic
(simple) models. As an implementation, we propose a novel methodology to add
prior knowledge by modifying the gradients according to a set of model-specific
hyper-parameters, which is referred to as Gradient Re-parameterization, and the
optimizers are named RepOptimizers. For the extreme simplicity of model
structure, we focus on a VGG-style plain model and showcase that such a simple
model trained with a RepOptimizer, which is referred to as RepOpt-VGG, performs
on par with the recent well-designed models. From a practical perspective,
RepOpt-VGG is a favorable base model because of its simple structure, high
inference speed and training efficiency. Compared to Structural
Re-parameterization, which adds priors into models via constructing extra
training-time structures, RepOptimizers require no extra forward/backward
computations and solve the problem of quantization. The code and models are
publicly available at https://github.com/DingXiaoH/RepOptimizers.
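For intuition, below is a minimal PyTorch sketch of the Gradient Re-parameterization idea: an SGD variant that multiplies each parameter's gradient by a model-specific scale before the ordinary update. The class name, the grad_scales argument, and the constant scale are illustrative placeholders, not the repository's actual API or the paper's derived hyper-parameters.

```python
import torch
from torch.optim import SGD

class RepSGD(SGD):
    """Sketch of a RepOptimizer-style SGD: the structural prior lives in
    per-parameter gradient scales rather than in extra training-time branches."""

    def __init__(self, params, grad_scales=None, **kwargs):
        super().__init__(params, **kwargs)
        self.grad_scales = grad_scales or {}  # parameter -> scale tensor

    @torch.no_grad()
    def step(self, closure=None):
        # Re-parameterize the gradients, then take an ordinary SGD step.
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None and p in self.grad_scales:
                    p.grad.mul_(self.grad_scales[p])
        return super().step(closure)

model = torch.nn.Linear(16, 16)
scales = {model.weight: torch.full_like(model.weight, 2.0)}  # placeholder prior
opt = RepSGD(model.parameters(), grad_scales=scales, lr=0.1, momentum=0.9)
```

Because the prior is applied to gradients rather than to the architecture, the forward and backward passes are those of the plain model, which is what removes the extra training-time computation of Structural Re-parameterization.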
Related papers
- SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation [52.6922833948127]
In this work, we investigate the importance of parameters in pre-trained diffusion models.
We propose a novel model fine-tuning method to make full use of these ineffective parameters.
Our method enhances the generative capabilities of pre-trained models in downstream applications.
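A hedged sketch of the idea as summarized above, treating the smallest-magnitude weights as the "ineffective" parameters and updating only those through a sparse mask; magnitude as the importance proxy is an assumption here, and SaRA's progressive low-rank machinery is not reproduced.

```python
import torch

def ineffective_mask(weight: torch.Tensor, ratio: float = 0.05) -> torch.Tensor:
    """True where a weight is among the `ratio` smallest in magnitude
    (assumption: magnitude as the importance proxy)."""
    k = max(1, int(weight.numel() * ratio))
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight.abs() <= threshold

# Usage: only the masked ("ineffective") entries receive updates.
w = torch.randn(64, 64, requires_grad=True)
mask = ineffective_mask(w.detach())
loss = (w ** 2).sum()
loss.backward()
with torch.no_grad():
    w -= 1e-3 * w.grad * mask  # gradients outside the mask are discarded
```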
arXiv Detail & Related papers (2024-09-10T16:44:47Z)
- Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Experts (SMoE) models have emerged as a scalable alternative to dense models in language modeling.
Our research explores task-specific model pruning to inform decisions about designing SMoE architectures.
We introduce an adaptive, task-aware pruning technique, UNCURL, which reduces the number of experts per MoE layer in an offline manner post-training.
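A hedged sketch of offline, task-aware expert pruning in the spirit of the summary above; the scoring rule (top-1 routing frequency on task data) is an assumption, not necessarily UNCURL's criterion.

```python
import torch
import torch.nn as nn

def prune_experts(router_logits: torch.Tensor, experts: nn.ModuleList,
                  keep: int) -> nn.ModuleList:
    """Keep the `keep` experts the task's router selects most often, offline.
    router_logits: gate outputs collected on task data, (num_tokens, num_experts)."""
    counts = torch.bincount(router_logits.argmax(dim=-1), minlength=len(experts))
    keep_ids = sorted(counts.topk(keep).indices.tolist())
    return nn.ModuleList(experts[i] for i in keep_ids)
```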
arXiv Detail & Related papers (2024-09-02T22:35:03Z)
- SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
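A hedged sketch of the construction step: compress a fine-tuned model's weight delta into a low-rank expert by truncated SVD, with no extra data or training; SMILE's routing and exact factorization details are not reproduced here.

```python
import torch

def lowrank_expert(w_pre: torch.Tensor, w_ft: torch.Tensor, rank: int):
    """Distill the fine-tuning update (w_ft - w_pre) into rank-`rank` factors."""
    U, S, Vh = torch.linalg.svd(w_ft - w_pre, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # (out_features, rank)
    B = Vh[:rank, :]            # (rank, in_features)
    return A, B  # expert contribution for input x: x @ B.T @ A.T
```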
arXiv Detail & Related papers (2024-08-19T17:32:15Z)
- Simulated Overparameterization [35.12611686956487]
We introduce a novel paradigm called Simulated Overparameterization (SOP).
SOP proposes a unique approach to model training and inference, where a model with a significantly larger number of parameters is trained in such a way that a smaller, efficient subset of these parameters is used for the actual computation during inference.
We present a novel, architecture-agnostic algorithm called "majority kernels", which seamlessly integrates with predominant architectures, including Transformer models.
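An illustrative sketch of the pattern under stated assumptions: the layer trains k parallel kernels but computes with their mean, so inference touches only the smaller, averaged weight. The actual "majority kernels" algorithm may combine kernels differently.

```python
import torch
import torch.nn as nn

class SimulatedOverparamLinear(nn.Module):
    """Trains k parallel weight kernels but computes with their mean, so the
    deployable layer is k times smaller than the trained parameterization."""

    def __init__(self, in_f: int, out_f: int, k: int = 4):
        super().__init__()
        self.kernels = nn.Parameter(torch.randn(k, out_f, in_f) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_f))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.kernels.mean(dim=0)  # the efficient subset used for computation
        return nn.functional.linear(x, w, self.bias)
```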
arXiv Detail & Related papers (2024-02-07T17:07:41Z)
- ZhiJian: A Unifying and Rapidly Deployable Toolbox for Pre-trained Model Reuse [59.500060790983994]
This paper introduces ZhiJian, a comprehensive and user-friendly toolbox for model reuse, utilizing the PyTorch backend.
ZhiJian presents a novel paradigm that unifies diverse perspectives on model reuse, encompassing target architecture construction with a PTM, tuning of the target model with a PTM, and PTM-based inference.
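As a generic illustration of one of these reuse perspectives, plain PyTorch (not ZhiJian's actual API) can initialize a target architecture from a pre-trained model's compatible weights:

```python
import torch.nn as nn

def reuse_pretrained(target: nn.Module, ptm_state: dict) -> nn.Module:
    """Copy every pre-trained tensor whose name and shape the target
    architecture can accept; everything else keeps its fresh initialization."""
    own = target.state_dict()
    compatible = {k: v for k, v in ptm_state.items()
                  if k in own and v.shape == own[k].shape}
    target.load_state_dict(compatible, strict=False)
    return target
```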
arXiv Detail & Related papers (2023-08-17T19:12:13Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
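A simplified sketch of the sharing idea, reduced from a full matrix product operator to a three-factor decomposition: each layer's weight is A_l @ C @ B_l, and the central factor C is shared across all layers.

```python
import torch
import torch.nn as nn

class SharedCoreLinear(nn.Module):
    """Weight = A @ core @ B; only A and B are layer-specific."""

    def __init__(self, dim: int, rank: int, shared_core: nn.Parameter):
        super().__init__()
        self.A = nn.Parameter(torch.randn(dim, rank) * 0.02)  # layer-specific
        self.B = nn.Parameter(torch.randn(rank, dim) * 0.02)  # layer-specific
        self.core = shared_core                               # shared across layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.A @ self.core @ self.B).T

core = nn.Parameter(torch.eye(64))  # one central tensor for the whole stack
layers = [SharedCoreLinear(512, 64, core) for _ in range(12)]
```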
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
- Exploring and Evaluating Personalized Models for Code Generation [9.25440316608194]
We evaluate transformer model fine-tuning for personalization.
We consider three key approaches, including custom fine-tuning, which allows all the model parameters to be tuned.
We compare these fine-tuning strategies for code generation and discuss the potential generalization and cost benefits of each in various deployment scenarios.
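A minimal sketch contrasting custom (full) fine-tuning with a lightweight variant that freezes everything outside a hypothetical "head" submodule; the subset actually frozen in the paper's lightweight strategies may differ.

```python
import torch.nn as nn

def configure_finetuning(model: nn.Module, strategy: str = "custom"):
    """'custom' tunes every parameter; 'lightweight' tunes only the head."""
    for name, p in model.named_parameters():
        p.requires_grad = (strategy == "custom") or name.startswith("head")
    return [p for p in model.parameters() if p.requires_grad]
```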
arXiv Detail & Related papers (2022-08-29T23:28:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.