Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training
- URL: http://arxiv.org/abs/2402.05033v2
- Date: Wed, 20 Nov 2024 19:51:32 GMT
- Title: Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training
- Authors: Hanna Mazzawi, Pranjal Awasthi, Xavi Gonzalvo, Srikumar Ramalingam,
- Abstract summary: Methods like distillation, compression or quantization help leverage the highly performant large models to induce smaller performant ones.
This paper explores the hypothesis that a single training run can simultaneously train a larger model for performance and derive a smaller model for deployment.
- Score: 32.154166415680066
- License:
- Abstract: Recent breakthroughs and successful deployment of large language and vision models in a constrained environment predominantly follow a two phase approach. First, large models are trained to achieve peak performance, followed by a model shrinking method to meet hardware constraints; Methods like distillation, compression or quantization help leverage the highly performant large models to induce smaller performant ones. Formally, this can be seen as the problem of identifying an optimal model of size $n$ from a larger model of size $k \cdot n$, where $k > 1$ is the overparameterization factor. This paper explores the hypothesis that a single training run can simultaneously train a larger model for performance and derive a smaller model for deployment. Our contribution is an effective architectural change, namely, {\it Majority Kernels} that is compatible with the main standard architectures such as multi-layer perceptrons (MLPs), Residual networks (ResNets), and Transformers. We demonstrate that applying our technique can modify the training dynamics resulting in performance gains across architectures and tasks while maintaining the inference performance consistent. Furthermore, our approach adds minimal overhead to the cost incurred (wall clock time) at training time. The proposed approach shows strong performance on a wide variety of datasets and models, even outperforming strong baselines such as distilled ensembles as well as combinatorial optimization methods based on submodular optimization.
Related papers
- HM3: Hierarchical Multi-Objective Model Merging for Pretrained Models [28.993221775758702]
Model merging is a technique that combines multiple large pretrained models into a single model with enhanced performance and broader task adaptability.
This paper marks a significant advance toward more flexible and comprehensive model merging techniques.
We train policy and value networks using offline sampling of weight vectors, which are then employed for the online optimization of merging strategies.
arXiv Detail & Related papers (2024-09-27T16:31:31Z) - Single Parent Family: A Spectrum of Family Members from a Single Pre-Trained Foundation Model [20.054342930450055]
This paper introduces a novel method of Progressive Low Rank Decomposition (PLRD) tailored for the compression of large language models.
PLRD allows for significant reductions in computational overhead and energy consumption.
Our findings suggest that PLRD could set a new standard for the efficient scaling of LLMs.
arXiv Detail & Related papers (2024-06-28T15:27:57Z) - Towards Efficient Pareto Set Approximation via Mixture of Experts Based Model Fusion [53.33473557562837]
Solving multi-objective optimization problems for large deep neural networks is a challenging task due to the complexity of the loss landscape and the expensive computational cost.
We propose a practical and scalable approach to solve this problem via mixture of experts (MoE) based model fusion.
By ensembling the weights of specialized single-task models, the MoE module can effectively capture the trade-offs between multiple objectives.
arXiv Detail & Related papers (2024-06-14T07:16:18Z) - A Lightweight Feature Fusion Architecture For Resource-Constrained Crowd
Counting [3.5066463427087777]
We introduce two lightweight models to enhance the versatility of crowd-counting models.
These models maintain the same downstream architecture while incorporating two distinct backbones: MobileNet and MobileViT.
We leverage Adjacent Feature Fusion to extract diverse scale features from a Pre-Trained Model (PTM) and subsequently combine these features seamlessly.
arXiv Detail & Related papers (2024-01-11T15:13:31Z) - Model Breadcrumbs: Scaling Multi-Task Model Merging with Sparse Masks [12.146530928616386]
A common approach for targeted problems involves fine-tuning pre-trained foundation models for specific target tasks.
This work focuses on the problem of merging multiple fine-tunings of the same foundation model derived from a spectrum of auxiliary tasks.
We introduce a new simple method, Model Breadcrumbs, which consists of a sparsely defined weight set that guides model adaptation within the weight space of a pre-trained model.
arXiv Detail & Related papers (2023-12-11T19:10:55Z) - One-stop Training of Multiple Capacity Models [74.87789190840527]
We propose a novel one-stop training framework to jointly train high-capacity and low-capactiy models.
Unlike knowledge distillation, where multiple capacity models are trained from scratch separately, our approach integrates supervisions from different capacity models simultaneously.
arXiv Detail & Related papers (2023-05-23T13:44:09Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient
Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on matrix product operator (MPO)
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - When to Update Your Model: Constrained Model-based Reinforcement
Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee of model-based RL (MBRL)
Our follow-up derived bounds reveal the relationship between model shifts and performance improvement.
A further example demonstrates that learning models from a dynamically-varying number of explorations benefit the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z) - When Ensembling Smaller Models is More Efficient than Single Large
Models [52.38997176317532]
We show that ensembles can outperform single models with both higher accuracy and requiring fewer total FLOPs to compute.
This presents an interesting observation that output diversity in ensembling can often be more efficient than training larger models.
arXiv Detail & Related papers (2020-05-01T18:56:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.