Tangent Model Composition for Ensembling and Continual Fine-tuning
- URL: http://arxiv.org/abs/2307.08114v2
- Date: Sat, 30 Sep 2023 02:37:27 GMT
- Title: Tangent Model Composition for Ensembling and Continual Fine-tuning
- Authors: Tian Yu Liu and Stefano Soatto
- Abstract summary: Tangent Model Composition (TMC) is a method to combine component models independently fine-tuned around a pre-trained point.
TMC improves accuracy by 4.2% compared to ensembling non-linearly fine-tuned models.
- Score: 69.92177580782929
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tangent Model Composition (TMC) is a method to combine component models
independently fine-tuned around a pre-trained point. Component models are
tangent vectors to the pre-trained model that can be added, scaled, or
subtracted to support incremental learning, ensembling, or unlearning.
Component models are composed at inference time via scalar combination,
reducing the cost of ensembling to that of a single model. TMC improves
accuracy by 4.2% compared to ensembling non-linearly fine-tuned models, at a
2.5x to 10x reduction of inference cost that grows linearly with the number of
component models. Each component model can be forgotten at zero cost, with no
residual effect on the resulting inference. When used for continual
fine-tuning, TMC is not constrained by sequential bias and can be executed in
parallel on federated data. TMC outperforms recently published continual
fine-tuning methods almost uniformly on each setting -- task-incremental,
class-incremental, and data-incremental -- on a total of 13 experiments across
3 benchmark datasets, despite not using any replay buffer. TMC is designed for
composing models that are local to a pre-trained embedding, but could be
extended to more general settings. The code is available at:
https://github.com/tianyu139/tangent-model-composition
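The mechanism lends itself to a short sketch: fine-tune in the tangent space of the pre-trained weights, then compose components by averaging their tangent directions, so the ensemble collapses to one forward pass. Below is a minimal PyTorch sketch assuming a toy backbone; the names (`linearized`, `deltas`) and the random stand-in components are illustrative, not the paper's released code.

```python
import torch
from torch.func import functional_call, jvp

# Hypothetical tiny backbone standing in for the pre-trained model.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 10)
)
theta0 = {k: v.detach() for k, v in model.named_parameters()}

def linearized(x, delta):
    # First-order Taylor expansion around theta0:
    # f_lin(x; delta) = f(x; theta0) + J_theta f(x; theta0) @ delta
    f0, jdelta = jvp(
        lambda p: functional_call(model, p, (x,)), (theta0,), (delta,)
    )
    return f0 + jdelta

# Suppose each task produced a tangent component delta_t by fine-tuning the
# linearized model on its data (training loop omitted); random stand-ins here.
deltas = [{k: 0.01 * torch.randn_like(v) for k, v in theta0.items()}
          for _ in range(3)]

# Because f_lin is affine in delta, averaging the component outputs equals a
# single forward pass with the averaged delta: ensembling at the cost of one
# model. Dropping a delta from the sum removes that component's influence.
combined = {k: sum(d[k] for d in deltas) / len(deltas) for k in theta0}
print(linearized(torch.randn(4, 16), combined).shape)  # torch.Size([4, 10])
```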
Related papers
- ModelMix: A New Model-Mixup Strategy to Minimize Vicinal Risk across Tasks for Few-scribble based Cardiac Segmentation [32.19827368497988]
We introduce a new approach to few-scribble supervised segmentation based on mixing model parameters, termed ModelMix.
ModelMix constructs virtual models using convex combinations of convolutional parameters from separate encoders (a minimal sketch follows this entry).
We then regularize the model set to minimize vicinal risk across tasks in both unsupervised and scribble-supervised ways.
arXiv Detail & Related papers (2024-06-19T05:58:11Z)
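A minimal sketch of the convex-combination step described above, assuming PyTorch; `model_mix` and the toy Conv2d encoders are illustrative, not the authors' implementation.

```python
import torch

def model_mix(state_a, state_b, lam):
    # Virtual model: convex combination of two encoders' parameters.
    assert 0.0 <= lam <= 1.0
    return {k: lam * state_a[k] + (1.0 - lam) * state_b[k] for k in state_a}

# Illustrative toy encoders; lam is a sampled mixing coefficient.
enc_a, enc_b = torch.nn.Conv2d(3, 16, 3), torch.nn.Conv2d(3, 16, 3)
virtual = torch.nn.Conv2d(3, 16, 3)
lam = float(torch.rand(()))
virtual.load_state_dict(model_mix(enc_a.state_dict(), enc_b.state_dict(), lam))
```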
- EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) achieves outstanding performance compared to existing merging methods.
EMR-Merging is tuning-free: it requires no data and no additional training, yet shows impressive performance.
arXiv Detail & Related papers (2024-05-23T05:25:45Z)
- Reusing Pretrained Models by Multi-linear Operators for Efficient Training [65.64075958382034]
Training large models from scratch usually costs a substantial amount of resources.
Recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model.
We propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model (a minimal sketch follows this entry).
arXiv Detail & Related papers (2023-10-16T06:16:47Z)
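A hedged sketch of that idea: each target weight becomes a linear combination of all pretrained weights via learnable expansion operators. The two-sided form, shapes, and names (`A`, `B`, `W_small`) are assumptions for illustration, not the paper's exact operator.

```python
import torch

d_out, d_in = 64, 64      # small pretrained layer (illustrative shapes)
D_out, D_in = 128, 128    # larger target layer
W_small = torch.randn(d_out, d_in)                 # pretrained weight
A = torch.randn(D_out, d_out, requires_grad=True)  # learnable output-side map
B = torch.randn(d_in, D_in, requires_grad=True)    # learnable input-side map

# Each entry of W_large is a linear combination of entries of W_small,
# so the small model's knowledge initializes the large one.
W_large = A @ W_small @ B
print(W_large.shape)  # torch.Size([128, 128])
```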
- MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity under a variety of deployment constraints (a nested sub-model sketch follows this entry).
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
arXiv Detail & Related papers (2023-10-11T17:57:14Z)
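A minimal sketch of the nesting idea, assuming PyTorch: sub-models share a prefix of the full FFN hidden dimension, so one weight set yields several deployable sizes. `NestedFFN` and the prefix-slicing rule are illustrative assumptions, not the released MatFormer code.

```python
import torch
import torch.nn as nn

class NestedFFN(nn.Module):
    # Illustrative nesting rule: smaller sub-models use a prefix of the full
    # hidden dimension, so one weight set can be sliced into several sizes.
    def __init__(self, d_model=256, d_hidden=1024):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x, ratio=1.0):
        h = max(1, int(self.up.out_features * ratio))
        z = torch.relu(x @ self.up.weight[:h].T + self.up.bias[:h])
        return z @ self.down.weight[:, :h].T + self.down.bias

ffn = NestedFFN()
print(ffn(torch.randn(4, 256), ratio=0.5).shape)  # torch.Size([4, 256])
```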
- Efficient GPT Model Pre-training using Tensor Train Matrix Representation [65.96485282393361]
Large-scale transformer models feature billions of parameters, making deployment difficult and training from scratch prohibitively expensive.
To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Tensor Train Matrix (TTM) structure (a minimal sketch follows this entry).
The resulting GPT-based model stores up to 40% fewer parameters, showing perplexity comparable to the original model.
arXiv Detail & Related papers (2023-06-05T08:38:25Z)
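A hedged sketch of what a Tensor Train Matrix buys: a dense weight is stored as small cores whose contraction reconstructs it. The two-core factorization, shapes, and rank below are illustrative, not the paper's configuration.

```python
import torch

# Illustrative: a 256 x 256 weight stored as two TT cores of rank 8.
rank = 8
core1 = torch.randn(1, 16, 16, rank)  # (r0, in1, out1, r1)
core2 = torch.randn(rank, 16, 16, 1)  # (r1, in2, out2, r2)

# Contract the cores into the dense 256 x 256 matrix they represent.
W = torch.einsum('aijb,bklc->ikjl', core1, core2).reshape(256, 256)

print(256 * 256, core1.numel() + core2.numel())  # 65536 dense vs 4096 TTM
```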
- Tight Integrated End-to-End Training for Cascaded Speech Translation [40.76367623739673]
A cascaded speech translation model relies on discrete and non-differentiable transcription.
Direct speech translation is an alternative method to avoid error propagation.
This work explores the feasibility of collapsing the entire cascade components into a single end-to-end trainable model.
arXiv Detail & Related papers (2020-11-24T15:43:49Z)
- Ensemble Distillation for Robust Model Fusion in Federated Learning [72.61259487233214]
Federated Learning (FL) is a machine learning setting where many devices collaboratively train a machine learning model.
In most of the current training schemes the central model is refined by averaging the parameters of the server model and the updated parameters from the client side.
We propose ensemble distillation for model fusion, i.e., training the central classifier on unlabeled data using the outputs of the client models (a minimal sketch follows this entry).
arXiv Detail & Related papers (2020-06-12T14:49:47Z)
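A minimal sketch of that fusion step, assuming PyTorch; the function name, temperature, and averaging of softened client predictions are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def fusion_distill_step(server, clients, x_unlabeled, optimizer, T=2.0):
    # Teacher: average of the clients' softened predictions on unlabeled data.
    with torch.no_grad():
        teacher = torch.stack(
            [F.softmax(c(x_unlabeled) / T, dim=-1) for c in clients]
        ).mean(0)
    # Student: the server model is trained to match the ensemble via KL.
    log_student = F.log_softmax(server(x_unlabeled) / T, dim=-1)
    loss = F.kl_div(log_student, teacher, reduction="batchmean") * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```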
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides (including all content) and is not responsible for any consequences of its use.