How to Train Your Multi-Exit Model? Analyzing the Impact of Training Strategies
- URL: http://arxiv.org/abs/2407.14320v2
- Date: Sat, 21 Jun 2025 10:00:51 GMT
- Title: How to Train Your Multi-Exit Model? Analyzing the Impact of Training Strategies
- Authors: Piotr Kubaty, Bartosz Wójcik, Bartłomiej Krzepkowski, Monika Michaluk, Tomasz Trzciński, Jary Pomponi, Kamil Adamczewski
- Abstract summary: Early exits enable the network's forward pass to terminate early by attaching trainable internal classifiers to the backbone network. Existing early-exit methods typically adopt either a joint training approach, where the backbone and exit heads are trained simultaneously, or a disjoint approach, where the heads are trained separately. This paper introduces a set of metrics to analyze early-exit training dynamics and guide the choice of training strategy.
- Score: 3.1836117900874825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Early exits enable the network's forward pass to terminate early by attaching trainable internal classifiers to the backbone network. Existing early-exit methods typically adopt either a joint training approach, where the backbone and exit heads are trained simultaneously, or a disjoint approach, where the heads are trained separately. However, the implications of this choice are often overlooked, with studies typically adopting one approach without adequate justification. This choice influences training dynamics and its impact remains largely unexplored. In this paper, we introduce a set of metrics to analyze early-exit training dynamics and guide the choice of training strategy. We demonstrate that conventionally used joint and disjoint regimes yield suboptimal performance. To address these limitations, we propose a mixed training strategy: the backbone is trained first, followed by the training of the entire multi-exit network. Through comprehensive evaluations of training strategies across various architectures, datasets, and early-exit methods, we present the strengths and weaknesses of the early exit training strategies. In particular, we show consistent improvements in performance and efficiency using the proposed mixed strategy.
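To make the three regimes concrete, here is a minimal PyTorch sketch contrasting them; it is not the authors' implementation. The toy backbone, the placement of the exit heads, the uniform summation of exit losses, and the SGD settings are illustrative assumptions; the sketch only shows which parameters and which exit losses each strategy optimizes in each phase.

```python
# Minimal sketch (not the authors' code) contrasting the joint, disjoint, and
# proposed mixed training regimes for a multi-exit network. The toy backbone,
# head placement, uniform loss summation, and SGD settings are illustrative
# assumptions, not details taken from the paper.
import torch
import torch.nn as nn


class MultiExitNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Backbone split into stages, with one internal classifier (exit head) per stage.
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU()),
        ])
        self.heads = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(c, num_classes))
            for c in (32, 64, 128)
        ])

    def forward(self, x):
        logits = []
        for stage, head in zip(self.stages, self.heads):
            x = stage(x)
            logits.append(head(x))
        return logits  # one prediction per exit; the last one is the final classifier


def train_epoch(model, loader, params, use_exits, lr=0.1):
    """One epoch: optimize only `params`, summing cross-entropy over the exits in `use_exits`."""
    criterion = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(list(params), lr=lr)
    for x, y in loader:
        logits = model(x)
        loss = sum(criterion(logits[i], y) for i in use_exits)
        opt.zero_grad()
        loss.backward()
        opt.step()


def train_joint(model, loader):
    # Joint: backbone and all exit heads are optimized together from the start.
    train_epoch(model, loader, model.parameters(), use_exits=[0, 1, 2])


def train_disjoint(model, loader):
    # Disjoint: train the backbone with its final classifier only, then fit the
    # internal heads while the backbone stays fixed (it is not given to the optimizer).
    backbone = list(model.stages.parameters()) + list(model.heads[-1].parameters())
    train_epoch(model, loader, backbone, use_exits=[2])
    train_epoch(model, loader, model.heads[:-1].parameters(), use_exits=[0, 1])


def train_mixed(model, loader):
    # Mixed (proposed): train the backbone first, then train the *entire*
    # multi-exit network, backbone included, with all exit losses.
    backbone = list(model.stages.parameters()) + list(model.heads[-1].parameters())
    train_epoch(model, loader, backbone, use_exits=[2])
    train_epoch(model, loader, model.parameters(), use_exits=[0, 1, 2])
```

A practical multi-exit model would additionally need an inference-time exit criterion (e.g., a confidence threshold) and possibly per-exit loss weights; both are omitted here to keep the contrast between the regimes visible.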
Related papers
- On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models [73.10315509190623]
Recent reinforcement learning techniques have yielded impressive reasoning improvements in language models. It remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training. We develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training.
arXiv Detail & Related papers (2025-12-08T18:12:10Z) - Multi-turn Training with Basic Human Feedback Helps Little on LLM Reasoning [11.361171211215597]
We study whether multi-turn training with human feedback is necessary for reasoning tasks. We find that models trained in a single-turn setting generalize effectively to both single- and multi-turn evaluations.
arXiv Detail & Related papers (2025-10-24T11:08:32Z) - Boosting Meta-Training with Base Class Information for Few-Shot Learning [35.144099160883606]
We propose an end-to-end training paradigm consisting of two alternating loops.
In the outer loop, we calculate the cross-entropy loss on the entire training set while updating only the final linear layer.
This training paradigm not only converges quickly but also outperforms existing baselines, indicating that information from the overall training set and the meta-learning training paradigm can reinforce one another.
arXiv Detail & Related papers (2024-03-06T05:13:23Z) - Efficient Stagewise Pretraining via Progressive Subnetworks [53.00045381931778]
The prevailing view suggests that stagewise dropping strategies, such as layer dropping, are ineffective when compared to stacking-based approaches.
This paper challenges this notion by demonstrating that, with proper design, dropping strategies can be competitive, if not better, than stacking methods.
We propose an instantiation of this framework - Random Part Training (RAPTR) - that selects and trains only a random subnetwork at each step, progressively increasing the size in stages.
arXiv Detail & Related papers (2024-02-08T18:49:09Z) - An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training [11.749347656959822]
We propose a flexible model placement framework that offers two general and agile model placement strategies.
Our framework provides a simple user interface and guidelines to easily and flexibly configure these strategies in various training scenarios.
arXiv Detail & Related papers (2023-12-19T03:24:55Z) - Fast Propagation is Better: Accelerating Single-Step Adversarial Training via Sampling Subnetworks [69.54774045493227]
A drawback of adversarial training is the computational overhead introduced by the generation of adversarial examples.
We propose to exploit the interior building blocks of the model to improve efficiency.
Compared with previous methods, our method not only reduces the training cost but also achieves better model robustness.
arXiv Detail & Related papers (2023-10-24T01:36:20Z) - Joint Training of Deep Ensembles Fails Due to Learner Collusion [61.557412796012535]
Ensembles of machine learning models have been well established as a powerful method of improving performance over a single model.
Traditionally, ensembling algorithms train their base learners independently or sequentially with the goal of optimizing their joint performance.
We observe that directly minimizing the loss of the ensemble is rarely applied in practice.
arXiv Detail & Related papers (2023-01-26T18:58:07Z) - Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information [77.80071279597665]
We propose an all-in-one single-stage pre-training approach, named Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training).
Our approach achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification, object detection, LVIS long-tailed object detection, and ADE20k semantic segmentation.
arXiv Detail & Related papers (2022-11-17T18:59:49Z) - Learning to Weight Samples for Dynamic Early-exiting Networks [35.03752825893429]
Early exiting is an effective paradigm for improving the inference efficiency of deep networks.
Our work proposes a weight prediction network to weight the loss of different training samples at each exit (a rough sketch of this idea appears after this list).
We show that the proposed weighting mechanism consistently improves the trade-off between classification accuracy and inference efficiency.
arXiv Detail & Related papers (2022-09-17T10:46:32Z) - Adversarial Coreset Selection for Efficient Robust Training [11.510009152620666]
We show how selecting a small subset of training data provides a principled approach to reducing the time complexity of robust training.
We conduct extensive experiments to demonstrate that our approach speeds up adversarial training by 2-3 times.
arXiv Detail & Related papers (2022-09-13T07:37:53Z) - Enhancing Adversarial Training with Feature Separability [52.39305978984573]
We introduce a new concept of an adversarial training graph (ATG), with which the proposed adversarial training with feature separability (ATFS) boosts intra-class feature similarity and increases inter-class feature variance.
Through comprehensive experiments, we demonstrate that the proposed ATFS framework significantly improves both clean and robust performance.
arXiv Detail & Related papers (2022-05-02T04:04:23Z) - Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core gives a low-rank model with better performance than training the low-rank model alone.
arXiv Detail & Related papers (2021-06-16T15:57:51Z) - Class Means as an Early Exit Decision Mechanism [18.300490726072326]
We propose a novel early exit technique based on the class means of samples.
This makes our method particularly useful for neural network training on low-power devices.
arXiv Detail & Related papers (2021-03-01T17:31:55Z) - Consensus Control for Decentralized Deep Learning [72.50487751271069]
Decentralized training of deep learning models enables on-device learning over networks, as well as efficient scaling to large compute clusters.
We show in theory that when the training consensus distance is lower than a critical quantity, decentralized training converges as fast as the centralized counterpart.
Our empirical insights allow the principled design of better decentralized training schemes that mitigate the performance drop.
arXiv Detail & Related papers (2021-02-09T13:58:33Z) - Empirical Evaluation of Supervision Signals for Style Transfer Models [44.39622949370144]
In this work we empirically compare the dominant optimization paradigms which provide supervision signals during training.
We find that backtranslation has model-specific limitations, which inhibit the training of style transfer models.
We also experiment with Minimum Risk Training, a popular technique in the machine translation community, which, to our knowledge, has not been empirically evaluated in the task of style transfer.
arXiv Detail & Related papers (2021-01-15T15:33:30Z) - How Important is the Train-Validation Split in Meta-Learning? [155.5088631672781]
A common practice in meta-learning is to perform a train-validation split (the train-val method) where the prior adapts to the task on one split of the data, and the resulting predictor is evaluated on another split.
Despite its prevalence, the importance of the train-validation split is not well understood either in theory or in practice.
We show that the train-train method can indeed outperform the train-val method on both simulations and real meta-learning tasks.
arXiv Detail & Related papers (2020-10-12T16:48:42Z) - Efficient Robust Training via Backward Smoothing [125.91185167854262]
Adversarial training is the most effective strategy in defending against adversarial examples.
It suffers from high computational costs due to the iterative adversarial attacks in each training step.
Recent studies show that it is possible to achieve fast adversarial training by performing a single-step attack.
arXiv Detail & Related papers (2020-10-03T04:37:33Z)
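As a side note on the "Learning to Weight Samples for Dynamic Early-exiting Networks" entry above, below is a minimal, hedged sketch of the general idea of predicting per-sample, per-exit loss weights. The weighter's architecture and its input (detached per-exit losses) are illustrative assumptions, not the mechanism described in that paper, whose weight prediction network and training procedure may differ.

```python
# Hedged sketch of per-sample, per-exit loss weighting for a multi-exit network.
# The weighter's input (detached per-exit losses) and its architecture are
# illustrative assumptions, not the mechanism from the cited paper.
import torch
import torch.nn as nn


class ExitLossWeighter(nn.Module):
    """Predicts a normalized weight per exit for every sample."""
    def __init__(self, num_exits: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_exits, hidden), nn.ReLU(), nn.Linear(hidden, num_exits)
        )

    def forward(self, per_exit_losses):              # shape: (batch, num_exits)
        return torch.softmax(self.net(per_exit_losses), dim=-1)


def weighted_exit_loss(all_logits, targets, weighter):
    # all_logits: list of (batch, num_classes) tensors, one per exit.
    ce = nn.CrossEntropyLoss(reduction="none")
    losses = torch.stack([ce(l, targets) for l in all_logits], dim=-1)  # (batch, exits)
    weights = weighter(losses.detach())              # weights do not backprop into the losses
    return (weights * losses).sum(dim=-1).mean()
```

In practice such a weighter would itself need to be trained (e.g., with a meta-objective on held-out data); that part is intentionally left out of the sketch.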
This list is automatically generated from the titles and abstracts of the papers on this site.