Performance Control in Early Exiting to Deploy Large Models at the Same Cost of Smaller Ones
- URL: http://arxiv.org/abs/2412.19325v1
- Date: Thu, 26 Dec 2024 18:54:32 GMT
- Title: Performance Control in Early Exiting to Deploy Large Models at the Same Cost of Smaller Ones
- Authors: Mehrnaz Mofakhami, Reza Bayat, Ioannis Mitliagkas, Joao Monteiro, Valentina Zantedeschi
- Abstract summary: Early Exiting (EE) is a promising technique for speeding up inference by adaptively allocating compute resources to data points based on their difficulty. We first present a novel perspective on the EE approach, showing that larger models deployed with EE can achieve higher performance than smaller models. We introduce Performance Control Early Exiting (PCEE), a method that enables accuracy thresholding by basing decisions not on a data point's confidence but on the average accuracy of samples with similar confidence levels from a held-out validation set.
- Score: 17.797465636040087
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Early Exiting (EE) is a promising technique for speeding up inference by adaptively allocating compute resources to data points based on their difficulty. The approach enables predictions to exit at earlier layers for simpler samples while reserving more computation for challenging ones. In this study, we first present a novel perspective on the EE approach, showing that larger models deployed with EE can achieve higher performance than smaller models while maintaining similar computational costs. As existing EE approaches rely on confidence estimation at each exit point, we further study the impact of overconfidence on the controllability of the compute-performance trade-off. We introduce Performance Control Early Exiting (PCEE), a method that enables accuracy thresholding by basing decisions not on a data point's confidence but on the average accuracy of samples with similar confidence levels from a held-out validation set. In our experiments, we show that PCEE offers a simple yet computationally efficient approach that provides better control over performance than standard confidence-based approaches, and allows us to scale up model sizes to yield performance gain while reducing the computational cost.
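As a concrete illustration of the decision rule described in the abstract, here is a minimal sketch of PCEE-style exiting: confidences at each exit are binned on a held-out validation set, each bin is mapped to the average accuracy of its samples, and a point exits as soon as its bin's validation accuracy clears the user-chosen target. The bin count and function names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def calibrate_exit(val_conf, val_correct, n_bins=10):
    """Map confidence bins at one exit layer to average validation accuracy.

    val_conf: array of exit confidences; val_correct: array of 0/1 correctness.
    """
    val_conf, val_correct = np.asarray(val_conf), np.asarray(val_correct)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(val_conf, edges) - 1, 0, n_bins - 1)
    bin_acc = np.full(n_bins, np.nan)
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            bin_acc[b] = val_correct[mask].mean()
    return edges, bin_acc

def pcee_should_exit(conf, edges, bin_acc, target_acc=0.9):
    """Exit if validation samples with similar confidence were accurate enough."""
    b = int(np.clip(np.digitize(conf, edges) - 1, 0, len(bin_acc) - 1))
    return (not np.isnan(bin_acc[b])) and bin_acc[b] >= target_acc
```

The point of thresholding in accuracy space rather than confidence space is that a single target (e.g., 0.9) then means the same thing at every exit, even when the exits are overconfident or miscalibrated.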
Related papers
- Observationally Informed Adaptive Causal Experimental Design [55.998153710215654]
We propose Active Residual Learning, a new paradigm that leverages the observational model as a foundational prior. This approach shifts the experimental focus from learning target causal quantities from scratch to efficiently estimating the residuals required to correct observational bias. Experiments on synthetic and semi-synthetic benchmarks demonstrate that R-Design significantly outperforms baselines.
arXiv Detail & Related papers (2026-03-04T06:52:37Z)
- Fault-Tolerant Evaluation for Sample-Efficient Model Performance Estimators [13.227055178509524]
We propose a fault-tolerant evaluation framework that integrates bias and variance considerations within an adjustable tolerance level. We show that proper calibration of $\varepsilon$ ensures reliable evaluation across different variance regimes. Experiments on real-world datasets demonstrate that our framework provides comprehensive and actionable insights into estimator behavior.
arXiv Detail & Related papers (2026-02-06T22:14:46Z)
- EntroCut: Entropy-Guided Adaptive Truncation for Efficient Chain-of-Thought Reasoning in Small-scale Large Reasoning Models [42.49934375597466]
Large Reasoning Models (LRMs) excel at complex reasoning tasks through extended chain-of-thought generation. We find that the entropy of the model's output distribution in early reasoning steps reliably distinguishes correct from incorrect reasoning. We propose EntroCut, a training-free method that dynamically truncates reasoning by identifying high-confidence states.
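A rough sketch of the entropy signal this summary describes, assuming access to per-step output distributions; the threshold value and the stop-at-first-confident-step rule are placeholders rather than EntroCut's exact procedure.

```python
import numpy as np

def step_entropy(probs):
    """Shannon entropy (in nats) of one reasoning step's output distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def truncate_reasoning(step_distributions, entropy_threshold=0.5):
    """Return how many steps to keep: stop at the first low-entropy
    (high-confidence) step, or keep everything if none is found."""
    for t, probs in enumerate(step_distributions):
        if step_entropy(probs) < entropy_threshold:
            return t + 1  # keep steps up to and including the confident one
    return len(step_distributions)
```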
arXiv Detail & Related papers (2026-01-30T06:19:16Z)
- Anytime Safe PAC Efficient Reasoning [8.618430092165498]
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks but suffer from high computational costs and latency. We propose Betting Probably Approximately Correct (B-PAC) reasoning, a principled method that enables anytime safe and efficient online reasoning under partial feedback.
arXiv Detail & Related papers (2026-01-30T01:30:17Z)
- PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning [6.050409262589219]
We propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Our approach effectively corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts during training. It not only demonstrates robust generalization across multiple tasks, but also exhibits scalable performance across models of varying scales.
arXiv Detail & Related papers (2025-08-28T09:18:26Z)
- Adaptive Resolution Inference (ARI): Energy-Efficient Machine Learning for Internet of Things [11.802983172874901]
The implementation of machine learning in Internet of Things devices poses significant operational challenges due to limited energy and computation resources.
We present adaptive resolution inference (ARI), a novel approach that makes it possible to evaluate new tradeoffs between energy dissipation and model performance.
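The summary does not spell out the mechanism; one plausible reading of "adaptive resolution" is a cheap reduced-precision pass that escalates to full precision only when confidence is low, sketched below with hypothetical model interfaces and threshold.

```python
def adaptive_resolution_predict(x, low_res_model, full_model, conf_threshold=0.9):
    """Try the cheap low-resolution model first; escalate only when unsure.

    `low_res_model` and `full_model` are assumed to return (label, confidence).
    """
    label, conf = low_res_model(x)
    if conf >= conf_threshold:
        return label          # cheap path: where the energy savings come from
    return full_model(x)[0]   # expensive path, reserved for hard inputs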
arXiv Detail & Related papers (2024-08-26T16:00:26Z)
- Source-Free Domain-Invariant Performance Prediction [68.39031800809553]
We propose a source-free approach centred on uncertainty-based estimation, using a generative model for calibration in the absence of source data.
Our experiments on benchmark object recognition datasets reveal that existing source-based methods fall short with limited source sample availability.
Our approach significantly outperforms the current state-of-the-art source-free and source-based methods, affirming its effectiveness in domain-invariant performance estimation.
arXiv Detail & Related papers (2024-08-05T03:18:58Z)
- A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation [17.351089059392674]
We propose a framework for model evaluation that includes stratification, sampling, and estimation components.
We show that stratification via k-means clustering based on accurate predictions of model performance yields efficient estimators.
We also find that model-assisted estimators, which leverage predictions of model accuracy on the unlabeled portion of the dataset, are generally more efficient than traditional estimators.
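A minimal sketch of the stratify-sample-estimate pipeline in that spirit: cluster the pool, label a small sample per stratum, and combine per-stratum accuracies weighted by stratum size. The clustering features, the oracle `label_fn`, and the sample sizes are illustrative assumptions, not the paper's exact design.

```python
import numpy as np
from sklearn.cluster import KMeans

def stratified_accuracy_estimate(features, label_fn, model, k=5,
                                 per_stratum=20, seed=0):
    """Estimate model accuracy on a pool by stratified sampling.

    label_fn(i) returns the true label of pool item i (the costly oracle);
    model(i) returns the model's prediction for item i.
    """
    rng = np.random.default_rng(seed)
    strata = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(features)
    estimate = 0.0
    for s in range(k):
        pool = np.flatnonzero(strata == s)
        if len(pool) == 0:
            continue
        sample = rng.choice(pool, size=min(per_stratum, len(pool)), replace=False)
        acc_s = np.mean([label_fn(i) == model(i) for i in sample])
        estimate += (len(pool) / len(features)) * acc_s  # size-weighted combine
    return estimate
```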
arXiv Detail & Related papers (2024-06-11T14:49:04Z)
- Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation [62.2436697657307]
Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data. We propose a method called Stratified Prediction-Powered Inference (StratPPI). We show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies.
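StratPPI's contribution is the stratification; the basic PPI estimator it builds on is standard and compact: average the model's predictions over the large unlabeled set, then rectify with the mean residual measured on the small labeled set. A sketch:

```python
import numpy as np

def ppi_mean(preds_unlabeled, preds_labeled, labels):
    """Prediction-powered estimate of E[Y]: model mean on the big unlabeled
    set, plus a rectifier from the small labeled set."""
    rectifier = np.mean(np.asarray(labels) - np.asarray(preds_labeled))
    return np.mean(preds_unlabeled) + rectifier
```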
arXiv Detail & Related papers (2024-06-06T17:37:39Z) - PAC-Bayes Generalization Certificates for Learned Inductive Conformal
Prediction [27.434939269672288]
We use PAC-Bayes theory to obtain generalization bounds on the coverage and the efficiency of set-valued predictors.
We leverage these theoretical results to provide a practical algorithm for using calibration data to fine-tune the parameters of a model and score function.
We evaluate the approach on regression and classification tasks, and outperform baselines calibrated using a Hoeffding bound-based PAC guarantee on ICP.
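For context on the object being certified, below is the standard split-conformal calibration step that inductive conformal prediction (ICP) rests on; this is textbook ICP, not the paper's PAC-Bayes fine-tuning algorithm.

```python
import numpy as np

def conformal_quantile(cal_scores, alpha=0.1):
    """Finite-sample-corrected quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0), method="higher")

def prediction_set(test_scores_per_class, qhat):
    """All labels whose nonconformity score falls within the threshold."""
    return [y for y, s in enumerate(test_scores_per_class) if s <= qhat]
```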
arXiv Detail & Related papers (2023-12-07T19:40:44Z)
- Predicting Out-of-Distribution Error with Confidence Optimal Transport [17.564313038169434]
We present a simple yet effective method to predict a model's performance on an unknown distribution without any additional annotation.
We show that our method, Confidence Optimal Transport (COT), provides robust estimates of a model's performance on a target domain.
Despite its simplicity, our method achieves state-of-the-art results on three benchmark datasets and outperforms existing methods by a large margin.
arXiv Detail & Related papers (2023-02-10T02:27:13Z)
- Energy-Efficient Adaptive Machine Learning on IoT End-Nodes With Class-Dependent Confidence [22.225875583595027]
An effective way to achieve energy efficiency with small accuracy drops is to sequentially execute a set of increasingly complex models.
Current methods employ a single threshold on the output probabilities produced by each model.
We show that our method can significantly reduce the energy consumption compared to the single-threshold approach.
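A sketch of a two-stage cascade with class-dependent thresholds in the spirit of this summary; the model interfaces and threshold values are assumptions rather than the paper's implementation.

```python
import numpy as np

def cascade_predict(x, small_model, big_model, class_thresholds):
    """Accept the small model's output only if its confidence clears the
    threshold for the class it predicted; otherwise pay for the big model."""
    probs = small_model(x)                 # assumed to return class probabilities
    pred = int(np.argmax(probs))
    if probs[pred] >= class_thresholds[pred]:
        return pred                        # early stop: saves energy
    return int(np.argmax(big_model(x)))
```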
arXiv Detail & Related papers (2022-04-07T13:22:52Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples whose confidence exceeds the threshold.
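A sketch of the two ATC steps as summarized: fit the threshold on labeled source data so that the fraction of confidences above it matches source accuracy, then report the above-threshold fraction on unlabeled target data (the score function is simplified here to raw confidence).

```python
import numpy as np

def fit_atc_threshold(src_conf, src_correct):
    """Choose t so that mean(conf > t) on source matches source accuracy."""
    src_conf = np.asarray(src_conf)
    target = np.mean(src_correct)
    candidates = np.sort(src_conf)
    fractions = np.array([np.mean(src_conf > t) for t in candidates])
    return candidates[np.argmin(np.abs(fractions - target))]

def atc_predict_accuracy(tgt_conf, threshold):
    """Estimated target accuracy = fraction of target confidences above t."""
    return float(np.mean(np.asarray(tgt_conf) > threshold))
```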
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Sample-Efficient Reinforcement Learning via Conservative Model-Based Actor-Critic [67.00475077281212]
Model-based reinforcement learning algorithms are more sample efficient than their model-free counterparts.
We propose a novel approach that achieves high sample efficiency without the strong reliance on accurate learned models.
We show that CMBAC significantly outperforms state-of-the-art approaches in terms of sample efficiency on several challenging tasks.
arXiv Detail & Related papers (2021-12-16T15:33:11Z)
- DEALIO: Data-Efficient Adversarial Learning for Imitation from Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior, without access to the control signals generated by the demonstrator.
Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms.
This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk.
We propose a more data-efficient IfO algorithm.
arXiv Detail & Related papers (2021-03-31T23:46:32Z)
- BERT Loses Patience: Fast and Robust Inference with Early Exit [91.26199404912019]
We propose Patience-based Early Exit as a plug-and-play technique to improve the efficiency and robustness of a pretrained language model.
Our approach improves inference efficiency as it allows the model to make a prediction with fewer layers.
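A sketch of the patience rule, assuming an internal classifier attached to each layer: exit as soon as `patience` consecutive classifiers return the same prediction.

```python
def patience_early_exit(layer_classifiers, hidden_states, patience=3):
    """Exit once `patience` consecutive internal classifiers agree."""
    prev, streak = None, 0
    for clf, h in zip(layer_classifiers, hidden_states):
        pred = clf(h)          # prediction from this layer's classifier
        streak = streak + 1 if pred == prev else 1
        prev = pred
        if streak >= patience:
            return pred        # stable prediction: stop before the last layer
    return prev                # fell through: use the final layer's prediction
```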
arXiv Detail & Related papers (2020-06-07T13:38:32Z)