Agreement-Based Cascading for Efficient Inference
- URL: http://arxiv.org/abs/2407.02348v2
- Date: Fri, 06 Dec 2024 20:21:37 GMT
- Title: Agreement-Based Cascading for Efficient Inference
- Authors: Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith
- Abstract summary: Agreement-Based Cascading (ABC) is a simple, effective adaptive inference technique. ABC builds a cascade of models of increasing size/complexity, and uses agreement between ensembles of models at each level of the cascade as a basis for data-dependent routing. We show that ABC can reliably act as a drop-in replacement for existing models and surpass the best single model it aims to replace in terms of both efficiency and accuracy.
- Score: 32.914852531806
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adaptive inference schemes reduce the cost of machine learning inference by assigning smaller models to easier examples, attempting to avoid invocation of larger models when possible. In this work we explore a simple, effective adaptive inference technique we term Agreement-Based Cascading (ABC). ABC builds a cascade of models of increasing size/complexity, and uses agreement between ensembles of models at each level of the cascade as a basis for data-dependent routing. Although ensemble execution introduces additional expense, we show that these costs can be easily offset in practice due to large expected differences in model sizes, parallel inference execution capabilities, and accuracy benefits of ensembling. We examine ABC theoretically and empirically in terms of these parameters, showing that the approach can reliably act as a drop-in replacement for existing models and surpass the best single model it aims to replace in terms of both efficiency and accuracy. Additionally, we explore the performance of ABC relative to existing cascading methods in three common scenarios: (1) edge-to-cloud inference, where ABC reduces communication costs by up to 14x; (2) cloud-based model serving, where it achieves a 3x reduction in rental costs; and (3) inference via model API services, where ABC achieves a 2-25x reduction in average price per token/request relative to state-of-the-art LLM cascades.
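As a rough illustration of the routing rule (a minimal sketch, not the authors' exact implementation; `cascade` is assumed to be a list of model ensembles ordered from smallest to largest):

```python
from collections import Counter

def abc_predict(x, cascade, min_agreement=1.0):
    """Agreement-Based Cascading sketch: return the majority vote of the
    first (cheapest) ensemble whose members agree strongly enough."""
    for ensemble in cascade[:-1]:
        votes = Counter(model(x) for model in ensemble)
        label, count = votes.most_common(1)[0]
        if count / len(ensemble) >= min_agreement:
            return label              # agreement reached: skip larger levels
    # no smaller level agreed: fall back to the final (largest) ensemble
    return Counter(m(x) for m in cascade[-1]).most_common(1)[0][0]
```

With `min_agreement=1.0` an example is routed onward whenever the ensemble is not unanimous; lowering the threshold trades accuracy for fewer invocations of the larger levels.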
Related papers
- Bi-directional Model Cascading with Proxy Confidence [3.1890398692194326]
We propose a bi-directional approach to deferral that considers the confidence of small and large models in the cascade simultaneously.
We use an analysis of hidden states to improve post-invocation confidence of the small model.
We then combine this with a tiny proxy model to estimate pre-invocation confidence of the large model.
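A minimal sketch of such a bi-directional deferral rule (hypothetical API: the small model returns a label with its post-invocation confidence, and a tiny proxy estimates the large model's confidence before it is invoked):

```python
def route(x, small_model, proxy, large_model, margin=0.0):
    """Defer only when the proxy predicts the large model will be more
    confident than the small model actually was (illustrative sketch)."""
    label_small, conf_small = small_model(x)   # post-invocation confidence
    conf_large_est = proxy(x)                  # pre-invocation estimate
    if conf_large_est > conf_small + margin:   # large model expected to win
        label_large, _ = large_model(x)
        return label_large
    return label_small
```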
arXiv Detail & Related papers (2025-04-27T23:48:14Z)
- Translate Smart, not Hard: Cascaded Translation Systems with Quality-Aware Deferral [28.382040322550775]
We propose a simple yet effective approach for machine translation using existing quality estimation (QE) metrics as deferral rules.
We show that QE-based deferral allows a cascaded system to match the performance of a larger model while invoking it on only a small fraction of inputs.
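A minimal sketch of QE-based deferral, assuming a reference-free `qe_score(src, hyp)` in [0, 1] (names hypothetical):

```python
def translate(src, small_mt, large_mt, qe_score, threshold=0.8):
    """Quality-aware deferral sketch: keep the small model's translation
    when its estimated quality is high enough, else defer."""
    hyp = small_mt(src)
    if qe_score(src, hyp) >= threshold:   # estimated quality is sufficient
        return hyp
    return large_mt(src)                  # invoke the large model rarely
```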
arXiv Detail & Related papers (2025-02-18T10:05:40Z)
- FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing [17.01412432658081]
Large language models (LLMs) have demonstrated superior performance across various tasks by adhering to scaling laws.
We propose a fine-grained token-wise pruning approach for LLMs, which uses a learnable router to adaptively identify the less important tokens.
Our approach achieves state-of-the-art (SOTA) pruning results, surpassing other existing pruning methods.
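A minimal sketch of token-wise routing in this spirit, assuming a learnable scorer such as `torch.nn.Linear(dim, 1)` (not the paper's exact router):

```python
import torch

def prune_tokens(hidden, router, keep_ratio=0.5):
    """Keep only the highest-scoring fraction of tokens per example.
    hidden: (batch, seq, dim); router maps dim -> 1 importance score."""
    scores = router(hidden).squeeze(-1)           # (batch, seq)
    k = max(1, int(hidden.size(1) * keep_ratio))  # tokens kept per example
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values  # preserve order
    return hidden.gather(1, idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))
```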
arXiv Detail & Related papers (2024-12-16T07:09:46Z)
- Dividable Configuration Performance Learning [4.949726352498762]
We propose a model-agnostic and sparsity-robust framework for predicting configuration performance, dubbed DaL.
DaL is based on the new paradigm of dividable learning, which builds a model via "divide-and-learn".
arXiv Detail & Related papers (2024-09-11T21:23:23Z)
- Faster Cascades via Speculative Decoding [66.16909847419198]
Cascades and speculative decoding are approaches to improving language models' inference efficiency.
We propose new speculative cascading techniques that implement their deferral rule through speculative execution.
We show that our approach yields better cost-quality trade-offs than cascading and speculative decoding baselines.
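A highly simplified sketch of the idea (hypothetical API; the paper's deferral rules and parallel block verification are more refined):

```python
def speculative_cascade(prompt, small_lm, large_lm, steps, accept):
    """Draft each token with the small model, score it with the large model,
    and apply a per-token deferral rule (illustrative only)."""
    out = list(prompt)
    for _ in range(steps):
        draft = small_lm.next_token(out)      # cheap proposal
        verified = large_lm.next_token(out)   # scored in parallel in practice
        out.append(draft if accept(draft, verified) else verified)
    return out
```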
arXiv Detail & Related papers (2024-05-29T16:55:08Z)
- Transferable and Principled Efficiency for Open-Vocabulary Segmentation [82.66423763561697]
The recent success of pre-trained vision-language foundation models makes Open-Vocabulary Segmentation (OVS) possible.
This approach, however, introduces heavy computational overheads from two sources: 1) the large model size of the backbone; 2) the expensive cost of fine-tuning.
In this paper, we aim to achieve performance that is comparable to or even better than prior OVS works based on large vision-language foundation models.
arXiv Detail & Related papers (2024-04-11T03:08:53Z)
- Cabrita: closing the gap for foreign languages [0.0]
The strategy of training the model from scratch in a specific language or domain serves two essential purposes.
The main solution to the cost challenge is to rely on available pre-trained models.
We present a methodology named Cabrita, which successfully addresses the performance and tokenization-efficiency problems.
arXiv Detail & Related papers (2023-08-23T02:49:35Z)
- Systematic compactification of the two-channel Kondo model. II. Comparative study of scaling and universality [44.99833362998488]
We study scaling using Anderson's simple poor man's scaling procedure.
We unveil a universal agreement among the three models in how they flow upon scaling.
arXiv Detail & Related papers (2023-08-07T13:46:45Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
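A minimal sketch of one distribution-matching update, where the synthetic set is nudged so its feature mean matches the real data's under an embedder `embed` (a simplification of the method):

```python
import torch

def dm_step(real, syn, embed, lr=0.1):
    """Match first moments of embedded real vs. synthetic data and take a
    gradient step directly on the synthetic pixels (illustrative sketch)."""
    syn = syn.detach().requires_grad_(True)
    loss = (embed(real).mean(0) - embed(syn).mean(0)).pow(2).sum()
    loss.backward()
    with torch.no_grad():
        syn -= lr * syn.grad                  # update the condensed set
    return syn.detach()
```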
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- On Optimal Caching and Model Multiplexing for Large Model Inference [66.50550915522551]
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges.
We study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
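The serving loop these two components imply can be sketched as follows (hypothetical names; the paper analyzes optimal policies for both pieces):

```python
def serve(query, cache, multiplexer, models):
    """Answer repeated queries from the cache; otherwise let a learned
    multiplexer pick which model in the ensemble handles the query."""
    if query in cache:                  # e.g. a dict or an LRU cache
        return cache[query]
    model = models[multiplexer(query)]  # multiplexer returns a model index
    cache[query] = model(query)
    return cache[query]
```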
arXiv Detail & Related papers (2023-06-03T05:01:51Z)
- Design and Prototyping Distributed CNN Inference Acceleration in Edge Computing [85.74517957717363]
HALP accelerates inference through seamless collaboration among edge devices (EDs) in edge computing.
Experiments show that the distributed inference HALP achieves 1.7x inference acceleration for VGG-16.
It is shown that the model selection with distributed inference HALP can significantly improve service reliability.
arXiv Detail & Related papers (2022-11-24T19:48:30Z)
- Slimmable Domain Adaptation [112.19652651687402]
We introduce a simple framework, Slimmable Domain Adaptation, to improve cross-domain generalization with a weight-sharing model bank.
Our framework surpasses other competing approaches by a very large margin on multiple benchmarks.
arXiv Detail & Related papers (2022-06-14T06:28:04Z)
- DualCF: Efficient Model Extraction Attack from Counterfactual Explanations [57.46134660974256]
Cloud service providers have launched Machine-Learning-as-a-Service platforms to allow users to access large-scale cloud-based models via APIs.
The extra information carried by counterfactual explanations inevitably makes these cloud models more vulnerable to extraction attacks.
We propose a simple yet efficient querying strategy that greatly improves query efficiency when stealing a classification model.
arXiv Detail & Related papers (2022-05-13T08:24:43Z)
- ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training [65.68511423300812]
We propose ProgFed, a progressive training framework for efficient and effective federated learning.
ProgFed inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models.
Our results show that ProgFed converges at the same rate as standard training on full models.
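A minimal sketch of progressive federated training, assuming a hypothetical client API that trains only a prefix of the model:

```python
def progfed(rounds, num_stages, clients, aggregate):
    """Clients train a growing prefix of the model, so early rounds compute
    and communicate only a fraction of it (illustrative sketch)."""
    for r in range(rounds):
        depth = min(num_stages, 1 + r * num_stages // rounds)  # grow prefix
        updates = [c.train_prefix(depth) for c in clients]     # partial model
        aggregate(updates)          # e.g. FedAvg over active parameters only
```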
arXiv Detail & Related papers (2021-10-11T14:45:00Z)
- Variational Inference with NoFAS: Normalizing Flow with Adaptive Surrogate for Computationally Expensive Models [7.217783736464403]
Use of sampling-based approaches such as Markov chain Monte Carlo may become intractable when each likelihood evaluation is computationally expensive.
New approaches combining variational inference with normalizing flow are characterized by a computational cost that grows only linearly with the dimensionality of the latent variable space.
We propose Normalizing Flow with Adaptive Surrogate (NoFAS), an optimization strategy that alternatively updates the normalizing flow parameters and the weights of a neural network surrogate model.
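One alternating update of this kind can be sketched as follows (all names hypothetical; the actual objective and surrogate refinement schedule are in the paper):

```python
def nofas_step(flow, surrogate, true_model, flow_opt, surr_opt,
               sample_z, log_post):
    """Alternate: (a) update the flow against the cheap surrogate,
    (b) refit the surrogate on fresh expensive-model evaluations."""
    z, logdet = flow(sample_z())                       # sample + log-det
    loss = -(log_post(surrogate(z)) + logdet).mean()   # variational step
    flow_opt.zero_grad(); loss.backward(); flow_opt.step()

    z_d = z.detach()
    surr_loss = ((surrogate(z_d) - true_model(z_d)) ** 2).mean()
    surr_opt.zero_grad(); surr_loss.backward(); surr_opt.step()
```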
arXiv Detail & Related papers (2021-08-28T14:31:45Z)
- Optimal Model Placement and Online Model Splitting for Device-Edge Co-Inference [22.785214118527872]
Device-edge co-inference opens up new possibilities for resource-constrained wireless devices to execute deep neural network (DNN)-based applications.
We study the joint optimization of the model placement and online model splitting decisions to minimize the energy-and-time cost of device-edge co-inference.
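In the simplest offline form, the split-point search reduces to a one-dimensional scan (a sketch; the paper treats the online and placement-coupled case):

```python
def best_split(num_layers, device_cost, edge_cost, upload_cost):
    """Pick the layer index after which to offload, minimizing on-device
    compute + uplink transfer of the activation + edge compute.
    upload_cost has num_layers + 1 entries (one per candidate split)."""
    costs = [sum(device_cost[:s]) + upload_cost[s] + sum(edge_cost[s:])
             for s in range(num_layers + 1)]
    return min(range(len(costs)), key=costs.__getitem__)
```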
arXiv Detail & Related papers (2021-05-28T06:55:04Z)
- Efficient Marginalization of Discrete and Structured Latent Variables via Sparsity [26.518803984578867]
Training neural network models with discrete (categorical or structured) latent variables can be computationally challenging.
One typically resorts to sampling-based approximations of the true marginal.
We propose a new training strategy which replaces these estimators by an exact yet efficient marginalization.
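A minimal 1-D sketch of the idea using sparsemax, which zeroes out most latent assignments so the expectation can be summed exactly (a simplification; the paper also handles structured latents):

```python
import torch

def sparse_marginal(scores, losses):
    """Map latent scores through sparsemax, then sum losses over the small
    support exactly instead of sampling (1-D illustrative case)."""
    z, _ = scores.sort(descending=True)
    k = torch.arange(1, len(z) + 1, dtype=scores.dtype)
    support = 1 + k * z > z.cumsum(0)          # sparsemax support condition
    tau = (z[support].sum() - 1) / support.sum()
    p = torch.clamp(scores - tau, min=0.0)     # sparse distribution over z
    return (p * losses).sum()                  # exact, cheap expectation
```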
arXiv Detail & Related papers (2020-07-03T19:36:35Z)
- Triple Wins: Boosting Accuracy, Robustness and Efficiency Together by Enabling Input-Adaptive Inference [119.19779637025444]
Deep networks were recently suggested to face a trade-off between accuracy (on clean natural images) and robustness (on adversarially perturbed images).
This paper studies multi-exit networks with input-adaptive inference, showing their strong promise in achieving a "sweet point" in co-optimizing model accuracy, robustness, and efficiency.
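A minimal single-example sketch of input-adaptive early exiting (not the paper's full robust-training setup):

```python
import torch

def early_exit(x, blocks, heads, threshold=0.9):
    """Run blocks in sequence; after each one a side classifier may stop
    computation once its softmax confidence clears the threshold."""
    h = x
    probs = None
    for block, head in zip(blocks, heads):
        h = block(h)
        probs = torch.softmax(head(h), dim=-1)
        if probs.max().item() >= threshold:   # confident enough: exit early
            return probs.argmax(dim=-1)
    return probs.argmax(dim=-1)               # final exit (deepest head)
```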
arXiv Detail & Related papers (2020-02-24T00:40:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.