Revisiting Cascaded Ensembles for Efficient Inference
- URL: http://arxiv.org/abs/2407.02348v1
- Date: Tue, 2 Jul 2024 15:14:12 GMT
- Title: Revisiting Cascaded Ensembles for Efficient Inference
- Authors: Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith
- Abstract summary: A common approach to making machine learning inference more efficient is to use example-specific adaptive schemes.
In this work we study a simple scheme for adaptive inference.
We build a cascade of ensembles (CoE), beginning with resource-efficient models and growing to larger, more expressive models.
- Score: 32.914852531806
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A common approach to making machine learning inference more efficient is to use example-specific adaptive schemes, which route or select models for each example at inference time. In this work we study a simple scheme for adaptive inference. We build a cascade of ensembles (CoE), beginning with resource-efficient models and growing to larger, more expressive models, where ensemble agreement serves as a data-dependent routing criterion. This scheme is easy to incorporate into existing inference pipelines, requires no additional training, and can be used to place models across multiple resource tiers--for instance, serving efficient models at the edge and invoking larger models in the cloud only when necessary. In cases where parallel inference is feasible, we show that CoE can improve accuracy relative to the single best model while reducing the average cost of inference by up to 7x, and provides Pareto-dominant solutions in accuracy and efficiency relative to existing adaptive inference baselines. These savings translate to an over 3x reduction in total monetary cost when performing inference using a heterogeneous cluster of GPUs. Finally, for edge inference scenarios where portions of the cascade reside at the edge vs. in the cloud, CoE can provide a 14x reduction in communication cost and inference latency without sacrificing accuracy.
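The agreement-based routing at the core of CoE is easy to express in code. Below is a minimal sketch, assuming classifiers exposed as plain callables; the tier structure, agreement threshold, and toy constant classifiers are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def cascade_of_ensembles(x, tiers, min_agreement=1.0):
    """Route input x through tiers of increasingly expensive ensembles.

    tiers: list of ensembles, cheapest first; each ensemble is a list
    of callables mapping an input to a class label.
    min_agreement: fraction of members that must agree before we stop
    and return the majority vote instead of escalating.
    """
    prediction = None
    for ensemble in tiers:
        votes = Counter(model(x) for model in ensemble)
        prediction, count = votes.most_common(1)[0]
        if count / len(ensemble) >= min_agreement:
            return prediction  # members agree: no need to escalate
    return prediction  # fell through to the last, most expressive tier

# Toy usage: the cheap tier disagrees, so the input is routed onward.
tiers = [
    [lambda x: 0, lambda x: 1],   # cheap tier: split vote -> escalate
    [lambda x: 1, lambda x: 1],   # mid tier: unanimous -> stop here
]
print(cascade_of_ensembles(None, tiers))  # -> 1
```

Because each tier runs only when the previous one disagrees, easy examples exit early, which is where the reported cost savings come from.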
Related papers
- Translate Smart, not Hard: Cascaded Translation Systems with Quality-Aware Deferral [28.382040322550775]
We propose a simple yet effective approach for machine translation using existing quality estimation (QE) metrics as deferral rules.
We show that QE-based deferral allows a cascaded system to match the performance of a larger model while invoking it on only a small fraction of inputs.
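A minimal sketch of such a deferral rule, assuming a QE metric that scores a (source, translation) pair in [0, 1]; the threshold and model interfaces here are illustrative assumptions.

```python
def translate_with_deferral(src, small_model, large_model, qe_score,
                            threshold=0.8):
    """Translate with the small model; defer to the large model only
    when the quality-estimation score of the draft is below threshold."""
    draft = small_model(src)
    if qe_score(src, draft) >= threshold:
        return draft            # good enough: keep the cheap translation
    return large_model(src)     # low estimated quality: defer

# Toy usage with stand-in models and a fake QE metric.
small = lambda s: s.upper()
large = lambda s: s.title()
qe = lambda src, hyp: 0.9 if hyp else 0.0
print(translate_with_deferral("hello world", small, large, qe))
```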
arXiv Detail & Related papers (2025-02-18T10:05:40Z)
- FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing [17.01412432658081]
Large language models (LLMs) have demonstrated superior performance across various tasks by adhering to scaling laws.
We propose a fine-grained token-wise pruning approach for the LLMs, which presents a learnable router to adaptively identify the less important tokens.
Our approach achieves state-of-the-art (SOTA) pruning results, surpassing other existing pruning methods.
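A toy sketch of token-wise pruning, with a random linear scorer standing in for the paper's learnable router; shapes and the keep ratio are illustrative assumptions.

```python
import numpy as np

def route_and_prune(hidden, router_w, keep_ratio=0.5):
    """Score each token with a (toy) linear router and keep only the
    top fraction, preserving the original token order.

    hidden: (seq_len, dim) token representations.
    router_w: (dim,) router weights giving one importance score per token.
    """
    scores = hidden @ router_w                   # (seq_len,) importance
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])      # top-k, original order
    return hidden[keep], keep

rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 16))                # 8 tokens, dim 16
pruned, kept = route_and_prune(hidden, rng.normal(size=16))
print(kept, pruned.shape)                        # kept indices, (4, 16)
```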
arXiv Detail & Related papers (2024-12-16T07:09:46Z)
- Faster Cascades via Speculative Decoding [66.16909847419198]
Cascades and speculative decoding are approaches to improving language models' inference efficiency.
We propose new speculative cascading techniques that implement their deferral rule through speculative execution.
We show that our approach yields better cost-quality trade-offs than cascading and speculative decoding baselines.
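A simplified, sequential sketch of per-token speculative deferral; the real method scores draft tokens in parallel and uses probabilistic acceptance, and the integer next-token functions below are stand-ins.

```python
def speculative_cascade(prompt, small_next, large_next,
                        draft_len=4, max_len=12):
    """Greedy toy variant: the small model drafts a block of tokens,
    the large model checks each one, and the first disagreement is
    replaced by the large model's token (deferral happens per token)."""
    out = list(prompt)
    while len(out) < max_len:
        draft = []
        for _ in range(draft_len):
            draft.append(small_next(out + draft))
        for i, tok in enumerate(draft):
            verified = large_next(out + draft[:i])
            out.append(verified)      # equals tok when the models agree
            if verified != tok:
                break                 # discard the rest of the draft
    return out[:max_len]

# Toy next-token functions over integer "tokens".
small = lambda seq: (seq[-1] + 1) % 5
large = lambda seq: (seq[-1] + 1) % 7
print(speculative_cascade([0], small, large))
```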
arXiv Detail & Related papers (2024-05-29T16:55:08Z)
- Transferable and Principled Efficiency for Open-Vocabulary Segmentation [82.66423763561697]
The recent success of pre-trained vision-language foundation models makes Open-Vocabulary Segmentation (OVS) possible.
This approach, however, introduces heavy computational overheads from two sources: 1) the large model size of the backbone; 2) the expensive cost of fine-tuning.
In this paper, we aim to achieve performance that is comparable to or even better than prior OVS works based on large vision-language foundation models.
arXiv Detail & Related papers (2024-04-11T03:08:53Z)
- Cabrita: closing the gap for foreign languages [0.0]
The strategy of training the model from scratch in a specific language or domain serves two essential purposes.
The main solution to overcoming the cost challenge is to rely on available pre-trained models.
We present a methodology named Cabrita, which successfully addresses both the performance and the efficient tokenization problems.
arXiv Detail & Related papers (2023-08-23T02:49:35Z)
- Systematic compactification of the two-channel Kondo model. II. Comparative study of scaling and universality [44.99833362998488]
We study scaling using Anderson's simple poor man's procedure.
We unveil a universal agreement among the three models in how they flow upon scaling.
arXiv Detail & Related papers (2023-08-07T13:46:45Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST), a recently proposed and highly effective technique for distributed training of large models.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
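A toy illustration of the partition-train-merge pattern behind IST, kept to least squares so it is self-contained; the two-worker split, step counts, and model are illustrative assumptions rather than the paper's setting.

```python
import numpy as np

def ist_round(w, X, y, blocks, lr=0.1, local_steps=5):
    """One round of toy Independent Subnetwork Training on least
    squares: each 'worker' owns a disjoint block of coordinates, takes
    local gradient steps on that block alone, and the blocks are then
    merged. Only the owned parameters are communicated, not gradients."""
    new_w = w.copy()
    for block in blocks:                  # conceptually runs in parallel
        wb = w.copy()                     # every worker starts from w
        for _ in range(local_steps):
            grad = X.T @ (X @ wb - y) / len(y)
            wb[block] -= lr * grad[block] # update only the owned block
        new_w[block] = wb[block]          # merge the worker's block back
    return new_w

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 6))
w_true = rng.normal(size=6)
y = X @ w_true
w = np.zeros(6)
blocks = [np.arange(0, 3), np.arange(3, 6)]  # two workers, disjoint blocks
for _ in range(50):
    w = ist_round(w, X, y, blocks)
print(np.round(np.abs(w - w_true).max(), 3))  # error shrinks toward zero
```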
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- On Optimal Caching and Model Multiplexing for Large Model Inference [66.50550915522551]
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges.
We study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
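A minimal sketch of the combined control flow, with a plain dict as the cache and a length heuristic standing in for the learned multiplexer; all names and rules here are illustrative assumptions.

```python
def serve(query, cache, small_model, large_model, is_hard):
    """Toy cache + model multiplexer: answer from the cache when
    possible, otherwise route the query to a small or large model
    based on a difficulty predictor, then cache the result."""
    if query in cache:
        return cache[query]              # cache hit: zero model cost
    model = large_model if is_hard(query) else small_model
    answer = model(query)
    cache[query] = answer                # store for repeated queries
    return answer

cache = {}
small = lambda q: f"small:{q}"
large = lambda q: f"large:{q}"
hard = lambda q: len(q) > 10             # stand-in difficulty signal
print(serve("short", cache, small, large, hard))    # routed to small
print(serve("a much longer query", cache, small, large, hard))  # large
print(serve("short", cache, small, large, hard))    # served from cache
```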
arXiv Detail & Related papers (2023-06-03T05:01:51Z)
- Quantized Adaptive Subgradient Algorithms and Their Applications [39.103587572626026]
We propose quantized composite mirror descent adaptive subgradient (QCMD adagrad) and quantized regularized dual average adaptive subgradient (QRDA adagrad) for distributed training.
A quantized gradient-based adaptive learning rate matrix is constructed to achieve a balance between communication costs, accuracy, and model sparsity.
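A rough sketch pairing a uniform quantizer with an AdaGrad-style accumulator, which plays the role of the adaptive learning-rate matrix; the quantizer and hyperparameters are illustrative stand-ins, not the QCMD/QRDA algorithms themselves.

```python
import numpy as np

def quantize(g, levels=16):
    """Uniform deterministic quantizer: snap each coordinate of g to
    the nearest of `levels` evenly spaced values; transmitting grid
    indices instead of floats is what saves communication."""
    step = 2 * np.abs(g).max() / (levels - 1) + 1e-12
    return np.round(g / step) * step

def quantized_adagrad_step(w, g, accum, lr=0.5, eps=1e-8):
    """One adaptive-subgradient step on a quantized gradient."""
    q = quantize(g)
    accum += q ** 2                       # per-coordinate curvature proxy
    w -= lr * q / (np.sqrt(accum) + eps)
    return w, accum

target = np.array([1.0, -2.0, 0.5, 0.0])
w, accum = np.zeros(4), np.zeros(4)
for _ in range(200):
    g = w - target                        # gradient of 0.5*||w - target||^2
    w, accum = quantized_adagrad_step(w, g, accum)
print(np.round(w, 2))                     # approximately the target
```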
arXiv Detail & Related papers (2022-08-11T04:04:03Z)
- DualCF: Efficient Model Extraction Attack from Counterfactual Explanations [57.46134660974256]
Cloud service providers have launched Machine-Learning-as-a-Service platforms to allow users to access large-scale cloud-based models via APIs.
The extra information exposed by counterfactual explanations inevitably makes these cloud models more vulnerable to extraction attacks.
We propose a simple yet efficient querying strategy that greatly improves the efficiency of stealing a classification model.
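A toy, one-dimensional illustration of why counterfactual explanations make extraction cheap: each query returns a point just across the decision boundary with the opposite label, so a handful of queries pin the boundary down. The API, boundary value, and surrogate rule are all hypothetical.

```python
import numpy as np

def target_api(x, boundary=0.37):
    """Stand-in MLaaS classifier that also returns a counterfactual:
    the nearest input that would receive the opposite label."""
    label = int(x > boundary)
    cf = boundary + (1e-3 if label == 0 else -1e-3)
    return label, cf

def extract(n_queries=8, seed=0):
    """Each query yields (x, label) plus (counterfactual, flipped
    label); the counterfactuals hug the boundary, so the surrogate's
    threshold converges with very few queries."""
    rng = np.random.default_rng(seed)
    pts, labels = [], []
    for _ in range(n_queries):
        x = rng.uniform(0, 1)
        label, cf = target_api(x)
        pts += [x, cf]
        labels += [label, 1 - label]
    pts, labels = np.array(pts), np.array(labels)
    # Surrogate boundary: midpoint between closest opposite-label points.
    return (pts[labels == 0].max() + pts[labels == 1].min()) / 2

print(round(extract(), 3))   # recovers a value very close to 0.37
```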
arXiv Detail & Related papers (2022-05-13T08:24:43Z)
- ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training [65.68511423300812]
We propose ProgFed, a progressive training framework for efficient and effective federated learning.
ProgFed inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models.
Our results show that ProgFed converges at the same rate as standard training on full models.
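A sketch of the progressive-growth idea reduced to a schedule over rounds; stage counts and depths are illustrative, and the actual method grows a real model (and its update payloads) rather than printing a plan.

```python
def progfed_schedule(total_rounds, num_stages, full_depth):
    """Toy ProgFed-style schedule: train a shallow prefix of the model
    first and progressively activate deeper layers, so early rounds
    compute less and communicate smaller updates."""
    rounds_per_stage = total_rounds // num_stages
    plan = []
    for r in range(total_rounds):
        stage = min(r // rounds_per_stage, num_stages - 1)
        depth = (stage + 1) * full_depth // num_stages
        plan.append(depth)   # layers trained/communicated this round
    return plan

print(progfed_schedule(total_rounds=12, num_stages=4, full_depth=8))
# -> [2, 2, 2, 4, 4, 4, 6, 6, 6, 8, 8, 8]
```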
arXiv Detail & Related papers (2021-10-11T14:45:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.