Revisiting Cascaded Ensembles for Efficient Inference
- URL: http://arxiv.org/abs/2407.02348v1
- Date: Tue, 2 Jul 2024 15:14:12 GMT
- Title: Revisiting Cascaded Ensembles for Efficient Inference
- Authors: Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith
- Abstract summary: A common approach to making machine learning inference more efficient is to use example-specific adaptive schemes.
In this work we study a simple scheme for adaptive inference.
We build a cascade of ensembles (CoE), beginning with resource-efficient models and growing to larger, more expressive models.
- Score: 32.914852531806
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A common approach to making machine learning inference more efficient is to use example-specific adaptive schemes, which route or select models for each example at inference time. In this work we study a simple scheme for adaptive inference. We build a cascade of ensembles (CoE), beginning with resource-efficient models and growing to larger, more expressive models, where ensemble agreement serves as a data-dependent routing criterion. This scheme is easy to incorporate into existing inference pipelines, requires no additional training, and can be used to place models across multiple resource tiers--for instance, serving efficient models at the edge and invoking larger models in the cloud only when necessary. In cases where parallel inference is feasible, we show that CoE can improve accuracy relative to the single best model while reducing the average cost of inference by up to 7x, and provides Pareto-dominant solutions in accuracy and efficiency relative to existing adaptive inference baselines. These savings translate to an over 3x reduction in total monetary cost when performing inference using a heterogeneous cluster of GPUs. Finally, for edge inference scenarios where portions of the cascade reside at the edge vs. in the cloud, CoE can provide a 14x reduction in communication cost and inference latency without sacrificing accuracy.
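The agreement-based routing at the core of CoE is easy to express in code. Below is a minimal sketch, assuming classifiers exposed as plain callables; the tier structure, agreement threshold, and toy constant classifiers are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def cascade_of_ensembles(x, tiers, min_agreement=1.0):
    """Route input x through tiers of increasingly expensive ensembles.

    tiers: list of ensembles, cheapest first; each ensemble is a list
    of callables mapping an input to a class label.
    min_agreement: fraction of members that must agree before we stop
    and return the majority vote instead of escalating.
    """
    prediction = None
    for ensemble in tiers:
        votes = Counter(model(x) for model in ensemble)
        prediction, count = votes.most_common(1)[0]
        if count / len(ensemble) >= min_agreement:
            return prediction  # members agree: no need to escalate
    return prediction  # fell through to the last, most expressive tier

# Toy usage: the cheap tier disagrees, so the input is routed onward.
tiers = [
    [lambda x: 0, lambda x: 1],   # cheap tier: split vote -> escalate
    [lambda x: 1, lambda x: 1],   # mid tier: unanimous -> stop here
]
print(cascade_of_ensembles(None, tiers))  # -> 1
```

Because each tier runs only when the previous one disagrees, easy examples exit early, which is where the reported cost savings come from.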
Related papers
- Translate Smart, not Hard: Cascaded Translation Systems with Quality-Aware Deferral [28.382040322550775]
We propose a simple yet effective approach for machine translation using existing quality estimation (QE) metrics as deferral rules.
We show that QE-based deferral allows a cascaded system to match the performance of a larger model while invoking it on only a small fraction of inputs.
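A minimal sketch of such a deferral rule, assuming a QE metric that scores a (source, translation) pair in [0, 1]; the threshold and model interfaces here are illustrative assumptions.

```python
def translate_with_deferral(src, small_model, large_model, qe_score,
                            threshold=0.8):
    """Translate with the small model; defer to the large model only
    when the quality-estimation score of the draft is below threshold."""
    draft = small_model(src)
    if qe_score(src, draft) >= threshold:
        return draft            # good enough: keep the cheap translation
    return large_model(src)     # low estimated quality: defer

# Toy usage with stand-in models and a fake QE metric.
small = lambda s: s.upper()
large = lambda s: s.title()
qe = lambda src, hyp: 0.9 if hyp else 0.0
print(translate_with_deferral("hello world", small, large, qe))
```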
arXiv Detail & Related papers (2025-02-18T10:05:40Z)
- FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing [17.01412432658081]
Large language models (LLMs) have demonstrated superior performance across various tasks by adhering to scaling laws.
We propose a fine-grained token-wise pruning approach for the LLMs, which presents a learnable router to adaptively identify the less important tokens.
Our approach achieves state-of-the-art (SOTA) pruning results, surpassing other existing pruning methods.
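A toy sketch of token-wise pruning, with a random linear scorer standing in for the paper's learnable router; shapes and the keep ratio are illustrative assumptions.

```python
import numpy as np

def route_and_prune(hidden, router_w, keep_ratio=0.5):
    """Score each token with a (toy) linear router and keep only the
    top fraction, preserving the original token order.

    hidden: (seq_len, dim) token representations.
    router_w: (dim,) router weights giving one importance score per token.
    """
    scores = hidden @ router_w                   # (seq_len,) importance
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])      # top-k, original order
    return hidden[keep], keep

rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 16))                # 8 tokens, dim 16
pruned, kept = route_and_prune(hidden, rng.normal(size=16))
print(kept, pruned.shape)                        # kept indices, (4, 16)
```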
arXiv Detail & Related papers (2024-12-16T07:09:46Z)
- Faster Cascades via Speculative Decoding [66.16909847419198]
Cascades and speculative decoding are approaches to improving language models' inference efficiency.
We propose new speculative cascading techniques that implement their deferral rule through speculative execution.
We show that our approach yields better cost-quality trade-offs than cascading and speculative decoding baselines.
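A simplified, sequential sketch of per-token speculative deferral; the real method scores draft tokens in parallel and uses probabilistic acceptance, and the integer next-token functions below are stand-ins.

```python
def speculative_cascade(prompt, small_next, large_next,
                        draft_len=4, max_len=12):
    """Greedy toy variant: the small model drafts a block of tokens,
    the large model checks each one, and the first disagreement is
    replaced by the large model's token (deferral happens per token)."""
    out = list(prompt)
    while len(out) < max_len:
        draft = []
        for _ in range(draft_len):
            draft.append(small_next(out + draft))
        for i, tok in enumerate(draft):
            verified = large_next(out + draft[:i])
            out.append(verified)      # equals tok when the models agree
            if verified != tok:
                break                 # discard the rest of the draft
    return out[:max_len]

# Toy next-token functions over integer "tokens".
small = lambda seq: (seq[-1] + 1) % 5
large = lambda seq: (seq[-1] + 1) % 7
print(speculative_cascade([0], small, large))
```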
arXiv Detail & Related papers (2024-05-29T16:55:08Z)
- Transferable and Principled Efficiency for Open-Vocabulary Segmentation [82.66423763561697]
The recent success of pre-trained vision-language foundation models makes Open-Vocabulary Segmentation (OVS) possible.
This approach, however, introduces heavy computational overheads from two sources: 1) the large model size of the backbone; 2) the expensive cost of fine-tuning.
In this paper, we aim to achieve performance that is comparable to or even better than prior OVS works based on large vision-language foundation models.
arXiv Detail & Related papers (2024-04-11T03:08:53Z)
- Cabrita: closing the gap for foreign languages [0.0]
The strategy of training the model from scratch in a specific language or domain serves two essential purposes.
The main solution to overcoming the cost challenge is to rely on available pre-trained models.
We present a methodology named Cabrita, which successfully addresses both the performance and the efficient tokenization problems.
arXiv Detail & Related papers (2023-08-23T02:49:35Z)
- Systematic compactification of the two-channel Kondo model. II. Comparative study of scaling and universality [44.99833362998488]
We study scaling using Anderson's simple poor man's procedure.
We unveil a universal agreement among the three models in how they flow upon scaling.
arXiv Detail & Related papers (2023-08-07T13:46:45Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST), a recently proposed and highly effective technique for distributed training of large models.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
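A toy illustration of the partition-train-merge pattern behind IST, kept to least squares so it is self-contained; the two-worker split, step counts, and model are illustrative assumptions rather than the paper's setting.

```python
import numpy as np

def ist_round(w, X, y, blocks, lr=0.1, local_steps=5):
    """One round of toy Independent Subnetwork Training on least
    squares: each 'worker' owns a disjoint block of coordinates, takes
    local gradient steps on that block alone, and the blocks are then
    merged. Only the owned parameters are communicated, not gradients."""
    new_w = w.copy()
    for block in blocks:                  # conceptually runs in parallel
        wb = w.copy()                     # every worker starts from w
        for _ in range(local_steps):
            grad = X.T @ (X @ wb - y) / len(y)
            wb[block] -= lr * grad[block] # update only the owned block
        new_w[block] = wb[block]          # merge the worker's block back
    return new_w

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 6))
w_true = rng.normal(size=6)
y = X @ w_true
w = np.zeros(6)
blocks = [np.arange(0, 3), np.arange(3, 6)]  # two workers, disjoint blocks
for _ in range(50):
    w = ist_round(w, X, y, blocks)
print(np.round(np.abs(w - w_true).max(), 3))  # error shrinks toward zero
```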
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- On Optimal Caching and Model Multiplexing for Large Model Inference [66.50550915522551]
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges.
We study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
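A minimal sketch of the combined control flow, with a plain dict as the cache and a length heuristic standing in for the learned multiplexer; all names and rules here are illustrative assumptions.

```python
def serve(query, cache, small_model, large_model, is_hard):
    """Toy cache + model multiplexer: answer from the cache when
    possible, otherwise route the query to a small or large model
    based on a difficulty predictor, then cache the result."""
    if query in cache:
        return cache[query]              # cache hit: zero model cost
    model = large_model if is_hard(query) else small_model
    answer = model(query)
    cache[query] = answer                # store for repeated queries
    return answer

cache = {}
small = lambda q: f"small:{q}"
large = lambda q: f"large:{q}"
hard = lambda q: len(q) > 10             # stand-in difficulty signal
print(serve("short", cache, small, large, hard))    # routed to small
print(serve("a much longer query", cache, small, large, hard))  # large
print(serve("short", cache, small, large, hard))    # served from cache
```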
arXiv Detail & Related papers (2023-06-03T05:01:51Z)
- Quantized Adaptive Subgradient Algorithms and Their Applications [39.103587572626026]
We propose quantized composite mirror descent adaptive subgradient (QCMD adagrad) and quantized regularized dual average adaptive subgradient (QRDA adagrad) for distributed training.
A quantized gradient-based adaptive learning rate matrix is constructed to achieve a balance between communication costs, accuracy, and model sparsity.
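A rough sketch pairing a uniform quantizer with an AdaGrad-style accumulator, which plays the role of the adaptive learning-rate matrix; the quantizer and hyperparameters are illustrative stand-ins, not the QCMD/QRDA algorithms themselves.

```python
import numpy as np

def quantize(g, levels=16):
    """Uniform deterministic quantizer: snap each coordinate of g to
    the nearest of `levels` evenly spaced values; transmitting grid
    indices instead of floats is what saves communication."""
    step = 2 * np.abs(g).max() / (levels - 1) + 1e-12
    return np.round(g / step) * step

def quantized_adagrad_step(w, g, accum, lr=0.5, eps=1e-8):
    """One adaptive-subgradient step on a quantized gradient."""
    q = quantize(g)
    accum += q ** 2                       # per-coordinate curvature proxy
    w -= lr * q / (np.sqrt(accum) + eps)
    return w, accum

target = np.array([1.0, -2.0, 0.5, 0.0])
w, accum = np.zeros(4), np.zeros(4)
for _ in range(200):
    g = w - target                        # gradient of 0.5*||w - target||^2
    w, accum = quantized_adagrad_step(w, g, accum)
print(np.round(w, 2))                     # approximately the target
```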
arXiv Detail & Related papers (2022-08-11T04:04:03Z)
- DualCF: Efficient Model Extraction Attack from Counterfactual Explanations [57.46134660974256]
Cloud service providers have launched Machine-Learning-as-a-Service platforms to allow users to access large-scale cloud-based models via APIs.
The extra information exposed by counterfactual explanations inevitably makes these cloud models more vulnerable to extraction attacks.
We propose a simple yet efficient querying strategy that greatly improves the efficiency of stealing a classification model.
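A toy, one-dimensional illustration of why counterfactual explanations make extraction cheap: each query returns a point just across the decision boundary with the opposite label, so a handful of queries pin the boundary down. The API, boundary value, and surrogate rule are all hypothetical.

```python
import numpy as np

def target_api(x, boundary=0.37):
    """Stand-in MLaaS classifier that also returns a counterfactual:
    the nearest input that would receive the opposite label."""
    label = int(x > boundary)
    cf = boundary + (1e-3 if label == 0 else -1e-3)
    return label, cf

def extract(n_queries=8, seed=0):
    """Each query yields (x, label) plus (counterfactual, flipped
    label); the counterfactuals hug the boundary, so the surrogate's
    threshold converges with very few queries."""
    rng = np.random.default_rng(seed)
    pts, labels = [], []
    for _ in range(n_queries):
        x = rng.uniform(0, 1)
        label, cf = target_api(x)
        pts += [x, cf]
        labels += [label, 1 - label]
    pts, labels = np.array(pts), np.array(labels)
    # Surrogate boundary: midpoint between closest opposite-label points.
    return (pts[labels == 0].max() + pts[labels == 1].min()) / 2

print(round(extract(), 3))   # recovers a value very close to 0.37
```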
arXiv Detail & Related papers (2022-05-13T08:24:43Z)
- ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training [65.68511423300812]
We propose ProgFed, a progressive training framework for efficient and effective federated learning.
ProgFed inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models.
Our results show that ProgFed converges at the same rate as standard training on full models.
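A sketch of the progressive-growth idea reduced to a schedule over rounds; stage counts and depths are illustrative, and the actual method grows a real model (and its update payloads) rather than printing a plan.

```python
def progfed_schedule(total_rounds, num_stages, full_depth):
    """Toy ProgFed-style schedule: train a shallow prefix of the model
    first and progressively activate deeper layers, so early rounds
    compute less and communicate smaller updates."""
    rounds_per_stage = total_rounds // num_stages
    plan = []
    for r in range(total_rounds):
        stage = min(r // rounds_per_stage, num_stages - 1)
        depth = (stage + 1) * full_depth // num_stages
        plan.append(depth)   # layers trained/communicated this round
    return plan

print(progfed_schedule(total_rounds=12, num_stages=4, full_depth=8))
# -> [2, 2, 2, 4, 4, 4, 6, 6, 6, 8, 8, 8]
```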
arXiv Detail & Related papers (2021-10-11T14:45:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.