Agreement-Based Cascading for Efficient Inference
- URL: http://arxiv.org/abs/2407.02348v2
- Date: Fri, 06 Dec 2024 20:21:37 GMT
- Title: Agreement-Based Cascading for Efficient Inference
- Authors: Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith
- Abstract summary: Agreement-Based Cascading (ABC) is a simple, effective adaptive inference technique.
ABC builds a cascade of models of increasing size/complexity, and uses agreement between ensembles of models at each level of the cascade as a basis for data-dependent routing.
We show that ABC can reliably act as a drop-in replacement for existing models and surpass the best single model it aims to replace in terms of both efficiency and accuracy.
- Score: 32.914852531806
- Abstract: Adaptive inference schemes reduce the cost of machine learning inference by assigning smaller models to easier examples, attempting to avoid invocation of larger models when possible. In this work we explore a simple, effective adaptive inference technique we term Agreement-Based Cascading (ABC). ABC builds a cascade of models of increasing size/complexity, and uses agreement between ensembles of models at each level of the cascade as a basis for data-dependent routing. Although ensemble execution introduces additional expense, we show that these costs can be easily offset in practice due to large expected differences in model sizes, parallel inference execution capabilities, and accuracy benefits of ensembling. We examine ABC theoretically and empirically in terms of these parameters, showing that the approach can reliably act as a drop-in replacement for existing models and surpass the best single model it aims to replace in terms of both efficiency and accuracy. Additionally, we explore the performance of ABC relative to existing cascading methods in three common scenarios: (1) edge-to-cloud inference, where ABC reduces communication costs by up to 14x; (2) cloud-based model serving, where it achieves a 3x reduction in rental costs; and (3) inference via model API services, where ABC achieves a 2-25x reduction in average price per token/request relative to state-of-the-art LLM cascades.
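The routing rule described in the abstract can be made concrete with a short sketch. The following is a minimal illustration only, assuming a classification setting with hashable predictions and a simple majority-vote agreement rule; the function `abc_predict`, its parameters, and the default threshold are illustrative assumptions, not the authors' implementation.
```python
from collections import Counter
from typing import Any, Callable, Sequence

# Minimal sketch of Agreement-Based Cascading (ABC), based only on the
# abstract's description. `cascade` is a list of ensembles ordered from
# smallest/cheapest to largest/most accurate; each ensemble is a list of
# callables mapping an input to a (hashable) prediction.
def abc_predict(
    cascade: Sequence[Sequence[Callable[[Any], Any]]],
    x: Any,
    agreement_threshold: float = 1.0,  # fraction of members that must agree (assumed value)
) -> Any:
    for level, ensemble in enumerate(cascade):
        preds = [model(x) for model in ensemble]         # members could run in parallel
        label, votes = Counter(preds).most_common(1)[0]  # majority prediction and its count
        is_last_level = level == len(cascade) - 1
        if is_last_level or votes / len(preds) >= agreement_threshold:
            return label  # sufficient agreement (or final level): accept and stop
        # otherwise, defer the example to the next (larger, more accurate) level
    raise ValueError("cascade must contain at least one ensemble")
```
Because the lower-level models are expected to be much smaller than the final model and their members can be executed in parallel, the abstract argues that the extra cost of running an ensemble at each level is easily offset by avoiding the largest model on easy examples.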
Related papers
- Translate Smart, not Hard: Cascaded Translation Systems with Quality-Aware Deferral [28.382040322550775]
We propose a simple yet effective approach for machine translation using existing quality estimation (QE) metrics as deferral rules.
We show that QE-based deferral allows a cascaded system to match the performance of a larger model while invoking it for only a small fraction of inputs; a minimal deferral sketch appears after this list.
arXiv Detail & Related papers (2025-02-18T10:05:40Z) - FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing [17.01412432658081]
Large language models (LLMs) have demonstrated superior performance across various tasks, consistent with scaling laws.
We propose a fine-grained token-wise pruning approach for LLMs, which uses a learnable router to adaptively identify less important tokens.
Our approach achieves state-of-the-art (SOTA) pruning results, surpassing existing pruning methods.
arXiv Detail & Related papers (2024-12-16T07:09:46Z) - Faster Cascades via Speculative Decoding [66.16909847419198]
Cascades and speculative decoding are approaches to improving language models' inference efficiency.
We propose new speculative cascading techniques that implement their deferral rule through speculative execution.
We show that our approach yields better cost-quality trade-offs than cascading and speculative decoding baselines.
arXiv Detail & Related papers (2024-05-29T16:55:08Z) - Transferable and Principled Efficiency for Open-Vocabulary Segmentation [82.66423763561697]
The recent success of pre-trained vision-language foundation models makes Open-Vocabulary Segmentation (OVS) possible.
This approach, however, introduces heavy computational overhead due to two challenges: 1) the large model size of the backbone; 2) the expensive cost of fine-tuning.
In this paper, we aim to achieve performance that is comparable to or even better than prior OVS works based on large vision-language foundation models.
arXiv Detail & Related papers (2024-04-11T03:08:53Z) - Cabrita: closing the gap for foreign languages [0.0]
The strategy of training a model from scratch in a specific language or domain serves two essential purposes.
The main solution to overcoming the cost challenge is to rely on available pre-trained models.
We present a methodology named Cabrita, which successfully addresses the problems of performance and efficient tokenization.
arXiv Detail & Related papers (2023-08-23T02:49:35Z) - Systematic compactification of the two-channel Kondo model. II. Comparative study of scaling and universality [44.99833362998488]
We study scaling using Anderson's simple poor man's scaling procedure.
We unveil a universal agreement among the three models in how they flow upon scaling.
arXiv Detail & Related papers (2023-08-07T13:46:45Z) - Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for reducing the communication and memory costs of distributed training.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z) - On Optimal Caching and Model Multiplexing for Large Model Inference [66.50550915522551]
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges.
We study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
arXiv Detail & Related papers (2023-06-03T05:01:51Z) - Quantized Adaptive Subgradient Algorithms and Their Applications [39.103587572626026]
We propose quantized composite mirror descent adaptive subgradient (QCMD adagrad) and quantized regularized dual average adaptive subgradient (QRDA adagrad) for distributed training.
A quantized gradient-based adaptive learning rate matrix is constructed to achieve a balance between communication costs, accuracy, and model sparsity.
arXiv Detail & Related papers (2022-08-11T04:04:03Z) - DualCF: Efficient Model Extraction Attack from Counterfactual
Explanations [57.46134660974256]
Cloud service providers have launched Machine-Learning-as-a-Service platforms that allow users to access large-scale cloud-based models via APIs.
The extra information exposed by counterfactual explanations inevitably makes the cloud models more vulnerable to extraction attacks.
We propose a novel, simple yet efficient querying strategy that greatly improves the efficiency of stealing a classification model.
arXiv Detail & Related papers (2022-05-13T08:24:43Z) - ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training [65.68511423300812]
We propose ProgFed, a progressive training framework for efficient and effective federated learning.
ProgFed inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models.
Our results show that ProgFed converges at the same rate as standard training on full models.
arXiv Detail & Related papers (2021-10-11T14:45:00Z)
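The quality-aware deferral rule from the "Translate Smart, not Hard" entry above (referenced there) can likewise be sketched in a few lines. This is a hypothetical illustration rather than that paper's code: `small_translate`, `large_translate`, and `qe_score` are placeholder callables (a small MT model, a large MT model, and a reference-free quality-estimation metric), and the threshold value is arbitrary.
```python
from typing import Callable

# Hypothetical two-model translation cascade with a QE-based deferral rule,
# sketched from the "Translate Smart, not Hard" summary above.
def cascaded_translate(
    source: str,
    small_translate: Callable[[str], str],   # cheap model (placeholder)
    large_translate: Callable[[str], str],   # expensive model (placeholder)
    qe_score: Callable[[str, str], float],   # reference-free QE metric (placeholder)
    qe_threshold: float = 0.85,              # assumed operating point
) -> str:
    draft = small_translate(source)              # cheap first-pass translation
    if qe_score(source, draft) >= qe_threshold:  # QE metric acts as the deferral rule
        return draft                             # quality looks sufficient: keep the cheap output
    return large_translate(source)               # otherwise defer to the larger model
```
Only the examples whose drafts fall below the QE threshold incur the cost of the larger model, which is how such a cascade can approach the large model's quality while invoking it on a small fraction of inputs.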