Model Cascading for Code: A Cascaded Black-Box Multi-Model Framework for Cost-Efficient Code Completion with Self-Testing
- URL: http://arxiv.org/abs/2405.15842v2
- Date: Thu, 13 Feb 2025 20:41:01 GMT
- Title: Model Cascading for Code: A Cascaded Black-Box Multi-Model Framework for Cost-Efficient Code Completion with Self-Testing
- Authors: Boyuan Chen, Mingzhi Zhu, Brendan Dolan-Gavitt, Muhammad Shafique, Siddharth Garg
- Abstract summary: We introduce a novel framework combining model cascading and inference-time self-testing algorithms to find multiple near-optimal self-testing options on the cost-accuracy tradeoff.
Our approach leverages self-generated tests to both enhance accuracy and evaluate model cascading decisions.
Experimental results show that our cascading approach reduces costs by an average of 26%, and up to 70% in the best case.
- Score: 20.445496441396028
- Abstract: The rapid advancement of large language models (LLMs) has significantly improved code completion tasks, yet the trade-off between accuracy and computational cost remains a critical challenge. While using larger models and incorporating inference-time self-testing algorithms can significantly improve output accuracy, they also incur substantial computational expense. Furthermore, servers in real-world scenarios usually have dynamic preferences over the cost-accuracy tradeoff, depending on the budget, bandwidth, concurrent user volume, and users' sensitivity to wrong answers. In this work, we introduce a novel framework combining model cascading and inference-time self-feedback algorithms to find multiple near-optimal self-testing options on the cost-accuracy tradeoff in LLM-based code generation. Our approach leverages self-generated tests to both enhance accuracy and evaluate model cascading decisions. As a black-box inference-time method, it requires no access to internal model parameters. We further propose a threshold-based algorithm to determine when to deploy larger models, and a heuristic to optimize the number of solutions, test cases, and test lines generated per model under budget constraints. Experimental results show that our cascading approach reduces costs by an average of 26%, and up to 70% in the best case, across various model families and datasets, while maintaining or improving accuracy in natural language generation tasks compared to both random and optimal single-model self-testing schemes. To our knowledge, this is the first work to provide a series of choices for optimizing the cost-accuracy trade-off in LLM code generation with self-testing.
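To make the cascading decision concrete, here is a minimal sketch of a threshold-based cascade with self-generated tests, assuming a list of models ordered from cheapest to most expensive. The `generate_solutions` / `generate_tests` interfaces, the toy `exec`-based test runner, and the 0.8 pass-rate threshold are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of threshold-based model cascading with self-generated tests.
# Model interfaces, parameters, and the test runner are illustrative assumptions,
# not the authors' implementation.

def run_tests(solution: str, tests: list[str]) -> float:
    """Fraction of self-generated test snippets that pass when executed
    after the candidate solution (toy sandbox via exec)."""
    passed = 0
    for test in tests:
        env: dict = {}
        try:
            exec(solution, env)   # define the candidate function(s)
            exec(test, env)       # run one assert-style test against them
            passed += 1
        except Exception:
            pass
    return passed / len(tests) if tests else 0.0


def cascade_complete(prompt: str, models: list, n_solutions: int = 5,
                     n_tests: int = 5, threshold: float = 0.8) -> str:
    """Query models from cheapest to most expensive; escalate to the next
    (larger) model only while the best candidate's pass rate on the
    self-generated tests stays below `threshold`."""
    best_score, best_solution = -1.0, ""
    for model in models:  # assumed ordered small -> large
        solutions = model.generate_solutions(prompt, n=n_solutions)  # hypothetical API
        tests = model.generate_tests(prompt, n=n_tests)              # hypothetical API
        for sol in solutions:
            score = run_tests(sol, tests)
            if score > best_score:
                best_score, best_solution = score, sol
        if best_score >= threshold:   # confident enough: stop early and save cost
            return best_solution
    return best_solution              # otherwise return the best candidate seen
```

In the paper's framework, the per-model choice of `n_solutions`, `n_tests`, and the number of test lines is itself tuned by the budget-aware heuristic mentioned in the abstract; the fixed values above are placeholders.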
Related papers
- Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs).
RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness.
RSD delivers significant efficiency gains against decoding with the target model only, while achieving significantly better accuracy than parallel decoding methods on average.
arXiv Detail & Related papers (2025-01-31T17:19:57Z) - Reqo: A Robust and Explainable Query Optimization Cost Model [2.184775414778289]
We propose a tree model architecture based on Bidirectional Graph Neural Networks (Bi-GNN) aggregated by Gated Recurrent Units (GRUs)
We implement a novel learning-to-rank cost model that effectively quantifies the uncertainty in cost estimates using approximate probabilistic ML.
In addition, we propose the first explainability technique specifically designed for learning-based cost models.
arXiv Detail & Related papers (2025-01-29T04:48:51Z) - A hybrid framework for effective and efficient machine unlearning [12.499101994047862]
Machine unlearning (MU) is proposed to remove the imprints of revoked samples from the already trained model parameters.
We present a novel hybrid strategy, built on top of existing unlearning approaches, to achieve overall success.
arXiv Detail & Related papers (2024-12-19T03:59:26Z) - Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference [55.150117654242706]
We show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU.
As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty.
arXiv Detail & Related papers (2024-11-01T21:11:48Z) - Towards Stable Machine Learning Model Retraining via Slowly Varying Sequences [6.067007470552307]
We propose a model-agnostic framework for finding sequences of models that are stable across retraining iterations.
We develop a mixed-integer optimization formulation that is guaranteed to recover optimal models.
We find that, on average, a 2% reduction in predictive power leads to a 30% improvement in stability.
arXiv Detail & Related papers (2024-03-28T22:45:38Z) - Precision-Recall Divergence Optimization for Generative Modeling with GANs and Normalizing Flows [54.050498411883495]
We develop a novel training method for generative models, such as Generative Adversarial Networks and Normalizing Flows.
We show that achieving a specified precision-recall trade-off corresponds to minimizing a unique $f$-divergence from a family we call the PR-divergences.
Our approach improves the performance of existing state-of-the-art models like BigGAN in terms of either precision or recall when tested on datasets such as ImageNet.
arXiv Detail & Related papers (2023-05-30T10:07:17Z) - Modeling the Second Player in Distributionally Robust Optimization [90.25995710696425]
We argue for the use of neural generative models to characterize the worst-case distribution.
This approach poses a number of implementation and optimization challenges.
We find that the proposed approach yields models that are more robust than comparable baselines.
arXiv Detail & Related papers (2021-03-18T14:26:26Z) - AutoSimulate: (Quickly) Learning Synthetic Data Generation [70.82315853981838]
We propose an efficient alternative for optimal synthetic data generation based on a novel differentiable approximation of the objective.
We demonstrate that the proposed method finds the optimal data distribution faster (up to $50\times$), with significantly reduced training data generation (up to $30\times$) and better accuracy ($+8.7\%$) on real-world test datasets than previous methods.
arXiv Detail & Related papers (2020-08-16T11:36:11Z) - The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from neural network calculations for simple instances (a minimal sketch follows this list).
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
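The early-exit idea in the last entry ("The Right Tool for the Job") can be sketched as follows. This is a generic illustration, not that paper's code: the layer and exit-head interfaces, the single-example batch, and the 0.9 confidence threshold are all assumptions chosen for exposition.

```python
# Generic sketch of confidence-based early exit: a small classifier after each
# layer lets easy inputs leave the network early, while hard inputs use all
# layers. Interfaces and the threshold are assumptions for illustration.

import torch.nn.functional as F

def early_exit_forward(layers, exit_heads, x, threshold=0.9):
    """Run `layers` in order; after each one, the matching exit head produces
    class logits, and we stop once softmax confidence exceeds `threshold`.
    Assumes batch size 1."""
    prediction, depth = None, -1
    hidden = x
    for depth, (layer, head) in enumerate(zip(layers, exit_heads)):
        hidden = layer(hidden)
        probs = F.softmax(head(hidden), dim=-1)
        confidence, prediction = probs.max(dim=-1)
        if confidence.item() >= threshold:   # easy instance: exit early, save compute
            return prediction, depth
    return prediction, depth                 # hard instance: used every layer
```

The same cost-accuracy dial appears in both settings: the early-exit threshold here plays the role that the pass-rate threshold plays in the cascading sketch above.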