Model Cascading for Code: Reducing Inference Costs with Model Cascading for LLM Based Code Generation
- URL: http://arxiv.org/abs/2405.15842v1
- Date: Fri, 24 May 2024 16:20:04 GMT
- Title: Model Cascading for Code: Reducing Inference Costs with Model Cascading for LLM Based Code Generation
- Authors: Boyuan Chen, Mingzhi Zhu, Brendan Dolan-Gavitt, Muhammad Shafique, Siddharth Garg
- Abstract summary: We propose letting each model generate and execute a set of test cases for its solutions, and using the test results as the cascading threshold.
We show that our model cascading strategy reduces computational costs while increasing accuracy compared to generating the output with a single model.
- Score: 20.445496441396028
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid development of large language models (LLMs) has led to significant advancements in code completion tasks. While larger models have higher accuracy, they also cost much more to run. Meanwhile, model cascading has proven effective at conserving computational resources while enhancing accuracy for LLMs on natural language generation tasks. It generates output with the smallest model in a set, and queries the larger models only when the output fails to meet predefined quality criteria. However, this strategy has not been applied to code completion tasks, primarily because assessing the quality of code completions differs substantially from assessing natural language: the former relies heavily on functional correctness. To address this, we propose letting each model generate and execute a set of test cases for its solutions, and using the test results as the cascading threshold. We show that our model cascading strategy reduces computational costs while increasing accuracy compared to generating the output with a single model. We also introduce a heuristic to determine the optimal combination of the number of solutions, test cases, and test lines each model should generate, based on the budget. Compared to speculative decoding, our method works on black-box models, achieving the same level of cost-accuracy trade-off while offering many more choices based on the server's budget. Ours is the first work to optimize the cost-accuracy trade-off for LLM code generation with model cascading.
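To make the cascading criterion concrete, the following is a minimal Python sketch of the loop described in the abstract: each model proposes candidate solutions and its own test cases, and the cascade escalates to the next larger model only when no candidate passes its self-generated tests. The model names, the `generate()` helper, and the per-model solution/test counts are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of test-driven model cascading for code generation.
# generate(), the model names, and the counts below are placeholders.
import subprocess
import tempfile

def generate(model_name: str, prompt: str, n: int) -> list[str]:
    """Placeholder: query `model_name` for n candidate completions."""
    raise NotImplementedError

def passes_self_tests(solution: str, tests: str, timeout: float = 5.0) -> bool:
    """Run the model-generated assert-style tests against one candidate solution."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def cascade(prompt: str,
            models=("small-code-llm", "medium-code-llm", "large-code-llm"),
            n_solutions: int = 3, n_tests: int = 2) -> str | None:
    """Try models from cheapest to most expensive; accept the first
    solution that passes its own generated test cases."""
    for model in models:
        solutions = generate(model, prompt, n_solutions)
        tests = generate(model, f"Write assert-based tests for:\n{prompt}", n_tests)
        for sol in solutions:
            if any(passes_self_tests(sol, t) for t in tests):
                return sol  # quality threshold met; stop escalating
    return None  # even the largest model failed its own tests
```

In this sketch, `n_solutions` and `n_tests` stand in for the per-model budget that the paper's heuristic would choose; the key point is that escalation to a larger (more expensive) model happens only when no candidate clears the self-generated test threshold.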
Related papers
- Decoding-Time Language Model Alignment with Multiple Objectives [116.42095026960598]
Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives.
Here, we propose $\textbf{multi-objective decoding (MOD)}$, a decoding-time algorithm that outputs the next token from a linear combination of predictions.
We show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method.
arXiv Detail & Related papers (2024-06-27T02:46:30Z) - Improving Large Models with Small models: Lower Costs and Better Performance [81.55672406002715]
We propose Data Shunt+ (DS+), a general paradigm for collaboration of small and large models.
For instance, ChatGPT achieves an accuracy of 94.43% on Amazon Product sentiment analysis, and DS+ achieves an accuracy of 95.64%, while the cost has been reduced to only 31.18%.
arXiv Detail & Related papers (2024-06-15T14:44:43Z) - Enhancing Code Generation Performance of Smaller Models by Distilling the Reasoning Ability of LLMs [36.409470894115074]
We propose the CodePLAN framework, which aims to transfer LLMs' code generation reasoning capabilities to smaller models.
Our approach improves the smaller model's code generation performance by over 130% on the challenging APPS benchmark.
arXiv Detail & Related papers (2024-03-20T03:09:54Z) - $C^3$: Confidence Calibration Model Cascade for Inference-Efficient
Cross-Lingual Natural Language Understanding [28.853593305486832]
Cross-lingual natural language understanding (NLU) is a critical task in natural language processing (NLP).
Recent advancements have seen multilingual pre-trained language models (mPLMs) significantly enhance the performance of these tasks.
Existing model cascade methods seek to enhance inference efficiency by greedily selecting the lightest model capable of processing the current input from a variety of models.
arXiv Detail & Related papers (2024-02-25T05:07:56Z) - Cascade Speculative Drafting for Even Faster LLM Inference [25.642604897018852]
Speculative decoding improves the efficiency of large language model (LLM) inference.
We introduce Cascade Speculative Drafting (CS Drafting), a speculative execution algorithm that incorporates two types of cascades.
CS Drafting achieves up to an 81 percent additional speedup over speculative decoding in our experiments.
arXiv Detail & Related papers (2023-12-18T18:59:46Z) - LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z) - Modeling Choice via Self-Attention [8.394221523847325]
We show that our attention-based choice model is a low-rank generalization of the Halo Multinomial Logit (Halo-MNL) model.
We also establish the first realistic-scale benchmark for choice estimation on real data, conducting an evaluation of existing models.
arXiv Detail & Related papers (2023-11-11T11:13:07Z) - Quality-Aware Translation Models: Efficient Generation and Quality Estimation in a Single Model [77.19693792957614]
We propose to make neural machine translation (NMT) models quality-aware by training them to estimate the quality of their own output.
We obtain quality gains similar to, or even better than, quality reranking approaches, but with the efficiency of single-pass decoding.
arXiv Detail & Related papers (2023-10-10T15:33:51Z) - Pruning Large Language Models via Accuracy Predictor [0.0]
Large language models (LLMs) containing tens of billions of parameters (or even more) have demonstrated impressive capabilities in various NLP tasks.
We propose a novel pruning approach: first, a training set of architecture-accuracy pairs is established, and then a non-neural model is trained as an accuracy predictor.
arXiv Detail & Related papers (2023-09-18T06:38:24Z) - Precision-Recall Divergence Optimization for Generative Modeling with
GANs and Normalizing Flows [54.050498411883495]
We develop a novel training method for generative models, such as Generative Adversarial Networks and Normalizing Flows.
We show that achieving a specified precision-recall trade-off corresponds to minimizing a unique $f$-divergence from a family we call the PR-divergences.
Our approach improves the performance of existing state-of-the-art models like BigGAN in terms of either precision or recall when tested on datasets such as ImageNet.
arXiv Detail & Related papers (2023-05-30T10:07:17Z) - The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit".
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)