The Larger the Better? Improved LLM Code-Generation via Budget Reallocation
- URL: http://arxiv.org/abs/2404.00725v2
- Date: Thu, 25 Jul 2024 11:37:54 GMT
- Title: The Larger the Better? Improved LLM Code-Generation via Budget Reallocation
- Authors: Michael Hassid, Tal Remez, Jonas Gehring, Roy Schwartz, Yossi Adi
- Abstract summary: It is a common belief that large language models (LLMs) are better than smaller-sized ones.
This begs the question: what happens when both models operate under the same budget?
We analyze code generation LLMs of various sizes and make comparisons such as running a 70B model once vs. generating five outputs from a 13B model.
- Score: 32.0844209512788
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: It is a common belief that large language models (LLMs) are better than smaller-sized ones. However, larger models also require significantly more time and compute during inference. This begs the question: what happens when both models operate under the same budget (e.g., compute, run-time)? To address this question, we analyze code generation LLMs of various sizes and make comparisons such as running a 70B model once vs. generating five outputs from a 13B model. We consider a standard unit-test setup, which can be used to select the correct output from the smaller model. Our findings reveal that the repeated use of smaller models can yield consistent improvements, with gains of up to 15% across five tasks. On the other hand, in scenarios where unit-tests are unavailable, a ranking-based selection of candidates from the smaller model falls short of the performance of a single output from larger ones. Our results highlight the potential of using smaller models instead of larger ones, and the importance of studying approaches for ranking LLM outputs.
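As a rough illustration of the setup compared in the abstract, a best-of-k loop with unit-test filtering might look like the sketch below; the `generate_candidate` and `passes_unit_tests` helpers are hypothetical placeholders, not code from the paper.
```python
# Illustrative sketch of budget reallocation: spend the inference budget of one
# large-model call on k smaller-model samples and let the unit tests pick a winner.
# `generate_candidate` and `passes_unit_tests` are hypothetical placeholders.
from typing import Callable, Optional

def best_of_k_with_unit_tests(
    generate_candidate: Callable[[str], str],    # one sample from the smaller model
    passes_unit_tests: Callable[[str], bool],    # True if the program passes all provided tests
    prompt: str,
    k: int = 5,                                  # e.g., five 13B samples vs. one 70B sample
) -> Optional[str]:
    """Return the first of k sampled programs that passes the unit tests, else None."""
    for _ in range(k):
        program = generate_candidate(prompt)
        if passes_unit_tests(program):
            return program
    return None  # no candidate passed; a caller might then fall back to a larger model
```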
Related papers
- How do Scaling Laws Apply to Knowledge Graph Engineering Tasks? The Impact of Model Size on Large Language Model Performance [4.388282062290401]
We explore the model size scaling laws specific to Knowledge Graph Engineering (KGE) tasks.
In some cases, plateau or ceiling effects occurred, i.e., task performance did not change much between a model and the next larger one.
Within the same model family, larger models sometimes performed worse than smaller ones.
arXiv Detail & Related papers (2025-05-22T06:21:40Z)
- Cross-model Control: Improving Multiple Large Language Models in One-time Training [34.98931804630706]
Cross-model Control (CMC) is a method that improves multiple large language models in one-time training.
We incorporate a tiny language model with a minimal number of parameters.
We propose a novel token mapping strategy named PM-MinED to make this tiny language model applicable to models with different vocabularies.
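The summary above can be pictured as adding a tiny model's logit adjustments to a frozen larger model through a vocabulary mapping; the sketch below is a loose illustration of that idea, with `map_token_id`, `delta_logits`, and `alpha` standing in as assumptions rather than reproducing PM-MinED or the paper's training procedure.
```python
# Loose illustration of steering a frozen model with a tiny delta model whose
# vocabulary may differ. `map_token_id` is a placeholder for a token-mapping
# strategy such as PM-MinED; the real method is described in the paper.
import numpy as np

def combine_logits(base_logits: np.ndarray,   # logits of the frozen larger model, shape (vocab,)
                   delta_logits: np.ndarray,  # adjustment logits from the tiny model
                   map_token_id,              # maps a base-vocab id to a tiny-vocab id (or None)
                   alpha: float = 1.0) -> np.ndarray:
    """Add the tiny model's adjustment to each base-vocabulary logit it can be mapped to."""
    combined = base_logits.copy()
    for base_id in range(base_logits.shape[-1]):
        tiny_id = map_token_id(base_id)
        if tiny_id is not None:
            combined[base_id] += alpha * delta_logits[tiny_id]
    return combined
```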
arXiv Detail & Related papers (2024-10-23T06:52:09Z)
- Nudging: Inference-time Alignment of LLMs via Guided Decoding [18.530367090350605]
Large language models (LLMs) require alignment to effectively and safely follow user instructions.
This process requires training an aligned version for every base model, resulting in significant computational overhead.
We propose NUDGING, a training-free algorithm that aligns any base model at inference time using a small aligned model.
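A rough way to picture the guided decoding described above: when the base model is uncertain about the next token, take the token proposed by the small aligned model instead. The threshold rule and helper functions below are assumptions for illustration, not the exact criterion used in NUDGING.
```python
# Rough sketch of inference-time guidance: defer to a small aligned model whenever
# the base model's top-token probability falls below a threshold. The helpers
# `base_next_token` and `aligned_next_token` are hypothetical placeholders that
# return (token, probability) for a given prefix.
def guided_decode(prompt, base_next_token, aligned_next_token,
                  threshold=0.5, max_tokens=128, eos="</s>"):
    tokens = []
    prefix = prompt
    for _ in range(max_tokens):
        token, prob = base_next_token(prefix)
        if prob < threshold:                       # base model is uncertain here
            token, _ = aligned_next_token(prefix)  # let the aligned model nudge the output
        if token == eos:
            break
        tokens.append(token)
        prefix += token
    return "".join(tokens)
```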
arXiv Detail & Related papers (2024-10-11T23:24:38Z)
- Large Language Model Pruning [0.0]
We suggest a model pruning technique specifically focused on LLMs.
The proposed methodology emphasizes the explainability of deep learning models.
We also explore the difference between pruning on large-scale models vs. pruning on small-scale models.
arXiv Detail & Related papers (2024-05-24T18:22:15Z)
- Model Cascading for Code: Reducing Inference Costs with Model Cascading for LLM Based Code Generation [20.445496441396028]
We propose letting each model generate and execute a set of test cases for their solutions, and use the test results as the cascading threshold.
We show that our model cascading strategy reduces computational costs while increasing accuracy compared to generating the output with a single model.
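The cascading rule above can be sketched as trying models from smallest to largest and stopping once a candidate solution passes a large enough fraction of its self-generated tests; the helper functions and the pass-rate threshold here are illustrative assumptions, not the paper's exact procedure.
```python
# Illustrative cascade: try models from smallest to largest and stop once a
# solution passes enough of its self-generated tests. `generate_solution_and_tests`
# and `pass_rate` are hypothetical placeholders, and 0.9 is an arbitrary threshold.
def cascade(models, prompt, generate_solution_and_tests, pass_rate, threshold=0.9):
    best = None
    for model in models:                                   # ordered small -> large
        solution, tests = generate_solution_and_tests(model, prompt)
        score = pass_rate(solution, tests)                 # fraction of tests passed
        best = solution
        if score >= threshold:                             # confident enough: stop early
            return solution
    return best                                            # otherwise keep the largest model's answer
```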
arXiv Detail & Related papers (2024-05-24T16:20:04Z)
- Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models [56.02275285521847]
We propose to evaluate models using a Panel of LLM evaluators (PoLL).
We find that using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.
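The panel idea above amounts to pooling verdicts from several smaller judge models, for example by majority vote; the sketch below assumes each judge returns a binary verdict, which is a simplification of the scoring used in practice.
```python
# Simplified panel-of-judges aggregation: each small judge model returns a binary
# verdict and the panel decides by majority vote. `judges` would be callables
# wrapping models from different families; they are placeholders here.
def panel_verdict(judges, question, answer):
    votes = [judge(question, answer) for judge in judges]  # each vote is True/False
    return sum(votes) > len(votes) / 2                     # majority wins
```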
arXiv Detail & Related papers (2024-04-29T15:33:23Z)
- Skill over Scale: The Case for Medium, Domain-Specific Models for SE [4.2630881518611226]
We show that modestly sized domain-specific models can outperform much larger ones on code labeling tasks.
We train two models, SOBertBase (125M parameters) and SOBertLarge (762M parameters), at budgets of just $374 and $1600, respectively.
Results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
arXiv Detail & Related papers (2023-06-05T21:38:30Z)
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models.
We present three findings across 4 NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z)
- Specializing Smaller Language Models towards Multi-Step Reasoning [56.78474185485288]
We show that multi-step reasoning abilities can be distilled down from GPT-3.5 (≥ 175B) to T5 variants (≤ 11B).
We propose model specialization, to specialize the model's ability towards a target task.
arXiv Detail & Related papers (2023-01-30T08:51:19Z)
- Predicting on the Edge: Identifying Where a Larger Model Does Better [61.793778186198864]
We show that large models have the largest improvement on examples where the small model is most uncertain.
We show that a switcher model which defers examples to a larger model when a small model is uncertain can achieve striking improvements in performance and resource usage.
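The switcher described above can be read as a simple confidence-based router: keep the small model's prediction when it is confident and pay for the large model only otherwise. The confidence measure and threshold below are illustrative assumptions.
```python
# Illustrative switcher: use the small model's prediction when its confidence is
# high and defer to the large model otherwise. `small_model` and `large_model`
# are placeholders returning (prediction, confidence) and a prediction, respectively.
def switcher_predict(x, small_model, large_model, threshold=0.8):
    prediction, confidence = small_model(x)
    if confidence >= threshold:
        return prediction              # cheap path: small model is confident
    return large_model(x)              # expensive path: defer the hard example
```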
arXiv Detail & Related papers (2022-02-15T18:53:14Z)
- Efficient Large Scale Language Modeling with Mixtures of Experts [61.45159383372181]
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation.
This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings.
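The conditional computation mentioned above means that a router activates only a subset of expert networks per token; the sketch below shows generic top-1 routing with a softmax gate, which is an assumption for illustration rather than the paper's specific architecture.
```python
# Minimal illustration of conditional computation in an MoE layer: a router picks
# the top-1 expert per token, so only a fraction of the parameters is used per input.
# Shapes and the softmax router are generic assumptions.
import numpy as np

def moe_layer(x, router_w, experts):
    """x: (d,) token representation; router_w: (d, n_experts); experts: list of callables."""
    scores = x @ router_w                 # routing logits, one per expert
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                  # softmax over experts
    k = int(np.argmax(probs))             # top-1 routing: run a single expert
    return probs[k] * experts[k](x)       # scale the chosen expert's output by its gate
```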
arXiv Detail & Related papers (2021-12-20T17:05:11Z)
- When Ensembling Smaller Models is More Efficient than Single Large Models [52.38997176317532]
We show that ensembles can outperform single models, achieving higher accuracy while requiring fewer total FLOPs to compute.
This presents an interesting observation that output diversity in ensembling can often be more efficient than training larger models.
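One way to picture the observation above: averaging the predicted distributions of a few small models can match or beat one large model at a lower total FLOP count. The uniform averaging below is a generic ensemble, not the paper's specific setup.
```python
# Generic probability-averaging ensemble of small models. Each model is a callable
# returning a probability vector over classes; the uniform average is an assumption.
import numpy as np

def ensemble_predict(models, x):
    probs = np.mean([model(x) for model in models], axis=0)  # average the distributions
    return int(np.argmax(probs))                             # predicted class
```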
arXiv Detail & Related papers (2020-05-01T18:56:18Z)