When Neural Code Completion Models Size up the Situation: Attaining
Cheaper and Faster Completion through Dynamic Model Inference
- URL: http://arxiv.org/abs/2401.09964v1
- Date: Thu, 18 Jan 2024 13:26:53 GMT
- Title: When Neural Code Completion Models Size up the Situation: Attaining
Cheaper and Faster Completion through Dynamic Model Inference
- Authors: Zhensu Sun, Xiaoning Du, Fu Song, Shangwen Wang, Li Li
- Abstract summary: We propose a novel dynamic inference method specifically tailored for code completion models.
On average, it skips 1.7 of the models' 16 layers, yielding an 11.2% speedup with only a marginal 1.1% reduction in ROUGE-L.
- Score: 11.704110756342212
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Leveraging recent advancements in large language models, modern neural code
completion models have demonstrated the capability to generate highly accurate
code suggestions. However, their massive size poses challenges in terms of
computational costs and environmental impact, hindering their widespread
adoption in practical scenarios. Dynamic inference emerges as a promising
solution, as it allocates minimal computation during inference while
maintaining the model's performance. In this research, we explore dynamic
inference within the context of code completion. Initially, we conducted an
empirical investigation on GPT-2, focusing on the inference capabilities of
intermediate layers for code completion. We found that 54.4% of tokens can be
accurately generated using just the first layer, indicating substantial potential for computational savings. Moreover, even when all layers are used, the model still fails to predict 14.5% of tokens correctly, and the completions that continue from these mispredicted tokens are rarely helpful, with an Acceptance Rate of only 4.2%. These findings motivate our exploration of dynamic inference
in code completion and inspire us to enhance it with a decision-making
mechanism that stops the generation of incorrect code. We thus propose a novel
dynamic inference method specifically tailored for code completion models. This
method aims not only to produce correct predictions with greatly reduced
computation but also to prevent incorrect predictions proactively. Our
extensive evaluation shows that it skips an average of 1.7 of the models' 16 layers, leading to an 11.2% speedup with only a marginal 1.1%
reduction in ROUGE-L.
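
To make the idea concrete, below is a minimal sketch of per-token early exit with an abstention rule, in the spirit of the method described in the abstract. The stand-in transformer blocks, the shared LM head, and both confidence thresholds are illustrative assumptions, not the authors' actual design.

```python
# Hypothetical early-exit decoding sketch (PyTorch). Thresholds, model sizes,
# and the abstention rule are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn

VOCAB, D_MODEL, N_LAYERS = 1000, 64, 16
EXIT_THRESHOLD = 0.9   # confidence needed to stop at an intermediate layer
STOP_THRESHOLD = 0.3   # if still below this at the last layer, abstain

class EarlyExitLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        # Stand-in blocks; a real code model would use causally masked self-attention.
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
            for _ in range(N_LAYERS)
        )
        self.lm_head = nn.Linear(D_MODEL, VOCAB)  # shared across all exit points

    @torch.no_grad()
    def predict_next(self, token_ids: torch.Tensor):
        """Return (token, layers_used), or (None, layers_used) if we abstain."""
        x = self.embed(token_ids)
        for depth, layer in enumerate(self.layers, start=1):
            x = layer(x)
            probs = torch.softmax(self.lm_head(x[:, -1]), dim=-1)
            conf, token = probs.max(dim=-1)
            if conf.item() >= EXIT_THRESHOLD:   # confident: skip the remaining layers
                return token.item(), depth
        if conf.item() < STOP_THRESHOLD:        # still unsure at full depth: stop generating
            return None, N_LAYERS
        return token.item(), N_LAYERS

model = EarlyExitLM().eval()
print(model.predict_next(torch.randint(0, VOCAB, (1, 8))))
```

Here, exiting early stands in for skipping the remaining layers once an intermediate representation is already confident, and the abstention branch loosely mirrors the decision-making mechanism that stops the generation of likely-incorrect code.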
Related papers
- Dynamic layer selection in decoder-only transformers [21.18795712840146]
We empirically examine two common dynamic inference methods for natural language generation.
We find that a pre-trained decoder-only model is significantly more robust to layer removal via layer skipping.
We also show that dynamic computation allocation on a per-sequence basis holds promise for significant efficiency gains.
arXiv Detail & Related papers (2024-10-26T00:44:11Z)
- FT2Ra: A Fine-Tuning-Inspired Approach to Retrieval-Augmented Code Completion [24.964973946366335]
We develop a novel retrieval-based method, FT2Ra, which aims to mimic genuine fine-tuning.
FT2Ra achieves a 4.29% improvement in accuracy compared to the best baseline method on UniXcoder.
arXiv Detail & Related papers (2024-04-02T01:42:15Z)
- Predicting Emergent Abilities with Infinite Resolution Evaluation [85.89911520190711]
We introduce PassUntil, an evaluation strategy with theoretically infinite resolution, through massive sampling in the decoding phase.
We predict the performance of the 2.4B model on code generation with merely 0.05% deviation before training starts.
We identify a kind of accelerated emergence whose scaling curve cannot be fitted by a standard scaling-law function.
arXiv Detail & Related papers (2023-10-05T02:35:00Z)
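
The PassUntil entry above relies on massive sampling during decoding to resolve very small pass rates. As a rough, hedged illustration of a sample-until-success estimator (not necessarily PassUntil's actual estimator), one could do something like the following, where `attempt_passes` is an assumed stand-in for generating a completion and running its tests:

```python
# Hypothetical sample-until-success estimator for tiny pass probabilities.
import random

def estimate_pass_rate(attempt_passes, num_instances=200, max_samples=10**6):
    """Estimate a small pass probability by counting samples until the first pass."""
    draws = []
    for _ in range(num_instances):
        for k in range(1, max_samples + 1):
            if attempt_passes():
                draws.append(k)
                break
        else:
            draws.append(max_samples)  # censored: no pass observed within the budget
    # For geometric trials, E[draws] = 1/p, so 1/mean(draws) is a simple estimator.
    return len(draws) / sum(draws)

true_p = 1e-3
print(estimate_pass_rate(lambda: random.random() < true_p))
```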
- Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond [52.656743602538825]
Fine-tuning pre-trained code models incurs a large computational cost.
We conduct an experimental study to explore what happens to layer-wise pre-trained representations and their encoded code knowledge during fine-tuning.
We propose Telly to efficiently fine-tune pre-trained code models via layer freezing.
arXiv Detail & Related papers (2023-04-11T13:34:13Z)
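
The Telly entry above fine-tunes pre-trained code models with lower layers frozen. Below is a generic layer-freezing sketch in PyTorch with Hugging Face transformers; the GPT-2 checkpoint and the choice of freezing the embeddings plus the first 8 blocks are placeholders, not Telly's actual configuration.

```python
# Generic layer-freezing fine-tuning sketch (PyTorch + Hugging Face transformers).
# The checkpoint and the number of frozen blocks are illustrative assumptions.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for a code model

def freeze_bottom_layers(model, num_frozen: int):
    """Disable gradients for the embeddings and the first `num_frozen` blocks."""
    for emb in (model.transformer.wte, model.transformer.wpe):
        for param in emb.parameters():
            param.requires_grad = False
    for block in model.transformer.h[:num_frozen]:
        for param in block.parameters():
            param.requires_grad = False

freeze_bottom_layers(model, num_frozen=8)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters after freezing: {trainable:,}")
```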
- Controlled Sparsity via Constrained Optimization or: How I Learned to Stop Tuning Penalties and Love Constraints [81.46143788046892]
We focus on the task of controlling the level of sparsity when performing sparse learning.
Existing methods based on sparsity-inducing penalties involve expensive trial-and-error tuning of the penalty factor.
We propose a constrained formulation where sparsification is guided by the training objective and the desired sparsity target in an end-to-end fashion.
arXiv Detail & Related papers (2022-08-08T21:24:20Z)
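
The constrained-sparsity entry above replaces penalty tuning with an explicit sparsity target. The toy below illustrates the general recipe of pairing the training loss with a Lagrange multiplier that is updated by gradient ascent on the constraint violation; the sigmoid-relaxed gates and all hyperparameters are simplifying assumptions, not the paper's exact formulation.

```python
# Toy constrained-sparsity training loop (PyTorch): keep the expected fraction of
# active weights at or below `target_density` via a Lagrange multiplier.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(256, 20)
y = x @ torch.randn(20, 1) + 0.1 * torch.randn(256, 1)

weight = nn.Parameter(torch.randn(20, 1) * 0.1)
gate_logits = nn.Parameter(torch.zeros(20, 1))   # relaxed (probabilistic) gates
lam = torch.zeros(())                            # dual variable, lambda >= 0
target_density = 0.25
opt = torch.optim.Adam([weight, gate_logits], lr=0.05)

for step in range(500):
    gates = torch.sigmoid(gate_logits)           # expected gate values in (0, 1)
    pred = x @ (weight * gates)
    task_loss = ((pred - y) ** 2).mean()
    violation = gates.mean() - target_density    # constraint: density <= target
    loss = task_loss + lam * violation           # Lagrangian (primal step)
    opt.zero_grad()
    loss.backward()
    opt.step()
    lam = torch.clamp(lam + 0.1 * violation.detach(), min=0.0)  # dual ascent step

print(f"density={torch.sigmoid(gate_logits).mean().item():.2f}, lambda={lam.item():.2f}")
```

The dual update raises the multiplier while the density sits above the target and lets it relax back toward zero once the constraint is satisfied, which is what removes the need to hand-tune a penalty factor.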
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- a potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
- Adversarial Robustness Assessment of NeuroEvolution Approaches [1.237556184089774]
We evaluate the robustness of models found by two NeuroEvolution approaches on the CIFAR-10 image classification task.
Our results show that when the evolved models are attacked with iterative methods, their accuracy usually drops to, or close to, zero.
Some of these techniques can exacerbate the perturbations added to the original inputs, potentially harming robustness.
arXiv Detail & Related papers (2022-07-12T10:40:19Z)
- Toward Less Hidden Cost of Code Completion with Acceptance and Ranking Models [12.736207952790618]
We develop an ensemble framework that can combine results from multiple models to draw merits and offset defects of each model.
This paper conducts a coding simulation to collect data from code context and different code completion models.
We propose a new code completion evaluation metric, the Benefit-Cost Ratio (BCR), which accounts for the benefit of keystroke savings and the hidden cost of browsing the completion list.
arXiv Detail & Related papers (2021-06-26T03:02:49Z)
- Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models.
Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
arXiv Detail & Related papers (2021-06-18T01:03:13Z)
- Confidence Adaptive Anytime Pixel-Level Recognition [86.75784498879354]
Anytime inference requires a model to make a progression of predictions which might be halted at any time.
We propose the first unified and end-to-end model approach for anytime pixel-level recognition.
arXiv Detail & Related papers (2021-04-01T20:01:57Z)
- Accelerating Deep Learning Inference via Freezing [8.521443408415868]
We present Freeze Inference, a system that introduces approximate caching at each intermediate layer.
We find that this can potentially reduce the number of effective layers by half for 91.58% of CIFAR-10 requests run on ResNet-18.
arXiv Detail & Related papers (2020-02-07T07:03:58Z)
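
The Freeze Inference entry above introduces approximate caching at intermediate layers. The toy below caches an intermediate activation together with the final prediction and reuses that prediction whenever a new input's activation is close enough; the single cache point, the L2 distance, and the threshold are simplifying assumptions, not the system's actual design.

```python
# Toy approximate-cache sketch: skip the upper layers when an intermediate
# activation is close to one seen before. Threshold and layer split are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
lower = nn.Sequential(nn.Linear(32, 64), nn.ReLU())                     # always executed
upper = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))  # skipped on a hit
cache_keys, cache_preds = [], []
THRESHOLD = 1.0  # maximum L2 distance for a cache hit

@torch.no_grad()
def infer(x: torch.Tensor) -> int:
    h = lower(x)
    if cache_keys:  # nearest-neighbour lookup among cached intermediate activations
        dists = torch.stack([torch.dist(h, k) for k in cache_keys])
        i = int(dists.argmin())
        if dists[i] < THRESHOLD:
            return cache_preds[i]          # cache hit: upper layers skipped
    pred = int(upper(h).argmax())          # cache miss: run the full network
    cache_keys.append(h)
    cache_preds.append(pred)
    return pred

for _ in range(5):
    print(infer(torch.randn(1, 32)))
```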
This list is automatically generated from the titles and abstracts of the papers on this site.