Language Models for Code Completion: A Practical Evaluation
- URL: http://arxiv.org/abs/2402.16197v1
- Date: Sun, 25 Feb 2024 20:43:55 GMT
- Title: Language Models for Code Completion: A Practical Evaluation
- Authors: Maliheh Izadi, Jonathan Katzy, Tim van Dam, Marc Otten, Razvan Mihai
Popescu, Arie van Deursen
- Abstract summary: This study provides both quantitative and qualitative assessments of three public code language models when completing real-world code.
We collected real auto-completion usage data for over a year from more than 1200 users, resulting in over 600K valid completions.
We found that 66.3% of failures were due to the models' limitations, 24.4% occurred due to inappropriate model usage in a development context, and 9.3% were valid requests that developers overwrote.
- Score: 13.174471984950857
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Transformer-based language models for automatic code completion have shown
great promise so far, yet the evaluation of these models rarely uses real data.
This study provides both quantitative and qualitative assessments of three
public code language models when completing real-world code. We first developed
an open-source IDE extension, Code4Me, for the online evaluation of the models.
We collected real auto-completion usage data for over a year from more than
1200 users, resulting in over 600K valid completions. These models were then
evaluated using six standard metrics across twelve programming languages. Next,
we conducted a qualitative study of 1690 real-world completion requests to
identify the reasons behind the poor model performance. A comparative analysis
of the models' performance in online and offline settings was also performed,
using benchmark synthetic datasets and two masking strategies. Our findings
suggest that while developers utilize code completion across various languages,
the best results are achieved for mainstream languages such as Python and Java.
InCoder outperformed the other models across all programming languages,
highlighting the significance of training data and objectives. Our study also
revealed that offline evaluations do not accurately reflect real-world
scenarios. Upon qualitative analysis of the models' predictions, we found that
66.3% of failures were due to the models' limitations, 24.4% occurred due to
inappropriate model usage in a development context, and 9.3% were valid
requests that developers overwrote. Given these findings, we propose several
strategies to overcome the current limitations. These include refining training
objectives, improving resilience to typographical errors, adopting hybrid
approaches, and enhancing implementations and usability.
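As a rough, hedged illustration of how each logged completion request could be scored against the code the developer ultimately kept, the sketch below computes exact match and edit similarity. The abstract does not name the study's six metrics, so this metric choice, the function names, and the aggregation are assumptions for illustration only, not the Code4Me implementation.

```python
# Illustrative sketch only: the abstract does not name the study's six metrics,
# so exact match and character-level edit similarity are used here as two
# metrics commonly reported for code completion.
from difflib import SequenceMatcher


def exact_match(prediction: str, ground_truth: str) -> bool:
    """True when the suggested completion equals the kept code (ignoring surrounding whitespace)."""
    return prediction.strip() == ground_truth.strip()


def edit_similarity(prediction: str, ground_truth: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, prediction, ground_truth).ratio()


def evaluate(pairs):
    """Average both metrics over (prediction, ground_truth) pairs, e.g. logged completion requests."""
    em = sum(exact_match(p, g) for p, g in pairs) / len(pairs)
    es = sum(edit_similarity(p, g) for p, g in pairs) / len(pairs)
    return {"exact_match": em, "edit_similarity": es}


if __name__ == "__main__":
    logged = [
        ("return x + y", "return x + y"),               # suggestion kept verbatim
        ("for i in range(n):", "for idx in range(n):"), # close, but edited by the developer
    ]
    print(evaluate(logged))
```

In an online setting such as the one the abstract describes, the ground truth is whatever code ends up in the file after the developer accepts, edits, or overwrites the suggestion.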
Related papers
- An evaluation of LLM code generation capabilities through graded exercises [0.7070726553564699]
We run a new evaluation of the performance of one state-of-the-art model (GPT-4o-mini) in solving curated coding challenges in 8 programming languages.
Our analysis shows that the chance of success of the model has a positive correlation with the task difficulty.
A further approximate explanatory analysis in terms of high-level features hints that while 46.6% of the model performance could be attributed to task difficulty, 37.4% seems to be related to leakage of the challenge solutions into the model training set.
arXiv Detail & Related papers (2024-10-06T09:54:54Z)
- Lessons from the Trenches on Reproducible Evaluation of Language Models [60.522749986793094]
We draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers.
We present the Language Model Evaluation Harness (lm-eval), an open source library for the independent, reproducible, and extensible evaluation of language models.
arXiv Detail & Related papers (2024-05-23T16:50:49Z)
- Exploring the Impact of the Output Format on the Evaluation of Large Language Models for Code Translation [8.81447711370817]
We empirically analyze the generated outputs of eleven popular instruct-tuned large language models (LLMs).
Our results demonstrate that a strategic combination of prompt engineering and regular expressions can effectively extract the source code from the model generation output (a minimal sketch of such extraction follows this entry).
arXiv Detail & Related papers (2024-03-25T21:41:31Z)
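As a rough illustration of the technique that summary describes, the sketch below extracts source code from a model response by preferring Markdown-fenced blocks and falling back to the raw output. The pattern, the extract_code helper, and the fallback rule are hypothetical assumptions for illustration; they are not taken from that paper.

```python
import re

# Hypothetical sketch (not the paper's implementation): extract source code from
# an LLM response by preferring Markdown-fenced code blocks and falling back to
# the raw output.
FENCE = "`" * 3  # literal triple backtick, built programmatically to keep this listing readable
FENCE_RE = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    blocks = FENCE_RE.findall(response)
    if blocks:
        # Join all fenced blocks; most responses contain exactly one.
        return "\n".join(block.strip() for block in blocks)
    return response.strip()  # fall back to the raw model output


if __name__ == "__main__":
    reply = "Here is the translated function:\n" + FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE
    print(extract_code(reply))
```

In line with the summary, prompting the model to wrap its code in fences is what makes a simple pattern like this effective.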
- Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) have seen widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
- How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources [117.6496550359768]
This work explores recent advances in instruction-tuning language models on a range of open instruction-following datasets.
We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets.
We evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction following abilities.
arXiv Detail & Related papers (2023-06-07T19:59:23Z)
- Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation [12.86275938443485]
We examine the influence of training data factors, including quantity, quality, and linguistic distribution, on model performance.
We assess various models using an evaluation set of 1,000 samples, encompassing nine real-world scenarios.
We extend the vocabulary of LLaMA, the open-source model whose performance is closest to that of proprietary language models such as GPT-3.
arXiv Detail & Related papers (2023-04-16T18:37:39Z)
- On the Reliability and Explainability of Language Models for Program Generation [15.569926313298337]
We study the capabilities and limitations of automated program generation approaches.
We employ advanced explainable AI approaches to highlight the tokens that significantly contribute to the code transformation.
Our analysis reveals that, in various experimental scenarios, language models can recognize code grammar and structural information, but they exhibit limited robustness to changes in input sequences.
arXiv Detail & Related papers (2023-02-19T14:59:52Z)
- Holistic Evaluation of Language Models [183.94891340168175]
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood.
We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models.
arXiv Detail & Related papers (2022-11-16T18:51:34Z)