Aligning Offline Metrics and Human Judgments of Value for Code
Generation Models
- URL: http://arxiv.org/abs/2210.16494v2
- Date: Tue, 13 Jun 2023 17:45:30 GMT
- Title: Aligning Offline Metrics and Human Judgments of Value for Code
Generation Models
- Authors: Victor Dibia, Adam Fourney, Gagan Bansal, Forough Poursabzi-Sangdeh,
Han Liu and Saleema Amershi
- Abstract summary: We show that while correctness captures high-value generations, programmers still rate code that fails unit tests as valuable if it reduces the overall effort needed to complete a coding task.
We propose a hybrid metric that combines functional correctness and syntactic similarity and show that it achieves a 14% stronger correlation with value.
- Score: 25.726216146776054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have demonstrated great potential to assist programmers
in generating code. For such human-AI pair programming scenarios, we
empirically demonstrate that while generated code is most often evaluated in
terms of its functional correctness (i.e., whether generations pass available
unit tests), correctness does not fully capture (e.g., may underestimate) the
productivity gains these models may provide. Through a user study with N = 49
experienced programmers, we show that while correctness captures high-value
generations, programmers still rate code that fails unit tests as valuable if
it reduces the overall effort needed to complete a coding task. Finally, we
propose a hybrid metric that combines functional correctness and syntactic
similarity and show that it achieves a 14% stronger correlation with value and
can therefore better represent real-world gains when evaluating and comparing
models.
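To make the proposed combination concrete, here is a minimal Python sketch of one way a hybrid correctness-plus-similarity score could be computed. The specific ingredients (a binary unit-test signal, a character-level SequenceMatcher ratio against a reference solution, and an equal weighting) are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import difflib

def hybrid_value_score(passes_tests: bool,
                       generated_code: str,
                       reference_code: str,
                       weight: float = 0.5) -> float:
    """Illustrative hybrid metric: functional correctness blended with
    syntactic similarity. The components and weighting are assumptions,
    not the paper's exact formulation."""
    # Functional correctness component: 1.0 if the generation passes
    # the available unit tests, 0.0 otherwise.
    correctness = 1.0 if passes_tests else 0.0

    # Syntactic similarity component: character-level matching ratio
    # between the generation and a reference solution, in [0, 1].
    similarity = difflib.SequenceMatcher(
        None, generated_code, reference_code).ratio()

    # Blend the two signals so that failing-but-close generations
    # still receive partial credit for reducing programmer effort.
    return weight * correctness + (1.0 - weight) * similarity


# Example: a generation that fails its unit tests but closely matches
# a reference implementation still scores well above zero.
generated = "def add(a, b):\n    return a - b\n"
reference = "def add(a, b):\n    return a + b\n"
print(hybrid_value_score(False, generated, reference))
```

The point of the blend is visible in the example: a generation that fails its unit tests but is one token away from a working solution still receives substantial credit, which mirrors the finding that programmers assign value to near-miss generations.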
Related papers
- LLM-based Test-driven Interactive Code Generation: User Study and Empirical Evaluation [13.800675921118348]
We propose a novel interactive workflow TiCoder for guided intent clarification.
We present an empirical evaluation of the effectiveness of the workflow to improve code generation accuracy.
We observe an average absolute improvement of 38.43% in the pass@1 code generation accuracy for both datasets and across all LLMs within 5 user interactions (the pass@k estimator behind such numbers is sketched after this list).
arXiv Detail & Related papers (2024-04-15T19:16:32Z) - The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers [44.28269395385471]
We study whether gains on existing benchmarks translate to gains in programmer productivity when coding with large language models (LLMs)
We introduce RealHumanEval, a web interface to measure the ability of LLMs to assist programmers, through either autocomplete or chat support.
We find that programmer preferences do not correlate with their actual performance, motivating the need for better, human-centric proxy signals.
arXiv Detail & Related papers (2024-04-03T15:20:57Z) - Code Representation Learning At Scale [75.04686476303436]
We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
We first train the encoders via a mix that leverages both randomness in masking language modeling and the structure aspect of programming language.
We then enhance the representations via contrastive learning with hard negative and hard positive constructed in an unsupervised manner.
arXiv Detail & Related papers (2024-02-02T22:19:15Z) - ICE-Score: Instructing Large Language Models to Evaluate Code [7.556444391696562]
We propose ICE-Score, a new evaluation metric that instructs large language models to perform code assessments.
Our metric addresses the limitations of existing approaches by achieving superior correlations with functional correctness and human preferences.
Our results demonstrate that our metric surpasses state-of-the-art metrics for code generation.
arXiv Detail & Related papers (2023-04-27T16:38:17Z) - AI-assisted coding: Experiments with GPT-4 [0.22366638308792727]
GPT-4 can generate tests with substantial coverage, but many of the tests fail when applied to the associated code.
These findings suggest that while AI coding tools are very powerful, they still require humans in the loop to ensure validity and accuracy of the results.
arXiv Detail & Related papers (2023-04-25T22:59:01Z) - Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z) - On the Reliability and Explainability of Language Models for Program Generation [15.569926313298337]
We study the capabilities and limitations of automated program generation approaches.
We employ advanced explainable AI approaches to highlight the tokens that significantly contribute to the code transformation.
Our analysis reveals that, in various experimental scenarios, language models can recognize code grammar and structural information, but they exhibit limited robustness to changes in input sequences.
arXiv Detail & Related papers (2023-02-19T14:59:52Z) - ReCode: Robustness Evaluation of Code Generation Models [90.10436771217243]
We propose ReCode, a comprehensive robustness evaluation benchmark for code generation models.
We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format.
With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt.
arXiv Detail & Related papers (2022-12-20T14:11:31Z) - Coder Reviewer Reranking for Code Generation [56.80381384717]
We propose Coder-Reviewer reranking as a method for sampling diverse programs from a code language model and reranking with model likelihood.
Experimental results show that Coder-Reviewer reranking leads to consistent and significant improvement over reranking with the Coder model only.
Coder-Reviewer reranking is easy to implement by prompting, can generalize to different programming languages, and works well with off-the-shelf hyperparameters.
arXiv Detail & Related papers (2022-11-29T18:56:33Z) - CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstring for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
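Several entries above report functional-correctness numbers such as pass@1 (TiCoder) or the fraction of test cases a model passes (APPS). For reference, below is a minimal Python sketch of the standard unbiased pass@k estimator popularized by the Codex evaluation; individual papers may compute their headline numbers differently, so treat this as an illustration of the common convention rather than any one paper's exact procedure.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn from n generations is correct, given that c of the
    n generations pass all unit tests."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), evaluated as a product of ratios.
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Example: 200 samples per problem, 37 of which pass the tests.
print(pass_at_k(n=200, c=37, k=1))   # ~0.19
print(pass_at_k(n=200, c=37, k=10))  # ~0.88
```

The product form is the standard numerically stable way to evaluate 1 - C(n-c, k)/C(n, k) without forming large binomial coefficients.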