CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
- URL: http://arxiv.org/abs/2009.10297v2
- Date: Sun, 27 Sep 2020 04:07:11 GMT
- Title: CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
- Authors: Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel
Sundaresan, Ming Zhou, Ambrosio Blanco, Shuai Ma
- Abstract summary: In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy.
We introduce a new automatic evaluation metric, dubbed CodeBLEU.
It absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluation metrics play a vital role in the growth of a research
area, as they define the standard for distinguishing between good and bad
models. In the area of code synthesis, the commonly used evaluation metrics
are BLEU and perfect accuracy, but neither is well suited to evaluating code:
BLEU was originally designed for natural language and neglects important
syntactic and semantic features of code, while perfect accuracy is too strict
and underestimates different outputs that share the same semantic logic. To
remedy this, we introduce a new automatic evaluation metric, dubbed CodeBLEU.
It absorbs the strength of BLEU in the n-gram match and further injects code
syntax via abstract syntax trees (AST) and code semantics via data-flow. We
conduct experiments evaluating the correlation coefficient between CodeBLEU
and quality scores assigned by programmers on three code synthesis tasks,
i.e., text-to-code, code translation, and code refinement. Experimental
results show that the proposed CodeBLEU achieves better correlation with
programmer-assigned scores than BLEU and accuracy.
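At the top level, CodeBLEU is a weighted combination of four component scores. A minimal Python sketch of that combination (the component scores, including the AST and data-flow matches, are assumed to be computed elsewhere; uniform 0.25 weights are the commonly used default):

```python
# CodeBLEU = alpha*BLEU + beta*BLEU_weight + gamma*Match_ast + delta*Match_df
def code_bleu(bleu: float, weighted_bleu: float, ast_match: float,
              dataflow_match: float,
              weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Combine the four CodeBLEU components into a single score in [0, 1]."""
    alpha, beta, gamma, delta = weights
    return (alpha * bleu                # standard n-gram match
            + beta * weighted_bleu      # n-gram match with keyword weighting
            + gamma * ast_match         # syntactic match via AST subtrees
            + delta * dataflow_match)   # semantic match via data-flow graphs

# Example: perfect syntactic/semantic match but imperfect surface n-grams.
print(code_bleu(0.6, 0.7, 1.0, 1.0))  # 0.825
```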
Related papers
- Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective [50.261681681643076]
We propose a novel metric called SemVarEffect and a benchmark named SemVarBench to evaluate the causality between semantic variations in inputs and outputs in text-to-image synthesis.
Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
arXiv Detail & Related papers (2024-10-14T08:45:35Z)
- CodeScore-R: An Automated Robustness Metric for Assessing the Functional Correctness of Code Synthesis [17.747095451792084]
We propose an automated robust metric, called CodeScore-R, for evaluating the functionality of code synthesis.
In the tasks of code generation and migration in Java and Python, CodeScore-R outperforms other metrics.
arXiv Detail & Related papers (2024-06-11T02:51:17Z)
- When simplicity meets effectiveness: Detecting code comments coherence with word embeddings and LSTM [6.417777780911223]
Code comments play a crucial role in software development, as they provide programmers with practical information.
Developers tend to leave comments unchanged after updating the code, resulting in a discrepancy between the two artifacts.
It is crucial to identify if, given a code snippet, its corresponding comment is coherent and reflects well the intent behind the code.
arXiv Detail & Related papers (2024-05-25T15:21:27Z)
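The summary above does not give that paper's exact architecture; a rough sketch of the general recipe it names (word embeddings fed to an LSTM, followed by a coherent/incoherent classifier), with all layer sizes and the single-sequence input format being illustrative assumptions:

```python
import torch.nn as nn

class CoherenceClassifier(nn.Module):
    """Illustrative sketch, not the paper's exact model: embed the tokens of
    a (comment, code) pair, run an LSTM over them, and classify coherence
    from the final hidden state. All sizes are arbitrary assumptions."""
    def __init__(self, vocab_size=20000, embed_dim=128, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 2)   # coherent vs. incoherent

    def forward(self, token_ids):              # (batch, seq_len) int tensor
        x = self.embed(token_ids)               # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)               # h_n: (1, batch, hidden_dim)
        return self.head(h_n.squeeze(0))         # (batch, 2) logits
```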
- Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
arXiv Detail & Related papers (2024-05-25T08:57:28Z)
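One plausible reading of the detector above, as a hedged sketch: rewrite the code with an LLM several times and score the average similarity, on the intuition that LLM-generated code changes less under rewriting than human-written code. `rewrite_with_llm` and `similarity` are hypothetical stand-ins for components the summary does not specify:

```python
def detect_synthetic(code: str, rewrite_with_llm, similarity,
                     n_rewrites: int = 4, threshold: float = 0.8) -> bool:
    """Hypothetical sketch, not the paper's exact algorithm: average the
    similarity between the original code and several LLM rewrites; a high
    average similarity suggests the code itself was LLM-generated."""
    rewrites = [rewrite_with_llm(code) for _ in range(n_rewrites)]
    avg_sim = sum(similarity(code, r) for r in rewrites) / n_rewrites
    return avg_sim >= threshold
```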
- On the Limitations of Embedding Based Methods for Measuring Functional Correctness for Code Generation [4.065344017083881]
We analyze the ability of embedding-based metrics like CodeBERTScore to measure functional correctness and other helpful constructs like editing effort.
Our results show that while they have a weak correlation with functional correctness (0.16), they are strongly correlated (0.72) with editing effort.
arXiv Detail & Related papers (2024-04-26T15:54:39Z)
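An analysis like the one summarized above boils down to correlating metric scores against functional-correctness labels and editing effort. An illustrative outline (the data and the choice of Spearman rank correlation are assumptions, not the paper's setup):

```python
from scipy.stats import spearmanr  # Pearson/Kendall are common alternatives

# Illustrative inputs, not the paper's data: one metric score per sample,
# a pass/fail functional-correctness label, and a normalized editing effort.
metric_scores  = [0.91, 0.42, 0.77, 0.55, 0.88, 0.30]
passes_tests   = [1,    0,    0,    1,    1,    0   ]
editing_effort = [0.05, 0.60, 0.35, 0.40, 0.10, 0.75]

corr_func, _ = spearmanr(metric_scores, passes_tests)
corr_edit, _ = spearmanr(metric_scores, editing_effort)
print(f"vs. functional correctness: {corr_func:.2f}")
print(f"vs. editing effort:         {corr_edit:.2f}")
```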
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Towards Computationally Verifiable Semantic Grounding for Language Models [18.887697890538455]
The paper conceptualizes the LM as a conditional model generating text given a desired semantic message formalized as a set of entity-relationship triples.
It embeds the LM in an auto-encoder by feeding its output to a semantic parser whose output is in the same representation domain as the input message.
We show that our proposed approaches significantly improve on the greedy search baseline.
arXiv Detail & Related papers (2022-11-16T17:35:52Z)
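A rough sketch of the auto-encoder-style consistency check described above, with `lm_generate` and `semantic_parse` as hypothetical stand-ins for the LM and the parser (the paper's actual components and scoring are not reproduced here):

```python
def grounding_score(triples, lm_generate, semantic_parse):
    """Hypothetical sketch of the verification loop: the LM turns
    entity-relationship triples into text, a semantic parser maps the text
    back into triples, and agreement with the input message is scored with
    a simple F1 over triples. `triples` is a list of hashable tuples."""
    text = lm_generate(triples)              # triples -> text
    recovered = set(semantic_parse(text))    # text -> triples
    expected = set(triples)
    tp = len(expected & recovered)
    if tp == 0:
        return 0.0
    precision = tp / len(recovered)
    recall = tp / len(expected)
    return 2 * precision * recall / (precision + recall)
```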
- Soft-Labeled Contrastive Pre-training for Function-level Code Representation [127.71430696347174]
We present SCodeR, a Soft-labeled contrastive pre-training framework with two positive sample construction methods.
By exploiting the relevance between code fragments in a large-scale code corpus, the soft-labeled contrastive pre-training can obtain fine-grained soft labels.
SCodeR achieves new state-of-the-art performance on four code-related tasks over seven datasets.
arXiv Detail & Related papers (2022-10-18T05:17:37Z)
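A minimal sketch of a soft-labeled contrastive objective in the spirit of the SCodeR entry above: a cross-entropy between a soft target distribution over in-batch pairs and the model's similarity distribution. This is not SCodeR's exact loss, and how the soft labels are derived is not reproduced here:

```python
import torch.nn.functional as F

def soft_labeled_contrastive_loss(embeddings_a, embeddings_b, soft_labels,
                                  temperature: float = 0.05):
    """Illustrative sketch: unlike the hard one-hot targets of standard
    InfoNCE, each row of `soft_labels` is a fine-grained target distribution
    over the in-batch candidates, so partially relevant code pairs receive
    partial credit. embeddings_a/_b: (batch, dim); soft_labels: (batch,
    batch) with rows summing to 1."""
    a = F.normalize(embeddings_a, dim=-1)
    b = F.normalize(embeddings_b, dim=-1)
    logits = a @ b.T / temperature          # pairwise cosine similarities
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_labels * log_probs).sum(dim=-1).mean()
```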
- Interactive Code Generation via Test-Driven User-Intent Formalization [60.90035204567797]
Large language models (LLMs) produce code from informal natural language (NL) intent.
It is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics.
We describe a language-agnostic abstract algorithm and a concrete implementation, TiCoder.
arXiv Detail & Related papers (2022-08-11T17:41:08Z)
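A rough sketch of the test-driven workflow described in the entry above, not TiCoder's exact algorithm: generate candidate programs and candidate tests from the NL intent, ask the user to approve tests that match their intent, and keep only candidates passing every approved test. `gen_code`, `gen_tests`, `run_test`, and `ask_user` are hypothetical stand-ins:

```python
def interactive_codegen(intent, gen_code, gen_tests, run_test, ask_user,
                        n_candidates: int = 10):
    """Hypothetical sketch: approved tests act as a partial formalization of
    the user's intent and prune inconsistent code candidates."""
    candidates = gen_code(intent, n_candidates)
    approved = [t for t in gen_tests(intent) if ask_user(t)]
    surviving = [c for c in candidates
                 if all(run_test(c, t) for t in approved)]
    return surviving, approved
```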