Out of the BLEU: how should we assess quality of the Code Generation models?
- URL: http://arxiv.org/abs/2208.03133v2
- Date: Wed, 10 May 2023 11:14:17 GMT
- Title: Out of the BLEU: how should we assess quality of the Code Generation models?
- Authors: Mikhail Evtikhiev, Egor Bogomolov, Yaroslav Sokolov, Timofey Bryksin
- Abstract summary: We present a study on the applicability of six metrics -- BLEU, ROUGE-L, METEOR, ChrF, CodeBLEU, and RUBY -- for evaluation of code generation models.
None of the metrics can correctly emulate human judgement on which model is better with >95% certainty if the difference in model scores is less than 5 points.
Our findings suggest that the ChrF metric is a better fit for the evaluation of code generation models than the commonly used BLEU and CodeBLEU.
- Score: 3.699097874146491
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, researchers have created and introduced a significant number of code generation models. As human evaluation of every new model version is unfeasible, the community has adopted automatic evaluation metrics such as BLEU to approximate the results of human judgement. These metrics originate from the machine translation domain, and it is unclear whether they are applicable to code generation tasks and how well they agree with human evaluation on such tasks. There are also other metrics, CodeBLEU and RUBY, developed to estimate the similarity of code, that take the properties of source code into account. However, there are hardly any studies of how well these metrics agree with human evaluation. Despite all that, minimal differences in metric scores have been used in recent papers to claim the superiority of some code generation models over others.

In this paper, we present a study on the applicability of six metrics -- BLEU, ROUGE-L, METEOR, ChrF, CodeBLEU, and RUBY -- for the evaluation of code generation models. We conduct a study on two different code generation datasets and use human annotators to assess the quality of all models run on these datasets. The results indicate that for the CoNaLa dataset of Python one-liners, none of the metrics can correctly emulate human judgement on which model is better with >95% certainty if the difference in model scores is less than 5 points. For the HearthStone dataset, which consists of classes of a particular structure, a difference in model scores of at least 2 points is enough to claim the superiority of one model over the other. Our findings suggest that the ChrF metric is a better fit for the evaluation of code generation models than the commonly used BLEU and CodeBLEU. Yet, finding a metric for code generation that closely agrees with humans requires additional work.
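To make the comparison above concrete, here is a minimal sketch of how such corpus-level metric scores and the associated certainty check could be computed. It assumes the sacrebleu package; the reference snippets and model outputs are toy stand-ins, not data from CoNaLa or HearthStone, and the bootstrap routine is a generic paired-resampling test rather than the paper's exact statistical procedure.

```python
# A minimal sketch, assuming sacrebleu (pip install sacrebleu). The snippets
# below are toy examples, not taken from the CoNaLa or HearthStone datasets.
import random

from sacrebleu.metrics import BLEU, CHRF

refs = [
    "x = [i ** 2 for i in range(10)]",
    "with open(path) as f: data = f.read()",
    "result = sorted(items, key=len)",
]
model_a = [
    "x = [i * i for i in range(10)]",
    "with open(path) as f: data = f.read()",
    "result = sorted(items, key=len)",
]
model_b = [
    "x = list(range(10))",
    "data = open(path).read()",
    "result = sorted(items)",
]

bleu, chrf = BLEU(), CHRF()
for name, hyps in [("model A", model_a), ("model B", model_b)]:
    print(name,
          "BLEU:", round(bleu.corpus_score(hyps, [refs]).score, 2),
          "ChrF:", round(chrf.corpus_score(hyps, [refs]).score, 2))

def paired_bootstrap(hyps_a, hyps_b, metric, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples in which model A outscores model B."""
    rng = random.Random(seed)
    n, wins = len(refs), 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        sub_refs = [[refs[i] for i in idx]]
        a = metric.corpus_score([hyps_a[i] for i in idx], sub_refs).score
        b = metric.corpus_score([hyps_b[i] for i in idx], sub_refs).score
        wins += a > b
    return wins / n_resamples

# A score gap is only trustworthy if one model wins in, say, >95% of resamples.
print("P(A > B) under ChrF:", paired_bootstrap(model_a, model_b, chrf))
```

On real outputs, the paper's thresholds suggest that a metric gap under roughly 5 points on CoNaLa-like data should not be read as a reliable win, whereas roughly 2 points suffices on HearthStone-like data.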
Related papers
- Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach [69.38352966504401]
We investigate the legal and ethical issues of current neural code completion models.
We tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks.
We evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models.
arXiv Detail & Related papers (2024-04-22T15:54:53Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- What is the Best Automated Metric for Text to Motion Generation? [19.71712698183703]
There is growing interest in generating skeleton-based human motions from natural language descriptions.
Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments.
This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better.
arXiv Detail & Related papers (2023-09-19T01:59:54Z)
- Method-Level Bug Severity Prediction using Source Code Metrics and LLMs [0.628122931748758]
We investigate source code metrics, source code representation using large language models (LLMs), and their combination in predicting bug severity labels.
Our results suggest that Decision Tree and Random Forest models outperform the other models across several evaluation metrics.
Fine-tuning CodeBERT improves bug severity prediction significantly, by 29%-140% across several evaluation metrics.
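As a rough illustration of the metrics-based baseline, here is a minimal sketch assuming scikit-learn; the feature set (lines of code, cyclomatic complexity, parameter count) and the toy labels are illustrative assumptions, not the paper's actual data.

```python
# A minimal sketch, assuming scikit-learn. Features and labels are toy
# placeholders for method-level source code metrics, not the paper's data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical features per method: [lines_of_code, cyclomatic_complexity, num_params]
X = [[12, 3, 1], [85, 14, 5], [40, 7, 2], [150, 22, 6], [8, 1, 0], [60, 9, 3]]
y = ["low", "high", "medium", "high", "low", "medium"]  # bug severity labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```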
arXiv Detail & Related papers (2023-09-06T14:38:07Z)
- ReCode: Robustness Evaluation of Code Generation Models [90.10436771217243]
We propose ReCode, a comprehensive robustness evaluation benchmark for code generation models.
We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format.
With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt.
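To give a flavor of such a transformation (an illustrative assumption; ReCode's actual 30+ transformations are more varied), a minimal identifier-renaming perturbation might look like this:

```python
# A minimal sketch of a semantics-preserving prompt perturbation, in the
# spirit of ReCode's variable-name transformations (illustrative, not the
# benchmark's actual implementation).
import re

def rename_identifier(prompt: str, old: str, new: str) -> str:
    """Replace whole-word occurrences of an identifier in a code prompt."""
    return re.sub(rf"\b{re.escape(old)}\b", new, prompt)

prompt = (
    "def count_vowels(text):\n"
    '    """Return the number of vowels in text."""\n'
)
# The perturbed prompt should still mean the same thing to a human reader.
print(rename_identifier(prompt, "text", "s"))
```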
arXiv Detail & Related papers (2022-12-20T14:11:31Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
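The idea can be sketched with a surface metric standing in for the model-based metrics the paper actually studies (an assumption made for brevity): inject an error into an otherwise faithful output and check whether the score drops accordingly.

```python
# A minimal sketch of the stress-testing idea, assuming sacrebleu. ChrF here
# stands in for the model-based metrics the paper examines.
from sacrebleu.metrics import CHRF

chrf = CHRF()
reference = ["The function returns the list sorted in ascending order."]
faithful  = "The function returns the list sorted in ascending order."
corrupted = "The function returns the list sorted in descending order."  # injected, meaning-inverting error

for name, hyp in [("faithful", faithful), ("corrupted", corrupted)]:
    print(name, round(chrf.corpus_score([hyp], [reference]).score, 2))
# If the corrupted output scores nearly as high as the faithful one,
# that error type is a blind spot of the metric.
```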
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
arXiv Detail & Related papers (2022-12-12T06:29:04Z)
- CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstrings for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
- A Study on the Evaluation of Generative Models [19.18642459565609]
Implicit generative models, which do not return likelihood values, have become prevalent in recent years.
In this work, we study the evaluation metrics of generative models by generating a high-quality synthetic dataset.
Our study shows that while FID and IS do correlate to several f-divergences, their ranking of close models can vary considerably.
arXiv Detail & Related papers (2022-06-22T09:27:31Z)
- Code to Comment Translation: A Comparative Study on Model Effectiveness & Errors [19.653423881863834]
Machine translation models are employed to "translate" code snippets into relevant natural language descriptions.
Most evaluations of such models are conducted using automatic reference-based metrics.
We compare three recently proposed source code summarization models based on the smoothed BLEU-4, METEOR, and ROUGE-L machine translation metrics.
Our investigation reveals new insights into the relationship between metric-based performance and model prediction errors grounded in an empirically derived error taxonomy.
arXiv Detail & Related papers (2021-06-15T20:13:14Z)