Related papers: CodeScore-R: An Automated Robustness Metric for Assessing the Functional Correctness of Code Synthesis

URL: http://arxiv.org/abs/2406.06902v1
Date: Tue, 11 Jun 2024 02:51:17 GMT
Title: CodeScore-R: An Automated Robustness Metric for Assessing the Functional Correctness of Code Synthesis
Authors: Guang Yang, Yu Zhou, Xiang Chen, Xiangyu Zhang
Abstract summary: We propose an automated robust metric, called CodeScore-R, for evaluating the functionality of code synthesis. In the tasks of code generation and migration in Java and Python, CodeScore-R outperforms other metrics.
Score: 17.747095451792084
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluation metrics are crucial in the field of code synthesis. Commonly used code evaluation metrics can be classified into three types: match-based, semantic-based, and execution-based. Among them, the execution-based Pass@k metric accurately assesses the functionality of predicted code by executing test cases. However, calculating this metric requires a significant amount of overhead, necessitating the design of an automated evaluation metric that can assess the functionality of predicted code without the need for test cases. Additionally, a good evaluation metric should be robust, that is the metric can maintain its accuracy even when the predicted code undergoes minor changes. To address these challenges, we propose an automated robust metric, called CodeScore-R, based on UniXcoder and contrastive learning, for evaluating the functionality of code synthesis. CodeScore-R employs techniques such as sketch-based processing, syntactic-equivalent transformations, and mutation testing to effectively mitigate the interference caused by identifiers, syntax structures, and operators on evaluation results. Experimental results demonstrate that in the tasks of code generation and migration in Java and Python, CodeScore-R outperforms other evaluation metrics and is more closely aligned with the Pass@k metric, while exhibiting stronger robustness.
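
For readers unfamiliar with the execution-based Pass@k metric mentioned in the abstract, the following is a minimal sketch of how it is commonly estimated. The abstract does not say which estimator the authors use; this sketch assumes the widely used unbiased form 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass all test cases. The function name `pass_at_k` and its parameters are illustrative, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (assumed unbiased estimator).

    n: number of generated samples for the problem
    c: number of those samples that pass every test case
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1. comb() returns 0 when k > n - c, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

Obtaining c requires running every sample against the test suite, which is the overhead that motivates test-free metrics such as the one proposed here.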

Related papers

An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks [15.95854961699971]
We present SWE-Judge, the first evaluation metric for LLM-as-Ensemble-Judge. SWE-Judge first defines five distinct evaluation strategies, each implemented as an independent judge. A dynamic team selection mechanism then identifies the most appropriate subset of judges to produce a final correctness score.
arXiv Detail & Related papers (2025-05-27T08:04:34Z)
CROC: Evaluating and Training T2I Metrics with Pseudo- and Human-Labeled Contrastive Robustness Checks [46.89839054706183]
We propose CROC: a framework for automated Contrastive Robustness Checks. We generate a pseudo-labeled dataset of over one million contrastive prompt-image pairs. We also use the dataset to train CROCScore, a new metric that achieves state-of-the-art performance among open-source methods.
arXiv Detail & Related papers (2025-05-16T14:39:44Z)
PIER: A Novel Metric for Evaluating What Matters in Code-Switching [15.370845263369347]
Code-switching is a significant challenge for Automatic Speech Recognition. General metrics such as Word-Error-Rate (WER) are commonly used to measure performance. We propose Point-of-Interest Error Rate (PIER), a variant of WER that focuses only on specific words of interest.
arXiv Detail & Related papers (2025-01-16T12:57:33Z)
Can Large Language Models Serve as Evaluators for Code Summarization? [47.21347974031545]
Large Language Models (LLMs) serve as effective evaluators for code summarization methods. LLMs prompt an agent to play diverse roles, such as code reviewer, code author, code editor, and system analyst. CODERPE achieves an 81.59% Spearman correlation with human evaluations, outperforming the existing BERTScore metric by 17.27%.
arXiv Detail & Related papers (2024-12-02T09:56:18Z)
Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings [77.20838441870151]
Commit message generation is a crucial task in software engineering that is challenging to evaluate correctly. We use an online metric - the number of edits users introduce before committing the generated messages to the VCS - to select metrics for offline experiments. Our results indicate that edit distance exhibits the highest correlation, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation.
arXiv Detail & Related papers (2024-10-15T20:32:07Z)
Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework. Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
On the Limitations of Embedding Based Methods for Measuring Functional Correctness for Code Generation [4.065344017083881]
We analyze the ability of embedding-based metrics like CodeBERTScore to measure functional correctness and other helpful constructs like editing effort. Our results show that while they have a weak correlation with functional correctness (0.16), they are strongly correlated (0.72) with editing effort.
arXiv Detail & Related papers (2024-04-26T15:54:39Z)
Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134]
We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind. A final factuality score is computed by an adjustable scoring mechanism. Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T17:59:42Z)
ICE-Score: Instructing Large Language Models to Evaluate Code [7.556444391696562]
We propose ICE-Score, a new evaluation metric via instructing large language models for code assessments. Our metric addresses the limitations of existing approaches by achieving superior correlations with functional correctness and human preferences. Our results demonstrate that our metric surpasses state-of-the-art metrics for code generation.
arXiv Detail & Related papers (2023-04-27T16:38:17Z)
CodeScore: Evaluating Code Generation by Learning Code Execution [34.08307174529496]
We propose CodeScore, a large language model (LLM)-based code evaluation metric (CEM), which estimates the functional correctness of generated code on three input formats. CodeScore achieves an absolute improvement of up to 58.87% in correlation with functional correctness compared to other CEMs, achieves state-of-the-art performance, and effectively handles three input formats.
arXiv Detail & Related papers (2023-01-22T02:59:59Z)
T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available. We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone. T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
arXiv Detail & Related papers (2022-12-12T06:29:04Z)
ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval. We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics. Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation. It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
CodeBLEU: a Method for Automatic Evaluation of Code Synthesis [57.87741831987889]
In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy. We introduce a new automatic evaluation metric, dubbed CodeBLEU. It absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow.
arXiv Detail & Related papers (2020-09-22T03:10:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.