Related papers: Who Evaluates the Evaluators? On Automatic Metrics for Assessing AI-based Offensive Code Generators

URL: http://arxiv.org/abs/2212.06008v3
Date: Thu, 13 Apr 2023 11:25:00 GMT
Title: Who Evaluates the Evaluators? On Automatic Metrics for Assessing AI-based Offensive Code Generators
Authors: Pietro Liguori, Cristina Improta, Roberto Natella, Bojan Cukic, and Domenico Cotroneo
Abstract summary: Code generators are an emerging solution for automatically writing programs starting from descriptions in natural language. In particular, code generators have been used for ethical hacking and offensive security testing by generating proof-of-concept attacks. This work analyzes a large set of output similarity metrics on offensive code generators.
Score: 1.7616042687330642
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AI-based code generators are an emerging solution for automatically writing programs starting from descriptions in natural language, by using deep neural networks (Neural Machine Translation, NMT). In particular, code generators have been used for ethical hacking and offensive security testing by generating proof-of-concept attacks. Unfortunately, the evaluation of code generators still faces several issues. The current practice uses output similarity metrics, i.e., automatic metrics that compute the textual similarity of generated code with ground-truth references. However, it is not clear what metric to use, and which metric is most suitable for specific contexts. This work analyzes a large set of output similarity metrics on offensive code generators. We apply the metrics on two state-of-the-art NMT models using two datasets containing offensive assembly and Python code with their descriptions in the English language. We compare the estimates from the automatic metrics with human evaluation and provide practical insights into their strengths and limitations.
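To make the notion of an output similarity metric concrete, here is a minimal sketch in Python. It is illustrative only and is not the metric suite evaluated in the paper: the two functions, the whitespace tokenization, and the assembly snippet pair are all assumptions chosen for brevity. It scores a generated snippet against a ground-truth reference with an exact-match check and with a token-overlap ratio from the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def exact_match(generated: str, reference: str) -> float:
    """Return 1.0 if both snippets have the same whitespace-separated tokens, else 0.0."""
    return float(generated.split() == reference.split())


def token_similarity(generated: str, reference: str) -> float:
    """Return a score in [0, 1] based on the matching token blocks found by SequenceMatcher."""
    return SequenceMatcher(None, generated.split(), reference.split()).ratio()


# Hypothetical reference/prediction pair, not taken from the paper's datasets.
reference = "xor eax, eax"
prediction = "xor ebx, ebx"

em = exact_match(prediction, reference)
sim = token_similarity(prediction, reference)
print(f"exact match: {em}, token similarity: {sim:.2f}")
```

Both scores are purely textual: the prediction above is penalized for using a different register even though it has the same form as the reference, which is the kind of gap between automatic scores and human judgment that motivates comparing the two.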

Related papers

Generating Unseen Code Tests In Infinitum [1.0674604700001968]
We present a method for creating benchmark variations that generalize across coding tasks and programming languages. We implement one benchmark, called auto-regression, for the task of text-to-code generation in Python.
arXiv Detail & Related papers (2024-07-29T08:11:20Z)
Leveraging Generative AI: Improving Software Metadata Classification with Generated Code-Comment Pairs [0.0]
In software development, code comments play a crucial role in enhancing code comprehension and collaboration. This research paper addresses the challenge of objectively classifying code comments as "Useful" or "Not Useful". We propose a novel solution that harnesses contextualized embeddings, particularly BERT, to automate this classification process.
arXiv Detail & Related papers (2023-10-14T12:09:43Z)
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics. We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores. Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available. We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone. T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
arXiv Detail & Related papers (2022-12-12T06:29:04Z)
Interactive Code Generation via Test-Driven User-Intent Formalization [60.90035204567797]
Large language models (LLMs) produce code from informal natural language (NL) intent. It is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics. We describe a language-agnostic abstract algorithm and a concrete implementation TiCoder.
arXiv Detail & Related papers (2022-08-11T17:41:08Z)
Quality-Aware Decoding for Neural Machine Translation [64.24934199944875]
We propose quality-aware decoding for neural machine translation (NMT). We leverage recent breakthroughs in reference-free and reference-based MT evaluation through various inference methods. We find that quality-aware decoding consistently outperforms MAP-based decoding according both to state-of-the-art automatic metrics and to human assessments.
arXiv Detail & Related papers (2022-05-02T15:26:28Z)
Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand [117.62186420147563]
We propose a generalization of leaderboards, bidimensional leaderboards (Billboards). Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries. We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation.
arXiv Detail & Related papers (2021-12-08T06:34:58Z)
CGEMs: A Metric Model for Automatic Code Generation using GPT-3 [0.0]
This work aims to validate AI-generated content using theoretical proofs or Monte-Carlo simulation methods. In this case, we use the latter approach to test and validate a statistically significant number of samples. The metrics gathered in this work to support the evaluation of generated code are: compilation, NL description to logic conversion, number of edits needed, and some commonly used static-code and NLP metrics.
arXiv Detail & Related papers (2021-08-23T13:28:57Z)
Retrieve and Refine: Exemplar-based Neural Comment Generation [27.90756259321855]
Comments of similar code snippets are helpful for comment generation. We design a novel seq2seq neural network that takes the given code, its AST, its similar code, and its exemplar as input. We evaluate our approach on a large-scale Java corpus, which contains about 2M samples.
arXiv Detail & Related papers (2020-10-09T09:33:10Z)
Incorporating External Knowledge through Pre-training for Natural Language to Code Generation [97.97049697457425]
Open-domain code generation aims to generate code in a general-purpose programming language from natural language (NL) intents. We explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation. Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa.
arXiv Detail & Related papers (2020-04-20T01:45:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.