CGEMs: A Metric Model for Automatic Code Generation using GPT-3
- URL: http://arxiv.org/abs/2108.10168v1
- Date: Mon, 23 Aug 2021 13:28:57 GMT
- Title: CGEMs: A Metric Model for Automatic Code Generation using GPT-3
- Authors: Aishwarya Narasimhan (1), Krishna Prasad Agara Venkatesha Rao (2),
Veena M B (1) ((1) B M S College of Engineering, (2) Sony India Software
Centre Pvt. Ltd.)
- Abstract summary: This work aims to validate AI-generated content either through theoretical proofs or through Monte-Carlo simulation methods.
In this case, we use the latter approach to test/validate a statistically significant number of samples.
The various metrics that are garnered in this work to support the evaluation of generated code are as follows: Compilation, NL description to logic conversion, number of edits needed, some of the commonly used static-code metrics and NLP metrics.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Today, AI technology is showing its strengths in almost every industry and
walk of life. From text generation and summarization to chatbots, NLP is being
used widely. One such paradigm is automatic code generation. An AI could be
generating anything; hence the output space is unconstrained. A self-driving
car is driven for 100 million miles to validate its safety, but tests cannot be
written to monitor and cover an unconstrained space. One solution for validating
AI-generated content is to constrain the problem and convert it from
abstract to realistic, and this can be accomplished either by validating the
unconstrained algorithm using theoretical proofs or by using Monte-Carlo
simulation methods. In this work, we use the latter approach to test and validate a
statistically significant number of samples. Validating AI-generated code is the
main motive of this work, and to determine whether AI-generated code is reliable,
a metric model, CGEMs, is proposed. This is an extremely challenging task because
programs can implement the same task with different logic and naming conventions,
yet the metrics must capture the structure and logic of the program. This is
similar to the importance grammar carries in AI-based text generation, Q&A,
translation, etc. The metrics gathered in this work to support the evaluation of
generated code are: compilation, NL-description-to-logic conversion, number of
edits needed, several commonly used static-code metrics, and NLP metrics. These
metrics are applied to 80 code samples generated using OpenAI's GPT-3, after which
a neural network is designed for binary classification (acceptable/not-acceptable
quality of the generated code). The inputs to this network are the feature values
obtained from the metrics. The model achieves a classification accuracy of
76.92% and an F1 score of 55.56%. XAI techniques are applied for model
interpretability.
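The pipeline described in the abstract, per-sample metric values serving as feature vectors for a small binary classifier, can be illustrated with a minimal sketch. The specific features below (compilation success, line, function, and branch counts) and the scikit-learn classifier are assumptions chosen for brevity, not the CGEMs implementation.

```python
# Minimal sketch (an assumption, not the authors' implementation): derive a few
# illustrative metric features from a generated Python snippet and train a
# small neural network to label it acceptable / not acceptable.
import ast
import numpy as np
from sklearn.neural_network import MLPClassifier

def extract_features(code):
    """Return [compiles, line_count, function_count, branch_count] for one snippet."""
    lines = float(len(code.splitlines()))
    try:
        tree = ast.parse(code)  # stand-in for the paper's compilation metric
    except SyntaxError:
        return [0.0, lines, 0.0, 0.0]
    functions = sum(isinstance(n, ast.FunctionDef) for n in ast.walk(tree))
    branches = sum(isinstance(n, (ast.If, ast.For, ast.While)) for n in ast.walk(tree))
    return [1.0, lines, float(functions), float(branches)]

# Toy data: in the paper the features would come from the full CGEMs metric
# suite applied to the 80 GPT-3 generated programs, with human acceptability labels.
snippets = ["def add(a, b):\n    return a + b",
            "def add(a, b) return a + b"]  # second snippet does not compile
X = np.array([extract_features(s) for s in snippets])
y = np.array([1, 0])  # 1 = acceptable, 0 = not acceptable

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X, y)
print(clf.predict(X))
```

In the paper the feature vector is richer (edit counts, static-code and NLP metrics), but the overall shape, metrics in, binary acceptability out, is the same.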
Related papers
- Benchmarking Large Language Models with Integer Sequence Generation Tasks [1.3108652488669736]
This paper presents a novel benchmark where the large language model (LLM) must write code that computes integer sequences from the Online Encyclopedia of Integer Sequences (OEIS).
Our benchmark reveals that the o1 series of models outperform other frontier models from OpenAI, Anthropic, Meta, and Google in accuracy and cheating rates across both easy and hard integer sequences.
arXiv Detail & Related papers (2024-11-07T02:05:43Z)
- An Empirical Study on Automatically Detecting AI-Generated Source Code: How Far Are We? [8.0988059417354]
We propose a range of approaches to improve the performance of AI-generated code detection.
Our best model outperforms the state-of-the-art AI-generated code detector (GPTSniffer) and achieves an F1 score of 82.55.
arXiv Detail & Related papers (2024-11-06T22:48:18Z)
- Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z)
- GEC-DePenD: Non-Autoregressive Grammatical Error Correction with Decoupled Permutation and Decoding [52.14832976759585]
Grammatical error correction (GEC) is an important NLP task that is usually solved with autoregressive sequence-to-sequence models.
We propose a novel non-autoregressive approach to GEC that decouples the architecture into a permutation network and a decoding network.
We show that the resulting network improves over previously known non-autoregressive methods for GEC.
arXiv Detail & Related papers (2023-11-14T14:24:36Z)
- Leveraging Generative AI: Improving Software Metadata Classification with Generated Code-Comment Pairs [0.0]
In software development, code comments play a crucial role in enhancing code comprehension and collaboration.
This research paper addresses the challenge of objectively classifying code comments as "Useful" or "Not Useful".
We propose a novel solution that harnesses contextualized embeddings, particularly BERT, to automate this classification process.
arXiv Detail & Related papers (2023-10-14T12:09:43Z)
- Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for the detection of LLMs-generated codes.
We find that existing training-based or zero-shot text detectors are ineffective in detecting code.
Our method exhibits robustness against revision attacks and generalizes well to Java codes.
arXiv Detail & Related papers (2023-10-08T10:08:21Z)
- Is this Snippet Written by ChatGPT? An Empirical Study with a CodeBERT-Based Classifier [13.613735709997911]
This paper presents an empirical study to investigate the feasibility of automated identification of AI-generated code snippets.
We propose a novel approach called GPTSniffer, which builds on top of CodeBERT to detect source code written by AI.
The results show that GPTSniffer can accurately classify whether code is human-written or AI-generated, and outperforms two baselines.
arXiv Detail & Related papers (2023-07-18T16:01:15Z)
- Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense [56.077252790310176]
We present a paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering.
Using DIPPER to paraphrase text generated by three large language models (including GPT3.5-davinci-003) successfully evades several detectors, including watermarking.
We introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider.
arXiv Detail & Related papers (2023-03-23T16:29:27Z)
- Who Evaluates the Evaluators? On Automatic Metrics for Assessing AI-based Offensive Code Generators [1.7616042687330642]
Code generators are an emerging solution for automatically writing programs starting from descriptions in natural language.
In particular, code generators have been used for ethical hacking and offensive security testing by generating proof-of-concept attacks.
This work analyzes a large set of output similarity metrics on offensive code generators.
arXiv Detail & Related papers (2022-12-12T16:16:09Z)
- Interactive Code Generation via Test-Driven User-Intent Formalization [60.90035204567797]
Large language models (LLMs) produce code from informal natural language (NL) intent.
It is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics.
We describe a language-agnostic abstract algorithm and a concrete implementation TiCoder.
arXiv Detail & Related papers (2022-08-11T17:41:08Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
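Several of the papers above, APPS and TiCoder in particular, evaluate generated code by functional correctness: a candidate program is run against hidden test cases and scored by the fraction it passes. The sketch below illustrates that pass-rate computation; the file name, I/O convention, and timeout are assumptions for illustration, not any benchmark's official harness.

```python
# Minimal sketch (assumed harness): score a candidate program by the fraction
# of hidden input/output test cases it passes.
import subprocess
import sys

def pass_rate(solution_path, test_cases):
    """test_cases is a list of (stdin_text, expected_stdout) pairs."""
    passed = 0
    for stdin_text, expected_stdout in test_cases:
        try:
            result = subprocess.run(
                [sys.executable, solution_path],
                input=stdin_text, capture_output=True, text=True, timeout=5,
            )
        except subprocess.TimeoutExpired:
            continue  # a hanging program counts as a failed test
        if result.returncode == 0 and result.stdout.strip() == expected_stdout.strip():
            passed += 1
    return passed / len(test_cases)

# Hypothetical usage: two I/O pairs for a program expected to double an integer.
tests = [("2\n", "4\n"), ("10\n", "20\n")]
print(pass_rate("candidate.py", tests))
```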