Assured LLM-Based Software Engineering
- URL: http://arxiv.org/abs/2402.04380v1
- Date: Tue, 6 Feb 2024 20:38:46 GMT
- Title: Assured LLM-Based Software Engineering
- Authors: Nadia Alshahwan, Mark Harman, Inna Harper, Alexandru Marginean, Shubho Sengupta, Eddy Wang
- Abstract summary: This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
- Score: 51.003878077888686
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper we address the following question: How can we use Large
Language Models (LLMs) to improve code independently of a human, while ensuring
that the improved code:
- does not regress the properties of the original code?
- improves the original in a verifiable and measurable way?
To address this question, we advocate Assured LLM-Based Software Engineering;
a generate-and-test approach, inspired by Genetic Improvement. Assured LLMSE
applies a series of semantic filters that discard code that fails to meet these
twin guarantees. This overcomes the potential problem of LLMs' propensity to
hallucinate. It allows us to generate code using LLMs, independently of any
human. The human plays only the role of final code reviewer, as they would
for code written by other human engineers.
This paper is an outline of the content of the keynote by Mark Harman at the
International Workshop on Interpretability, Robustness, and Benchmarking in
Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
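To make the generate-and-test idea above concrete, the following is a minimal sketch of such a loop, assuming hypothetical helpers (an LLM proposal generator, a build check, the existing regression test suite, and a measurable cost such as latency); it illustrates the role of the semantic filters and is not the implementation described in the paper.

```python
# Minimal sketch of a generate-and-test loop with semantic filters, in the
# spirit of Assured LLMSE. Every helper passed in below is a hypothetical
# placeholder, not an API from the paper.
from typing import Callable, Iterable, Optional

def assured_improvement_loop(
    original_code: str,
    propose: Callable[[str], Iterable[str]],        # LLM-generated candidates (assumed)
    compiles: Callable[[str], bool],                 # syntactic / build filter
    passes_regression_tests: Callable[[str], bool],  # "does not regress" filter
    measured_cost: Callable[[str], float],           # e.g. latency or memory of a benchmark run
) -> Optional[str]:
    """Return an LLM-proposed variant only if it survives every filter and
    measurably improves on the original; otherwise return None."""
    baseline = measured_cost(original_code)
    for candidate in propose(original_code):
        if not compiles(candidate):
            continue   # discard hallucinated or broken code outright
        if not passes_regression_tests(candidate):
            continue   # guarantee 1: no regression of the original behaviour
        if measured_cost(candidate) >= baseline:
            continue   # guarantee 2: the improvement must be measurable
        return candidate   # candidate survives all semantic filters
    return None   # no assured improvement; keep the original code
```

A candidate that survives such filters would still be shown to a human, but only as a final code reviewer, as the abstract notes.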
Related papers
- Artificial-Intelligence Generated Code Considered Harmful: A Road Map for Secure and High-Quality Code Generation [2.793781561647737]
We compared the security and quality of human-written code with that of LLM-generated code.
We found that LLMs can generate incorrect code that fails to implement the required functionality.
Fuzzing has revealed that LLM-generated code is more prone to hangs and crashes than human-written code.
arXiv Detail & Related papers (2024-09-27T23:41:51Z)
- InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct [43.7550233177368]
We propose INVERSE-INSTRUCT, which summarizes instructions from code snippets instead of the reverse.
We present a series of code LLMs named InverseCoder, which surpasses the performance of the original code LLMs on a wide range of benchmarks.
arXiv Detail & Related papers (2024-07-08T08:00:05Z)
- VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
- Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
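As a rough illustration of this rewriting idea (not the authors' detector), one might score a snippet by how little it changes when an LLM rewrites it; the `rewrite_with_llm` callable, the sequence-similarity measure, and the threshold below are all assumptions.

```python
# Illustrative sketch of rewriting-based synthetic-code detection: the
# intuition is that LLM-generated code changes less under LLM rewriting,
# so high code/rewrite similarity is treated as evidence of synthesis.
# `rewrite_with_llm` is a hypothetical stand-in for an LLM call.
import difflib
from statistics import mean
from typing import Callable, List

def rewrite_similarity_score(
    code: str,
    rewrite_with_llm: Callable[[str], str],
    num_variants: int = 4,
) -> float:
    """Average textual similarity between the code and several LLM rewrites."""
    variants: List[str] = [rewrite_with_llm(code) for _ in range(num_variants)]
    return mean(
        difflib.SequenceMatcher(None, code, variant).ratio()
        for variant in variants
    )

def looks_llm_generated(code: str,
                        rewrite_with_llm: Callable[[str], str],
                        threshold: float = 0.8) -> bool:
    # The threshold is illustrative; a real detector would calibrate it on data.
    return rewrite_similarity_score(code, rewrite_with_llm) >= threshold
```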
arXiv Detail & Related papers (2024-05-25T08:57:28Z)
- InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
InfiBench is the first large-scale freeform question-answering (QA) benchmark for code to our knowledge.
It comprises 234 carefully selected, high-quality Stack Overflow questions that span 15 programming languages.
We conduct a systematic evaluation of over 100 of the latest code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
- StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks.
FGO only optimizes the model by masking the unexecuted code segments to provide Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
- Human-Instruction-Free LLM Self-Alignment with Limited Samples [64.69906311787055]
We propose an algorithm that can self-align large language models (LLMs) iteratively without active human involvement.
Unlike existing works, our algorithm relies on neither human-crafted instructions nor labeled rewards, significantly reducing human involvement.
We show that our method can unlock the LLMs' self-generalization ability to perform alignment with near-zero human supervision.
arXiv Detail & Related papers (2024-01-06T14:00:12Z)
- DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions [27.489622263456983]
We introduce DeceptPrompt, an algorithm that can generate adversarial natural language instructions that drive Code LLMs to generate functionally correct code with vulnerabilities.
When the optimized prefix/suffix is applied, the attack success rate (ASR) improves by an average of 50% compared with applying no prefix/suffix.
arXiv Detail & Related papers (2023-12-07T22:19:06Z)
- Can ChatGPT replace StackOverflow? A Study on Robustness and Reliability of Large Language Model Code Generation [8.575560293086289]
Large language models (LLMs) have shown extraordinary ability in understanding natural language and generating programming code.
The misuse of APIs in the generated code could lead to severe problems, such as resource leaks and program crashes.
arXiv Detail & Related papers (2023-08-20T18:36:28Z)
- Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation [20.45045253933097]
We propose EvalPlus -- a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code.
EvalPlus augments a given evaluation dataset with large amounts of test-cases newly produced by an automatic test input generator.
We show that HumanEval+ is able to catch significant amounts of previously undetected wrong code.
arXiv Detail & Related papers (2023-05-02T05:46:48Z)
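To illustrate the kind of test augmentation EvalPlus performs, here is a rough sketch under assumed helpers (not the EvalPlus implementation): extra inputs are generated automatically and an LLM-written solution is checked differentially against a trusted reference solution.

```python
# Rough sketch of test-augmented correctness checking in the spirit of
# EvalPlus: inputs are generated automatically and the candidate must agree
# with a trusted reference on all of them. `generate_inputs` stands in for
# an automatic test input generator and is an assumption, not EvalPlus code.
import random
from typing import Any, Callable, Iterable

def generate_inputs(num: int = 1000) -> Iterable[Any]:
    # Hypothetical generator: random integers as example inputs.
    for _ in range(num):
        yield random.randint(-10**6, 10**6)

def functionally_correct(
    candidate: Callable[[Any], Any],
    reference: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> bool:
    """Differential testing: the candidate must match the reference output
    on every generated input; any mismatch or crash counts as wrong code."""
    for x in inputs:
        try:
            expected = reference(x)
        except Exception:
            continue  # skip inputs the reference itself rejects
        try:
            if candidate(x) != expected:
                return False
        except Exception:
            return False
    return True

# Example: checking a hand-written absolute-value function against abs().
candidate_abs = lambda n: n if n > 0 else -n
assert functionally_correct(candidate_abs, abs, generate_inputs())
```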