COCO: Testing Code Generation Systems via Concretized Instructions
- URL: http://arxiv.org/abs/2308.13319v1
- Date: Fri, 25 Aug 2023 11:49:27 GMT
- Title: COCO: Testing Code Generation Systems via Concretized Instructions
- Authors: Ming Yan, Junjie Chen, Jie M. Zhang, Xuejie Cao, Chen Yang, Mark
Harman
- Abstract summary: COCO is a technique to test the robustness of code generation systems.
It exploits the usage scenario of code generation systems to make the original programming instruction more concrete.
We evaluated COCO on eight advanced code generation systems, including commercial tools such as Copilot and ChatGPT.
- Score: 33.13427092832396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code generation systems have been extensively developed in recent years to
generate source code based on natural language instructions. However, despite
their advancements, these systems still face robustness issues where even
slightly different instructions can result in significantly different code
semantics. Robustness is critical for code generation systems, as it can have
significant impacts on software development, software quality, and trust in the
generated code. Although existing testing techniques for general text-to-text
software can detect some robustness issues, they are limited in effectiveness
due to ignoring the characteristics of code generation systems. In this work,
we propose a novel technique COCO to test the robustness of code generation
systems. It exploits the usage scenario of code generation systems to make the
original programming instruction more concrete by incorporating features known
to be contained in the original code. A robust system should maintain code
semantics for the concretized instruction, and COCO detects robustness
inconsistencies when it does not. We evaluated COCO on eight advanced code
generation systems, including commercial tools such as Copilot and ChatGPT,
using two widely-used datasets. Our results demonstrate the effectiveness of
COCO in testing the robustness of code generation systems, outperforming two
techniques adopted from general text-to-text software testing by 466.66% and
104.02%, respectively. Furthermore, concretized instructions generated by COCO
can help reduce robustness inconsistencies by 18.35% to 53.91% through
fine-tuning.
Related papers
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z) - CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z) - SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents [10.730852617039451]
We investigate the capability of LLM-based Code Agents to formalize user issues into test cases.
We propose a novel benchmark based on popular GitHub repositories, containing real-world issues, ground-truth bug-fixes, and golden tests.
We find that LLMs generally perform surprisingly well at generating relevant test cases, with Code Agents designed for code repair exceeding the performance of systems designed for test generation.
arXiv Detail & Related papers (2024-06-18T14:54:37Z) - CoCoST: Automatic Complex Code Generation with Online Searching and Correctness Testing [51.00909683314142]
Large Language Models have revolutionized code generation ability by converting natural language descriptions into executable code.
CoCoST framework enhances complex code generation by online searching for more information with planned queries and correctness testing for code refinement.
CoCoST is validated through rigorous experiments on the DS-1000 and ClassEval datasets.
arXiv Detail & Related papers (2024-03-20T13:33:55Z) - Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers [14.018844722021896]
We study the specific patterns that characterize machine- and human-authored code.
We propose DetectCodeGPT, a novel method for detecting machine-generated code.
arXiv Detail & Related papers (2024-01-12T09:15:20Z) - No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT [28.68768157452352]
This study examines the quality of code generation using ChatGPT.
We leverage 728 algorithm problems in five languages (i.e., C, C++, Java, Python, and JavaScript) and 18 CWEs with 54 code scenarios for the code generation task.
Our findings uncover potential issues and limitations that arise in the ChatGPT-based code generation.
arXiv Detail & Related papers (2023-08-09T10:01:09Z) - Execution-based Code Generation using Deep Reinforcement Learning [8.085533911328577]
PPOCoder is a new framework for code generation that combines pre-trained PL models with Proximal Policy Optimization.
PPOCoder seamlessly integrates external code-specific knowledge into the model optimization process.
It's important to note that PPOCoder is a task-agnostic and model-agnostic framework that can be used across different code generation tasks and PLs.
arXiv Detail & Related papers (2023-01-31T18:02:26Z) - ReCode: Robustness Evaluation of Code Generation Models [90.10436771217243]
We propose ReCode, a comprehensive robustness evaluation benchmark for code generation models.
We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format.
With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt.
arXiv Detail & Related papers (2022-12-20T14:11:31Z) - Compilable Neural Code Generation with Compiler Feedback [43.97362484564799]
This paper proposes a three-stage pipeline for compilable code generation, including language model fine-tuning, compilability reinforcement, and compilability discrimination.
Experiments on two code generation tasks demonstrate the effectiveness of our proposed approach, improving the success rate of compilation from 44.18 to 89.18 on average and from 70.3 to 96.2 in text-to-code generation, respectively.
arXiv Detail & Related papers (2022-03-10T03:15:17Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.