Mutation-based Consistency Testing for Evaluating the Code Understanding
Capability of LLMs
- URL: http://arxiv.org/abs/2401.05940v1
- Date: Thu, 11 Jan 2024 14:27:43 GMT
- Title: Mutation-based Consistency Testing for Evaluating the Code Understanding
Capability of LLMs
- Authors: Ziyu Li, Donghwan Shin
- Abstract summary: Large Language Models (LLMs) have shown remarkable capabilities in processing both natural and programming languages.
We propose a novel method to assess the code understanding performance of LLMs, particularly focusing on subtle differences between code and its descriptions.
We apply different types of code mutations, such as operator replacement and statement deletion, to generate inconsistent code-description pairs.
We conduct a case study on the two popular LLMs, GPT-3.5 and GPT-4, using the state-of-the-art code generation benchmark, HumanEval-X.
- Score: 5.549095839198671
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have shown remarkable capabilities in processing
both natural and programming languages, which have enabled various applications
in software engineering, such as requirement engineering, code generation, and
software testing. However, existing code generation benchmarks do not
necessarily assess the code understanding performance of LLMs, especially for
the subtle inconsistencies that may arise between code and its semantics
described in natural language.
In this paper, we propose a novel method to systematically assess the code
understanding performance of LLMs, particularly focusing on subtle differences
between code and its descriptions, by introducing code mutations to existing
code generation datasets. Code mutations are small changes that alter the
semantics of the original code, creating a mismatch with the natural language
description. We apply different types of code mutations, such as operator
replacement and statement deletion, to generate inconsistent code-description
pairs. We then use these pairs to test the ability of LLMs to correctly detect
the inconsistencies.
We propose a new LLM testing method, called Mutation-based Consistency
Testing (MCT), and conduct a case study on the two popular LLMs, GPT-3.5 and
GPT-4, using the state-of-the-art code generation benchmark, HumanEval-X, which
consists of six programming languages (Python, C++, Java, Go, JavaScript, and
Rust). We compare the performance of the LLMs across different types of code
mutations and programming languages and analyze the results. We find that the
LLMs show significant variation in their code understanding performance and
that they have different strengths and weaknesses depending on the mutation
type and language.
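The pair-generation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `sum_positive`, its description, the two transformer classes, and the record layout are all hypothetical, and the step that asks an LLM whether code and description agree is omitted.

```python
import ast

# Hypothetical description/code pair standing in for one benchmark item.
DESCRIPTION = "Return the sum of all positive numbers in xs."
ORIGINAL = '''
def sum_positive(xs):
    total = 0
    for x in xs:
        if x > 0:
            total += x
    return total
'''


class SwapComparison(ast.NodeTransformer):
    """Operator replacement: rewrite every `>` comparison as `<`."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Lt() if isinstance(op, ast.Gt) else op for op in node.ops]
        return node


class DeleteAugAssign(ast.NodeTransformer):
    """Statement deletion: replace each augmented assignment with `pass`
    so the enclosing block stays syntactically valid."""

    def visit_AugAssign(self, node):
        return ast.copy_location(ast.Pass(), node)


def mutate(source, transformer):
    """Apply one AST transformer and return the mutated source text."""
    tree = transformer.visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


def make_pairs(description, source):
    """Build one consistent pair and two inconsistent (mutated) pairs."""
    return [
        {"mutation": "none", "description": description,
         "code": ast.unparse(ast.parse(source)), "consistent": True},
        {"mutation": "operator_replacement", "description": description,
         "code": mutate(source, SwapComparison()), "consistent": False},
        {"mutation": "statement_deletion", "description": description,
         "code": mutate(source, DeleteAugAssign()), "consistent": False},
    ]


def run(source, func_name, *args):
    """Execute the source and call the named function (sanity check only)."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name](*args)


if __name__ == "__main__":
    for pair in make_pairs(DESCRIPTION, ORIGINAL):
        print(pair["mutation"], "->", run(pair["code"], "sum_positive", [1, -2, 3]))
```

Each record keeps the unchanged description next to the (possibly mutated) code, so the `consistent` flag can serve as the expected answer when a model is asked whether the two match. Running the mutants on a sample input is only a sanity check that the mutation actually changed behavior; a mutant that behaves identically to the original would not be a valid inconsistent pair.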
Related papers
- Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z)
- Source Code Summarization in the Era of Large Language Models [23.715005053430957]
Large language models (LLMs) have led to a great boost in the performance of code-related tasks.
In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs.
arXiv Detail & Related papers (2024-07-09T05:48:42Z)
- An Empirical Study on Capability of Large Language Models in Understanding Code Semantics [4.638578225024275]
Large Language Models for Code (code LLMs) have demonstrated remarkable performance across various software engineering (SE) tasks.
This paper introduces EMPICA, a framework designed to evaluate the capabilities of code LLMs in understanding code semantics.
arXiv Detail & Related papers (2024-07-04T03:40:58Z)
- Where Do Large Language Models Fail When Generating Code? [10.519984835232359]
Large Language Models (LLMs) have shown great potential in code generation.
It is unclear what kinds of code generation errors LLMs can make.
We analyzed incorrect code snippets generated by six popular LLMs on the HumanEval dataset.
arXiv Detail & Related papers (2024-06-13T01:29:52Z)
- Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
arXiv Detail & Related papers (2024-05-25T08:57:28Z)
- Assured LLM-Based Software Engineering [51.003878077888686]
This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
arXiv Detail & Related papers (2024-02-06T20:38:46Z)
- Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs [65.2379940117181]
We introduce code prompting, a chain of prompts that transforms a natural language problem into code.
We find that code prompting exhibits a high-performance boost for multiple LLMs.
Our analysis of GPT 3.5 reveals that the code formatting of the input problem is essential for performance improvement.
arXiv Detail & Related papers (2024-01-18T15:32:24Z)
- If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
- Testing LLMs on Code Generation with Varying Levels of Prompt Specificity [0.0]
Large language models (LLMs) have demonstrated unparalleled prowess in mimicking human-like text generation and processing.
The potential to transform natural language prompts into executable code promises a major shift in software development practices.
arXiv Detail & Related papers (2023-11-10T23:41:41Z)
- Test-Case-Driven Programming Understanding in Large Language Models for Better Code Generation [15.166827643436346]
muFiX is a novel prompting technique to improve the code generation performance of large language models (LLMs).
It first exploits test case analysis to obtain specification understanding and enables a self-improvement process.
muFiX then refines the specification understanding so as to reduce the gap between the provided understanding and the actual understanding.
arXiv Detail & Related papers (2023-09-28T02:58:07Z)
- LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z)