CodeUpdateArena: Benchmarking Knowledge Editing on API Updates
- URL: http://arxiv.org/abs/2407.06249v1
- Date: Mon, 8 Jul 2024 17:55:04 GMT
- Title: CodeUpdateArena: Benchmarking Knowledge Editing on API Updates
- Authors: Zeyu Leo Liu, Shrey Pandit, Xi Ye, Eunsol Choi, Greg Durrett
- Abstract summary: We present CodeUpdateArena, a benchmark for knowledge editing in the code domain.
An instance in our benchmark consists of a synthetic API function update paired with a program synthesis example.
Our benchmark covers updates of various types to 54 functions from seven diverse Python packages.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly being used to synthesize and reason about source code. However, the static nature of these models' knowledge does not reflect the fact that libraries and API functions they invoke are continuously evolving, with functionality being added or changing. While numerous benchmarks evaluate how LLMs can generate code, no prior work has studied how an LLM's knowledge about code API functions can be updated. To fill this gap, we present CodeUpdateArena, a benchmark for knowledge editing in the code domain. An instance in our benchmark consists of a synthetic API function update paired with a program synthesis example that uses the updated functionality; our goal is to update an LLM to be able to solve this program synthesis example without providing documentation of the update at inference time. Compared to knowledge editing for facts encoded in text, success here is more challenging: a code LLM must correctly reason about the semantics of the modified function rather than just reproduce its syntax. Our dataset is constructed by first prompting GPT-4 to generate atomic and executable function updates. Then, for each update, we generate program synthesis examples whose code solutions are prone to use the update. Our benchmark covers updates of various types to 54 functions from seven diverse Python packages, with a total of 670 program synthesis examples. Our experiments show that prepending documentation of the update to open-source code LLMs (i.e., DeepSeek, CodeLlama) does not allow them to incorporate changes for problem solving, and existing knowledge editing techniques also have substantial room for improvement. We hope our benchmark will inspire new methods for knowledge updating in code LLMs.
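To make the instance structure described in the abstract concrete, the sketch below shows, under assumed names, what one update-plus-synthesis pair could look like: a synthetic function update that adds a new keyword argument, a program synthesis problem whose natural solution relies on that argument, and hidden tests that check the semantics of the update. The function scale_values, its clamp argument, and the field layout are invented for illustration and are not the benchmark's actual schema.

# Illustrative sketch of one CodeUpdateArena-style instance (hypothetical names;
# the real benchmark's schema and functions may differ).

# 1) A synthetic, executable API update: an existing function gains new behaviour.
def scale_values(values, factor, clamp=None):
    """Hypothetical updated API: the new `clamp` argument caps each scaled value."""
    scaled = [v * factor for v in values]
    if clamp is not None:
        scaled = [min(v, clamp) for v in scaled]
    return scaled

# 2) A program synthesis example whose natural solution relies on the update.
instance = {
    "update_doc": (
        "scale_values(values, factor, clamp=None): the new `clamp` argument "
        "caps every scaled value at the given maximum."
    ),
    "problem": "Scale the sensor readings by 3, but make sure no reading exceeds 10.",
    "reference_solution": lambda readings: scale_values(readings, 3, clamp=10),
    # Hidden tests check the semantics of the update, not just its syntax.
    "tests": [(([1, 2, 5],), [3, 6, 10])],
}

# 3) Evaluation idea: the edited model must produce a solution that passes the tests
#    without seeing `update_doc` at inference time.
for args, expected in instance["tests"]:
    assert instance["reference_solution"](*args) == expected
print("reference solution passes the hidden tests")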
Related papers
- Exploring the Capabilities of LLMs for Code Change Related Tasks [14.261870410238643]
Large language models (LLMs) have shown their effectiveness in code-related tasks.
LLMs focus on general code syntax and semantics rather than the differences between two code versions.
We conduct an empirical study using LLMs with more than 1B parameters on three code-change-related tasks.
arXiv Detail & Related papers (2024-07-03T05:49:18Z) - VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z) - Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs [65.2379940117181]
We introduce code prompting, a chain of prompts that transforms a natural language problem into code.
We find that code prompting exhibits a high-performance boost for multiple LLMs.
Our analysis of GPT 3.5 reveals that the code formatting of the input problem is essential for performance improvement.
arXiv Detail & Related papers (2024-01-18T15:32:24Z) - If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z) - Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions [6.367360745627828]
We introduce a benchmark of code editing tasks and use it to evaluate several cutting-edge LLMs.
Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models.
We introduce a new, carefully curated, permissively licensed training dataset of code editing tasks coupled with natural language instructions.
arXiv Detail & Related papers (2023-12-11T02:27:45Z) - InstructCoder: Instruction Tuning Large Language Models for Code Editing [26.160498475809266]
We explore the use of Large Language Models (LLMs) to edit code based on user instructions.
InstructCoder is the first instruction-tuning dataset designed to adapt LLMs for general-purpose code editing.
Our findings reveal that open-source LLMs fine-tuned on InstructCoder can significantly enhance the accuracy of code edits.
arXiv Detail & Related papers (2023-10-31T10:15:35Z) - CodeT5+: Open Code Large Language Models for Code Understanding and Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z) - Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization) [7.699967852459232]
Developers tend to have a collection of semantic facts in mind, consciously or unconsciously, when working on coding tasks.
One might assume that the powerful multi-layer architecture of transformer-style LLMs makes them inherently capable of this simple level of "code analysis".
We evaluate whether automatically augmenting an LLM's prompt with explicit semantic facts actually helps.
arXiv Detail & Related papers (2023-04-13T20:49:35Z) - Automatically Recommend Code Updates: Are We There Yet? [14.997510035210842]
We present the first evaluation of state-of-the-art CodeLMs for automatically recommending code updates.
Our results reveal that while CodeLMs perform well in settings that ignore temporal information, they struggle in more realistic time-wise scenarios.
Our findings highlight the significant gap between the perceived and actual effectiveness of CodeLMs for real-world code update recommendation.
arXiv Detail & Related papers (2022-09-15T05:07:25Z) - ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach on the code completion task in the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)