GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities
- URL: http://arxiv.org/abs/2507.12367v2
- Date: Mon, 21 Jul 2025 21:56:07 GMT
- Title: GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities
- Authors: Diganta Misra, Nizar Islah, Victor May, Brice Rauby, Zihan Wang, Justine Gehring, Antonio Orvieto, Muawiz Chaudhary, Eilif B. Muller, Irina Rish, Samira Ebrahimi Kahou, Massimo Caccia
- Abstract summary: We introduce GitChameleon 2.0, a novel, meticulously curated dataset comprising 328 Python code completion problems. GitChameleon 2.0 rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation.
- Score: 26.381134558374743
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon 2.0, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon 2.0 rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task, with enterprise models achieving baseline success rates in the 48-51% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon 2.0 enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at https://github.com/mrcabbage972/GitChameleonBenchmark.
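To make the task concrete, below is a minimal, hypothetical sketch of what a version-conditioned code completion problem with an executable unit test can look like. The pandas example, function names, and problem wording are illustrative assumptions and are not drawn from the actual GitChameleon 2.0 dataset.

```python
# Hypothetical illustration of a version-conditioned code completion problem
# in the spirit of GitChameleon 2.0 (the field layout and the pandas example
# are assumptions, not items from the real dataset).

import pandas as pd

# Problem statement (conditioned on a specific library version):
#   "Using pandas==2.2, append a single row to an existing DataFrame."
# In pandas < 2.0, DataFrame.append() was the idiomatic call; it was removed
# in pandas 2.0, so a version-compliant completion must use pd.concat instead.

def add_row(df: pd.DataFrame, row: dict) -> pd.DataFrame:
    """Candidate completion a model is expected to produce for pandas >= 2.0."""
    return pd.concat([df, pd.DataFrame([row])], ignore_index=True)

# Executable unit test used for execution-based grading: the completion only
# counts as correct if it runs and yields the expected result under the
# pinned library version.
def test_add_row():
    df = pd.DataFrame({"a": [1, 2]})
    out = add_row(df, {"a": 3})
    assert out["a"].tolist() == [1, 2, 3]

if __name__ == "__main__":
    test_add_row()
    print("version-conditioned test passed")
```

Grading in this style is purely execution-based: the same completion would fail if the environment pinned an older pandas where the expected API differs, which is what makes the task version-conditioned rather than a generic completion exercise.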
Related papers
- IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs. The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z) - Turning the Tide: Repository-based Code Reflection [52.13709676656648]
We introduce LiveRepoReflection, a benchmark for evaluating code understanding and generation in multi-file repository contexts. It comprises 1,888 rigorously filtered test cases across 6 programming languages to ensure diversity, correctness, and high difficulty. We also create RepoReflection-Instruct, a large-scale, quality-filtered instruction-tuning dataset derived from diverse sources.
arXiv Detail & Related papers (2025-07-14T02:36:27Z) - CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale [39.54772602678732]
This paper introduces CODESYNC, a data engine for identifying outdated code patterns. Building upon CODESYNC, we develop CODESYNCBENCH, a benchmark for assessing Large Language Models' ability to stay synchronized with code evolution.
arXiv Detail & Related papers (2025-02-23T16:46:18Z) - LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation [40.87656746406113]
We introduce LibEvolutionEval, a study of in-line code completion that requires an understanding of library evolution to be performed accurately. We evaluate popular public models and find that public library evolution significantly influences model performance. We explore mitigation methods by studying how retrieved version-specific library documentation and prompting can improve the model's capability in handling fast-evolving packages.
arXiv Detail & Related papers (2024-11-19T21:52:23Z) - GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models [16.6780665807022]
GitChameleon is a novel, manually curated dataset comprising 116 Python code completion problems.
GitChameleon is designed to rigorously assess the ability of modern large language models to generate version-specific code.
arXiv Detail & Related papers (2024-11-05T23:34:06Z) - CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation. We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks. We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z) - VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z) - ReCode: Robustness Evaluation of Code Generation Models [90.10436771217243]
We propose ReCode, a comprehensive robustness evaluation benchmark for code generation models.
We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format.
With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt.
arXiv Detail & Related papers (2022-12-20T14:11:31Z) - ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.