RepoMod-Bench: A Benchmark for Code Repository Modernization via Implementation-Agnostic Testing
- URL: http://arxiv.org/abs/2602.22518v1
- Date: Thu, 26 Feb 2026 01:25:00 GMT
- Title: RepoMod-Bench: A Benchmark for Code Repository Modernization via Implementation-Agnostic Testing
- Authors: Xuefeng Li, Nir Ben-Israel, Yotam Raz, Belal Ahmed, Doron Serebro, Antoine Raux,
- Abstract summary: We introduce a benchmarking framework for repository-level code modernization built on an implementation-agnostic evaluation paradigm. RepoMod-Bench is a benchmark of 21 real-world repositories with standardized interfaces, spanning 8 languages. The benchmark contains 1.6M lines of code (LOC) and 11,616 tests, with repository sizes ranging from 14 to 211K LOC.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The evolution of AI coding agents has shifted the frontier from simple snippet completion to autonomous repository-level engineering. However, evaluating these agents remains ill-posed in general code repository generation, where the lack of deterministic ground truth leads to ambiguous metrics. Code modernization via automated translation offers a more rigorous alternative by providing a fixed ground truth -- the source repository; yet existing benchmarks are limited to small-scale repositories and rely on language-specific unit tests visible to the agent, allowing test-driven overfitting. We address these limitations by introducing a benchmarking framework for repository-level code modernization built on an implementation-agnostic evaluation paradigm. This framework is instantiated through RepoMod-Bench: a benchmark of 21 real-world repositories with standardized interfaces, spanning 8 programming languages. The benchmark contains 1.6M lines of code (LOC) and 11,616 tests, with repository sizes ranging from 14 to 211K LOC. By targeting repositories with standardized interfaces, we utilize an implementation-agnostic test suite to verify functional equivalence between source and target implementations. This black-box approach ensures verification remains consistent across languages, and our environment hides all test suites from agents to prevent test-driven shortcuts. Evaluating four state-of-the-art agent configurations reveals a sharp scaling collapse: average pass rates drop from 91.3% on projects under 10K LOC to 15.3% on projects exceeding 50K LOC. These results demonstrate that autonomous modernization at scale remains a significant open challenge. Our benchmark and code are available at https://github.com/Modelcode-ai/mcode-benchmark.
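The implementation-agnostic, black-box paradigm described in the abstract can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the callable-based interface, the `functionally_equivalent` helper, and the toy source/target implementations are all hypothetical, standing in for the standardized interfaces behind which RepoMod-Bench hides its test suites.

```python
# Sketch of implementation-agnostic equivalence testing: the harness only
# sees a standardized interface (here, a callable taking JSON-serializable
# input and returning JSON-serializable output), so the same test vectors
# can verify an implementation in any language behind a thin adapter.
import json


def run_case(impl, case):
    """Invoke an implementation through the standardized interface."""
    return impl(case["input"])


def functionally_equivalent(source_impl, target_impl, cases):
    """Black-box check: both implementations must agree on every test vector.

    Outputs are canonicalized via JSON serialization so superficial
    differences (e.g. dict key ordering) do not cause spurious failures.
    Returns (overall_pass, per_case_results).
    """
    results = []
    for case in cases:
        src = json.dumps(run_case(source_impl, case), sort_keys=True)
        tgt = json.dumps(run_case(target_impl, case), sort_keys=True)
        results.append(src == tgt)
    return all(results), results


# Toy "source" (ground-truth) and "target" (modernized) implementations.
def source_impl(xs):
    return {"sum": sum(xs), "max": max(xs)}


def target_impl(xs):
    # A faithful port: identical observable output, internals may differ.
    return {"max": max(xs), "sum": sum(xs)}


cases = [{"input": [1, 2, 3]}, {"input": [-5, 10]}]
ok, per_case = functionally_equivalent(source_impl, target_impl, cases)
```

Because the harness never inspects the implementations themselves, the same vectors work regardless of target language, and keeping `cases` hidden from the agent is what blocks the test-driven overfitting the paper cautions against.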
Related papers
- DEP: A Decentralized Large Language Model Evaluation Protocol [51.3646001384887]
Decentralized Evaluation Protocol (DEP) is a decentralized yet unified and standardized evaluation framework. By decoupling users, LLMs, and benchmarks, DEP enables modular, plug-and-play evaluation. We develop DEP Toolkit, a protocol-compatible toolkit that supports features such as breakpoint resume, concurrent requests, and congestion control.
arXiv Detail & Related papers (2026-03-01T16:10:16Z)
- SPARC: Scenario Planning and Reasoning for Automated C Unit Test Generation [1.0010193170880752]
We introduce a neuro-symbolic, scenario-based framework that bridges the gap between high-level program intent and the rigid syntactic constraints of pointer arithmetic and manual memory management. We evaluate it on 59 real-world and algorithmic subjects, where it outperforms the vanilla prompt generation baseline by 31.36% in line coverage, 26.01% in branch coverage, and 20.78% in mutation score, matching or exceeding the symbolic execution tool KLEE.
arXiv Detail & Related papers (2026-02-18T18:09:03Z)
- FeatureBench: Benchmarking Agentic Coding for Complex Feature Development [42.26354337364403]
FeatureBench is a benchmark designed to evaluate agentic coding performance in end-to-end, feature-oriented software development. It incorporates an execution-based evaluation protocol and a scalable test-driven method that automatically derives tasks from code repositories with minimal human effort. Empirical evaluation reveals that the state-of-the-art agentic model Claude 4.5 Opus achieves a 74.4% resolved rate on SWE-bench.
arXiv Detail & Related papers (2026-02-11T16:06:32Z)
- RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z)
- NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents [79.29376673236142]
Existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. We present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents.
arXiv Detail & Related papers (2025-12-14T15:12:13Z)
- TestForge: Feedback-Driven, Agentic Test Suite Generation [7.288137795439405]
TestForge is an agentic unit testing framework designed to cost-effectively generate high-quality test suites for real-world code. TestForge produces more natural and understandable tests compared to state-of-the-art search-based techniques.
arXiv Detail & Related papers (2025-03-18T20:21:44Z)
- CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z)
- Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation [37.25839260805938]
Skeleton-Guided-Translation is a framework for repository-level Java-to-C# code translation with fine-grained quality evaluation. We present TRANSREPO-BENCH, a benchmark of high-quality open-source Java repositories and their corresponding C# skeletons.
arXiv Detail & Related papers (2025-01-27T13:44:51Z)
- Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch. Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests. Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z)
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)