RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository
- URL: http://arxiv.org/abs/2601.13943v2
- Date: Thu, 22 Jan 2026 09:31:11 GMT
- Title: RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository
- Authors: Zhiyuan Peng, Xin Yin, Pu Zhao, Fangkai Yang, Lu Wang, Ran Jia, Xu Chen, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
- Abstract summary: RepoGenesis is the first multilingual benchmark for repository-level end-to-end web microservice generation. It consists of 106 repositories (60 Python, 46 Java) across 18 domains and 11 frameworks, with 1,258 API endpoints and 2,335 verified test cases. Results reveal that despite high AC (up to 73.91%) and DSR (up to 100%), the best-performing system achieves only 23.67% Pass@1 on Python and 21.45% on Java.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models and agents have achieved remarkable progress in code generation. However, existing benchmarks focus on isolated function/class-level generation (e.g., ClassEval) or modifications to existing codebases (e.g., SWE-Bench), neglecting complete microservice repository generation that reflects real-world 0-to-1 development workflows. To bridge this gap, we introduce RepoGenesis, the first multilingual benchmark for repository-level end-to-end web microservice generation, comprising 106 repositories (60 Python, 46 Java) across 18 domains and 11 frameworks, with 1,258 API endpoints and 2,335 test cases verified through a "review-rebuttal" quality assurance process. We evaluate open-source agents (e.g., DeepCode) and commercial IDEs (e.g., Cursor) using Pass@1, API Coverage (AC), and Deployment Success Rate (DSR). Results reveal that despite high AC (up to 73.91%) and DSR (up to 100%), the best-performing system achieves only 23.67% Pass@1 on Python and 21.45% on Java, exposing deficiencies in architectural coherence, dependency management, and cross-file consistency. Notably, GenesisAgent-8B, fine-tuned on RepoGenesis (train), achieves performance comparable to GPT-5 mini, demonstrating the quality of RepoGenesis for advancing microservice generation. We release our benchmark at https://github.com/microsoft/DKI_LLM/tree/main/RepoGenesis.
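The three evaluation metrics (Pass@1, API Coverage, Deployment Success Rate) can be sketched as simple aggregates over per-repository results. This is a minimal illustration with a hypothetical result schema; the benchmark's exact definitions (e.g., per-repository averaging or endpoint-matching rules) may differ.

```python
from dataclasses import dataclass

@dataclass
class RepoResult:
    """Outcome of one generated repository (hypothetical schema)."""
    deployed: bool          # did the generated service build and start?
    endpoints_hit: int      # required API endpoints that responded
    endpoints_total: int    # API endpoints specified in the README
    tests_passed: int       # verified test cases that passed
    tests_total: int        # verified test cases executed

def pass_at_1(results: list[RepoResult]) -> float:
    """Fraction of test cases passed on a single generation attempt."""
    passed = sum(r.tests_passed for r in results)
    total = sum(r.tests_total for r in results)
    return passed / total

def api_coverage(results: list[RepoResult]) -> float:
    """Share of required API endpoints the generated code exposes."""
    hit = sum(r.endpoints_hit for r in results)
    total = sum(r.endpoints_total for r in results)
    return hit / total

def deployment_success_rate(results: list[RepoResult]) -> float:
    """Fraction of generated repositories that deploy and start."""
    return sum(r.deployed for r in results) / len(results)

# Toy data for two generated repositories.
results = [
    RepoResult(deployed=True, endpoints_hit=8, endpoints_total=10,
               tests_passed=5, tests_total=20),
    RepoResult(deployed=False, endpoints_hit=3, endpoints_total=12,
               tests_passed=0, tests_total=15),
]
print(f"Pass@1: {pass_at_1(results):.2%}")
print(f"AC:     {api_coverage(results):.2%}")               # 11/22 = 50.00%
print(f"DSR:    {deployment_success_rate(results):.2%}")    # 1/2  = 50.00%
```

The gap the paper reports (high AC and DSR but low Pass@1) corresponds to repositories that deploy and expose most endpoints yet fail behavioral test cases, which is consistent with the cited deficiencies in cross-file consistency.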
Related papers
- RepoMod-Bench: A Benchmark for Code Repository Modernization via Implementation-Agnostic Testing [1.4069797812477614]
We introduce a benchmarking framework for repository-level code modernization built on an implementation-agnostic evaluation paradigm. RepoMod-Bench is a benchmark of 21 real-world repositories with standardized interfaces, spanning 8 languages. The benchmark contains 1.6M lines of code (LOC) and 11,616 tests, with repository sizes ranging from 14 to 211K LOC.
arXiv Detail & Related papers (2026-02-26T01:25:00Z)
- SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories [2.951332247539421]
We introduce SWE-Bench++, an automated framework that generates repository-level coding tasks from open-source GitHub projects. Unlike synthetic approaches, our pipeline harvests live pull requests to cover both bug fixes and feature requests. Our initial benchmark consists of 11,133 instances from 3,971 repositories across 11 languages.
arXiv Detail & Related papers (2025-12-19T10:16:51Z)
- A Multi-Language Object-Oriented Programming Benchmark for Large Language Models [61.267115598083315]
A survey of 35 existing benchmarks uncovers three major imbalances: 85.7% focus on a single programming language, 94.3% target only function-level or statement-level tasks, and over 80% include fewer than ten test cases on average.
arXiv Detail & Related papers (2025-09-30T11:30:08Z)
- GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git [0.8397730500554048]
GitGoodBench is a novel benchmark for evaluating AI agent performance on Version Control System (VCS) tasks. Our benchmark covers three core Git scenarios extracted from open-source Python, Java, and Kotlin repositories. We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving a 21.11% solve rate overall.
arXiv Detail & Related papers (2025-05-28T16:56:11Z)
- EmbedAgent: Benchmarking Large Language Models in Embedded System Development [41.849233931919265]
Large Language Models (LLMs) have shown promise in various tasks, yet few benchmarks assess their capabilities in embedded system development. We introduce EmbedAgent, a paradigm designed to simulate real-world roles in embedded system development. We propose Embedbench, the first comprehensive benchmark for embedded system programming, circuit design, and cross-platform migration.
arXiv Detail & Related papers (2025-04-19T12:51:24Z)
- EnvBench: A Benchmark for Automated Environment Setup [76.02998475135824]
Large Language Models have enabled researchers to focus on practical repository-level tasks in the software engineering domain. Existing studies on environment setup introduce innovative agentic strategies, but their evaluation is often based on small datasets. To address this gap, we introduce EnvBench, a comprehensive environment setup benchmark.
arXiv Detail & Related papers (2025-03-18T17:19:12Z)
- RustRepoTrans: Repository-level Code Translation Benchmark Targeting Rust [50.65321080814249]
RustRepoTrans is the first repository-level context code translation benchmark targeting incremental translation. We evaluate seven representative LLMs, analyzing their errors to assess limitations in complex translation scenarios.
arXiv Detail & Related papers (2024-11-21T10:00:52Z)
- Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet' [9.48622608877252]
A number of repository-level code generation benchmarks have emerged to evaluate the capabilities of large language models (LLMs). These benchmarks consist of short completions, synthetic examples, or focus on limited-scale repositories, failing to represent real-world coding tasks. We create REPOCOD, a Python code-generation benchmark containing complex tasks with realistic dependencies in real-world large projects.
arXiv Detail & Related papers (2024-10-29T01:21:05Z)
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents [49.68117560675367]
Crab is the first benchmark framework designed to support cross-environment tasks. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z)
- RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation [79.83270415843857]
We introduce RepoAgent, a large language model powered open-source framework aimed at proactively generating, maintaining, and updating code documentation.
We have validated the effectiveness of our approach, showing that RepoAgent excels in generating high-quality repository-level documentation.
arXiv Detail & Related papers (2024-02-26T15:39:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.