RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository
- URL: http://arxiv.org/abs/2601.13943v2
- Date: Thu, 22 Jan 2026 09:31:11 GMT
- Title: RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository
- Authors: Zhiyuan Peng, Xin Yin, Pu Zhao, Fangkai Yang, Lu Wang, Ran Jia, Xu Chen, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
- Abstract summary: RepoGenesis is the first multilingual benchmark for repository-level end-to-end web microservice generation. It consists of 106 repositories (60 Python, 46 Java) across 18 domains and 11 frameworks, with 1,258 API endpoints and 2,335 verified test cases. Results reveal that despite high AC (up to 73.91%) and DSR (up to 100%), the best-performing system achieves only 23.67% Pass@1 on Python and 21.45% on Java.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models and agents have achieved remarkable progress in code generation. However, existing benchmarks focus on isolated function/class-level generation (e.g., ClassEval) or modifications to existing codebases (e.g., SWE-Bench), neglecting complete microservice repository generation that reflects real-world 0-to-1 development workflows. To bridge this gap, we introduce RepoGenesis, the first multilingual benchmark for repository-level end-to-end web microservice generation, comprising 106 repositories (60 Python, 46 Java) across 18 domains and 11 frameworks, with 1,258 API endpoints and 2,335 test cases verified through a "review-rebuttal" quality assurance process. We evaluate open-source agents (e.g., DeepCode) and commercial IDEs (e.g., Cursor) using Pass@1, API Coverage (AC), and Deployment Success Rate (DSR). Results reveal that despite high AC (up to 73.91%) and DSR (up to 100%), the best-performing system achieves only 23.67% Pass@1 on Python and 21.45% on Java, exposing deficiencies in architectural coherence, dependency management, and cross-file consistency. Notably, GenesisAgent-8B, fine-tuned on RepoGenesis (train), achieves performance comparable to GPT-5 mini, demonstrating the quality of RepoGenesis for advancing microservice generation. We release our benchmark at https://github.com/microsoft/DKI_LLM/tree/main/RepoGenesis.
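The three evaluation metrics (Pass@1, API Coverage, Deployment Success Rate) can be sketched as simple aggregates over per-repository results. This is a minimal illustration with a hypothetical result schema; the benchmark's exact definitions (e.g., per-repository averaging or endpoint-matching rules) may differ.

```python
from dataclasses import dataclass

@dataclass
class RepoResult:
    """Outcome of one generated repository (hypothetical schema)."""
    deployed: bool          # did the generated service build and start?
    endpoints_hit: int      # required API endpoints that responded
    endpoints_total: int    # API endpoints specified in the README
    tests_passed: int       # verified test cases that passed
    tests_total: int        # verified test cases executed

def pass_at_1(results: list[RepoResult]) -> float:
    """Fraction of test cases passed on a single generation attempt."""
    passed = sum(r.tests_passed for r in results)
    total = sum(r.tests_total for r in results)
    return passed / total

def api_coverage(results: list[RepoResult]) -> float:
    """Share of required API endpoints the generated code exposes."""
    hit = sum(r.endpoints_hit for r in results)
    total = sum(r.endpoints_total for r in results)
    return hit / total

def deployment_success_rate(results: list[RepoResult]) -> float:
    """Fraction of generated repositories that deploy and start."""
    return sum(r.deployed for r in results) / len(results)

# Toy data for two generated repositories.
results = [
    RepoResult(deployed=True, endpoints_hit=8, endpoints_total=10,
               tests_passed=5, tests_total=20),
    RepoResult(deployed=False, endpoints_hit=3, endpoints_total=12,
               tests_passed=0, tests_total=15),
]
print(f"Pass@1: {pass_at_1(results):.2%}")
print(f"AC:     {api_coverage(results):.2%}")               # 11/22 = 50.00%
print(f"DSR:    {deployment_success_rate(results):.2%}")    # 1/2  = 50.00%
```

The gap the paper reports (high AC and DSR but low Pass@1) corresponds to repositories that deploy and expose most endpoints yet fail behavioral test cases, which is consistent with the cited deficiencies in cross-file consistency.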
Related papers
- RepoMod-Bench: A Benchmark for Code Repository Modernization via Implementation-Agnostic Testing [1.4069797812477614]
We introduce a benchmarking framework for repository-level code modernization built on an implementation-agnostic evaluation paradigm. RepoMod-Bench is a benchmark of 21 real-world repositories with standardized interfaces, spanning 8 languages. The benchmark contains 1.6M lines of code (LOC) and 11,616 tests, with repository sizes ranging from 14 to 211K LOC.
arXiv Detail & Related papers (2026-02-26T01:25:00Z)
- SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories [2.951332247539421]
We introduce SWE-Bench++, an automated framework that generates repository-level coding tasks from open-source GitHub projects. Unlike synthetic approaches, our pipeline harvests live pull requests to cover both bug fixes and feature requests. Our initial benchmark consists of 11,133 instances from 3,971 repositories across 11 languages.
arXiv Detail & Related papers (2025-12-19T10:16:51Z)
- A Multi-Language Object-Oriented Programming Benchmark for Large Language Models [61.267115598083315]
A survey of 35 existing benchmarks uncovers three major imbalances: 85.7% focus on a single programming language, 94.3% target only function-level or statement-level tasks, and over 80% include fewer than ten test cases on average.
arXiv Detail & Related papers (2025-09-30T11:30:08Z)
- GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git [0.8397730500554048]
GitGoodBench is a novel benchmark for evaluating AI agent performance on Version Control System (VCS) tasks. Our benchmark covers three core Git scenarios extracted from open-source Python, Java, and Kotlin repositories. We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving a 21.11% solve rate overall.
arXiv Detail & Related papers (2025-05-28T16:56:11Z)
- EmbedAgent: Benchmarking Large Language Models in Embedded System Development [41.849233931919265]
Large Language Models (LLMs) have shown promise in various tasks, yet few benchmarks assess their capabilities in embedded system development. We introduce EmbedAgent, a paradigm designed to simulate real-world roles in embedded system development. We propose Embedbench, the first comprehensive benchmark for embedded system programming, circuit design, and cross-platform migration.
arXiv Detail & Related papers (2025-04-19T12:51:24Z)
- EnvBench: A Benchmark for Automated Environment Setup [76.02998475135824]
Large Language Models have enabled researchers to focus on practical repository-level tasks in the software engineering domain. Existing studies on environment setup introduce innovative agentic strategies, but their evaluation is often based on small datasets. To address this gap, we introduce EnvBench, a comprehensive environment setup benchmark.
arXiv Detail & Related papers (2025-03-18T17:19:12Z)
- RustRepoTrans: Repository-level Code Translation Benchmark Targeting Rust [50.65321080814249]
RustRepoTrans is the first repository-level context code translation benchmark targeting incremental translation. We evaluate seven representative LLMs, analyzing their errors to assess limitations in complex translation scenarios.
arXiv Detail & Related papers (2024-11-21T10:00:52Z)
- Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet' [9.48622608877252]
A number of repository-level code generation benchmarks have emerged to evaluate the capabilities of large language models (LLMs). These benchmarks consist of short completions, synthetic examples, or focus on limited-scale repositories, failing to represent real-world coding tasks. We create REPOCOD, a Python code-generation benchmark containing complex tasks with realistic dependencies in real-world large projects.
arXiv Detail & Related papers (2024-10-29T01:21:05Z)
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents [49.68117560675367]
Crab is the first benchmark framework designed to support cross-environment tasks. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z)
- RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation [79.83270415843857]
We introduce RepoAgent, a large language model powered open-source framework aimed at proactively generating, maintaining, and updating code documentation.
We have validated the effectiveness of our approach, showing that RepoAgent excels in generating high-quality repository-level documentation.
arXiv Detail & Related papers (2024-02-26T15:39:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.