SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks
- URL: http://arxiv.org/abs/2511.02352v2
- Date: Wed, 05 Nov 2025 06:31:19 GMT
- Title: SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks
- Authors: Sanket Mhatre, Yasharth Bajpai, Sumit Gulwani, Emerson Murphy-Hill, Gustavo Soares,
- Abstract summary: SWE-Sharp-Bench is a software engineering benchmark for C# featuring 150 instances from 17 repositories. While 70% of Python tasks in SWE-Bench Verified are solved, only 40% of our C# tasks are resolved. We open-source SWE-Sharp-Bench and our entire curation pipeline.
- Score: 7.04771396439844
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: AI coding agents have shown great progress on Python software engineering benchmarks like SWE-Bench, and on other languages like Java and C in benchmarks like Multi-SWE-Bench. However, C# -- a prominent enterprise language ranking #5 in the TIOBE index -- remains absent from such benchmarks. We introduce SWE-Sharp-Bench, a reproducible software engineering benchmark for C# featuring 150 instances from 17 repositories. Evaluating identical model-agent configurations across languages reveals a significant performance gap: while 70% of Python tasks in SWE-Bench Verified are solved, only 40% of our C# tasks are resolved. We open-source SWE-Sharp-Bench and our entire curation pipeline.
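The abstract does not spell out the benchmark's instance format or scoring harness; the sketch below is a rough, hypothetical illustration of how SWE-bench-style benchmarks are typically structured: each instance pins a repository and base commit, carries the issue text, and lists the tests that must flip from failing to passing. All names here (`TaskInstance`, `run_tests`, the field names, and the use of `dotnet test --filter`) are assumptions for illustration, not the actual SWE-Sharp-Bench schema or evaluation harness.

```python
# Hypothetical sketch of an SWE-bench-style task instance and resolution check.
# Field names and the test-runner helper are illustrative assumptions, not the
# actual SWE-Sharp-Bench schema or harness.
from dataclasses import dataclass, field
import subprocess


@dataclass
class TaskInstance:
    instance_id: str          # e.g. "someorg__somerepo-1234" (made-up example)
    repo: str                 # GitHub repository the issue comes from
    base_commit: str          # commit the agent's patch is applied on top of
    problem_statement: str    # issue text given to the agent
    fail_to_pass: list[str] = field(default_factory=list)  # tests the fix must turn green
    pass_to_pass: list[str] = field(default_factory=list)  # tests that must stay green


def run_tests(workdir: str, test_filter: str) -> bool:
    """Run a filtered subset of the repo's test suite; True if it exits cleanly."""
    # Assumes the .NET SDK is on PATH; `--filter` narrows the run to the named tests.
    result = subprocess.run(
        ["dotnet", "test", "--filter", test_filter],
        cwd=workdir, capture_output=True, text=True,
    )
    return result.returncode == 0


def is_resolved(instance: TaskInstance, workdir: str) -> bool:
    """An instance counts as resolved if every fail-to-pass test now passes
    and no pass-to-pass test regresses (after the agent's patch is applied)."""
    fixed = all(run_tests(workdir, t) for t in instance.fail_to_pass)
    no_regression = all(run_tests(workdir, t) for t in instance.pass_to_pass)
    return fixed and no_regression
```

Under a check of this kind, the resolved rates quoted above (70% for Python on SWE-Bench Verified vs. 40% for C# here) are simply the fraction of instances whose fail-to-pass tests all pass without regressing the pass-to-pass set.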
Related papers
- RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository [52.98970048197381]
RepoGenesis is the first multilingual benchmark for repository-level end-to-end web microservice generation. It consists of 106 repositories (60 Python, 46 Java) across 18 domains and 11 frameworks, with 1,258 API endpoints and 2,335 verified test cases. Results reveal that despite high AC (up to 73.91%) and DSR (up to 100%), the best-performing system achieves only 23.67% Pass@1 on Python and 21.45% on Java.
arXiv Detail & Related papers (2026-01-20T13:19:20Z)
- SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories [2.951332247539421]
We introduce SWE-Bench++, an automated framework that generates repository-level coding tasks from open-source GitHub projects. Unlike synthetic approaches, our pipeline harvests live pull requests to cover both bug fixes and feature requests across 11 languages. Our initial benchmark consists of 11,133 instances from 3,971 repositories across 11 languages.
arXiv Detail & Related papers (2025-12-19T10:16:51Z)
- A Multi-Language Object-Oriented Programming Benchmark for Large Language Models [61.267115598083315]
A survey of 35 existing benchmarks uncovers three major imbalances: 85.7% focus on a single programming language, 94.3% target only function-level or statement-level tasks, and over 80% include fewer than ten test cases on average.
arXiv Detail & Related papers (2025-09-30T11:30:08Z)
- GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git [0.8397730500554048]
GitGoodBench is a novel benchmark for evaluating AI agent performance on Version Control System (VCS) tasks. Our benchmark covers three core Git scenarios extracted from open-source Python, Java, and Kotlin repositories. We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving a 21.11% solve rate overall.
arXiv Detail & Related papers (2025-05-28T16:56:11Z)
- SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents [49.73885480071402]
We introduce SWE-PolyBench, a new benchmark for repository-level, execution-based evaluation of coding agents. SWE-PolyBench contains 2110 instances from 21 repositories and includes tasks in Java (165), JavaScript (1017), TypeScript (729) and Python (199), covering bug fixes, feature additions, and code refactoring. Our experiments show that current agents exhibit uneven performance across languages and struggle with complex problems while showing higher performance on simpler tasks.
arXiv Detail & Related papers (2025-04-11T17:08:02Z)
- EnvBench: A Benchmark for Automated Environment Setup [76.02998475135824]
Large Language Models have enabled researchers to focus on practical repository-level tasks in the software engineering domain. Existing studies on environment setup introduce innovative agentic strategies, but their evaluation is often based on small datasets. To address this gap, we introduce EnvBench, a comprehensive environment setup benchmark.
arXiv Detail & Related papers (2025-03-18T17:19:12Z)
- EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking [58.15568681219339]
We introduce EquiBench, a new benchmark for evaluating large language models (LLMs) via equivalence checking. This task directly tests a model's ability to reason about program semantics. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline.
arXiv Detail & Related papers (2025-02-18T02:54:25Z)
- SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? [64.34184587727334]
We propose SWE-bench Multimodal to evaluate systems on their ability to fix bugs in visual, user-facing JavaScript software.
SWE-bench M features 617 task instances collected from 17 JavaScript libraries used for web interface design, diagramming, data visualization, syntax highlighting, and interactive mapping.
Our analysis finds that top-performing SWE-bench systems struggle with SWE-bench M, revealing limitations in visual problem-solving and cross-language generalization.
arXiv Detail & Related papers (2024-10-04T18:48:58Z)
- SWE-bench-java: A GitHub Issue Resolving Benchmark for Java [27.226354754864783]
SWE-bench has been released to evaluate the issue-resolving capabilities of large language models (LLMs).
As a first step toward multilingual support, we have developed a Java version of SWE-bench, called SWE-bench-java.
To verify the reliability of SWE-bench-java, we implement a classic method, SWE-agent, and test several powerful LLMs on it.
arXiv Detail & Related papers (2024-08-26T15:30:05Z)
- RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems [43.797002322559834]
RepoBench is a benchmark for evaluating code auto-completion systems.
It consists of three evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline).
arXiv Detail & Related papers (2023-06-05T17:59:41Z)