SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks
- URL: http://arxiv.org/abs/2511.02352v2
- Date: Wed, 05 Nov 2025 06:31:19 GMT
- Title: SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks
- Authors: Sanket Mhatre, Yasharth Bajpai, Sumit Gulwani, Emerson Murphy-Hill, Gustavo Soares,
- Abstract summary: SWE-Sharp-Bench is a software engineering benchmark for C# featuring 150 instances from 17 repositories. While 70% of Python tasks in SWE-Bench Verified are solved, only 40% of our C# tasks are resolved. We open-source SWE-Sharp-Bench and our entire curation pipeline.
- Score: 7.04771396439844
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: AI coding agents have shown great progress on Python software engineering benchmarks like SWE-Bench, and on other languages like Java and C in benchmarks like Multi-SWE-Bench. However, C# -- a prominent enterprise language ranking #5 in the TIOBE index -- remains absent from such benchmarks. We introduce SWE-Sharp-Bench, a reproducible software engineering benchmark for C# featuring 150 instances from 17 repositories. Evaluating identical model-agent configurations across languages reveals a significant performance gap: while 70% of Python tasks in SWE-Bench Verified are solved, only 40% of our C# tasks are resolved. We open-source SWE-Sharp-Bench and our entire curation pipeline.
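The abstract does not spell out the benchmark's instance format or scoring harness; the sketch below is a rough, hypothetical illustration of how SWE-bench-style benchmarks are typically structured: each instance pins a repository and base commit, carries the issue text, and lists the tests that must flip from failing to passing. All names here (`TaskInstance`, `run_tests`, the field names, and the use of `dotnet test --filter`) are assumptions for illustration, not the actual SWE-Sharp-Bench schema or evaluation harness.

```python
# Hypothetical sketch of an SWE-bench-style task instance and resolution check.
# Field names and the test-runner helper are illustrative assumptions, not the
# actual SWE-Sharp-Bench schema or harness.
from dataclasses import dataclass, field
import subprocess


@dataclass
class TaskInstance:
    instance_id: str          # e.g. "someorg__somerepo-1234" (made-up example)
    repo: str                 # GitHub repository the issue comes from
    base_commit: str          # commit the agent's patch is applied on top of
    problem_statement: str    # issue text given to the agent
    fail_to_pass: list[str] = field(default_factory=list)  # tests the fix must turn green
    pass_to_pass: list[str] = field(default_factory=list)  # tests that must stay green


def run_tests(workdir: str, test_filter: str) -> bool:
    """Run a filtered subset of the repo's test suite; True if it exits cleanly."""
    # Assumes the .NET SDK is on PATH; `--filter` narrows the run to the named tests.
    result = subprocess.run(
        ["dotnet", "test", "--filter", test_filter],
        cwd=workdir, capture_output=True, text=True,
    )
    return result.returncode == 0


def is_resolved(instance: TaskInstance, workdir: str) -> bool:
    """An instance counts as resolved if every fail-to-pass test now passes
    and no pass-to-pass test regresses (after the agent's patch is applied)."""
    fixed = all(run_tests(workdir, t) for t in instance.fail_to_pass)
    no_regression = all(run_tests(workdir, t) for t in instance.pass_to_pass)
    return fixed and no_regression
```

Under a check of this kind, the resolved rates quoted above (70% for Python on SWE-Bench Verified vs. 40% for C# here) are simply the fraction of instances whose fail-to-pass tests all pass without regressing the pass-to-pass set.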
Related papers
- RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository [52.98970048197381]
RepoGenesis is the first multilingual benchmark for repository-level end-to-end web microservice generation. It consists of 106 repositories (60 Python, 46 Java) across 18 domains and 11 frameworks, with 1,258 API endpoints and 2,335 verified test cases. Results reveal that despite high AC (up to 73.91%) and DSR (up to 100%), the best-performing system achieves only 23.67% Pass@1 on Python and 21.45% on Java.
arXiv Detail & Related papers (2026-01-20T13:19:20Z)
- SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories [2.951332247539421]
We introduce SWE-Bench++, an automated framework that generates repository-level coding tasks from open-source GitHub projects. Unlike synthetic approaches, our pipeline harvests live pull requests to cover both bug fixes and feature requests across 11 languages. Our initial benchmark consists of 11,133 instances from 3,971 repositories across 11 languages.
arXiv Detail & Related papers (2025-12-19T10:16:51Z)
- A Multi-Language Object-Oriented Programming Benchmark for Large Language Models [61.267115598083315]
A survey of 35 existing benchmarks uncovers three major imbalances: 85.7% focus on a single programming language, 94.3% target only function-level or statement-level tasks, and over 80% include fewer than ten test cases on average.
arXiv Detail & Related papers (2025-09-30T11:30:08Z)
- GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git [0.8397730500554048]
GitGoodBench is a novel benchmark for evaluating AI agent performance on Version Control System (VCS) tasks. Our benchmark covers three core Git scenarios extracted from open-source Python, Java, and Kotlin repositories. We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving a 21.11% solve rate overall.
arXiv Detail & Related papers (2025-05-28T16:56:11Z)
- SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents [49.73885480071402]
We introduce SWE-PolyBench, a new benchmark for repository-level, execution-based evaluation of coding agents. SWE-PolyBench contains 2110 instances from 21 repositories and includes tasks in Java (165), JavaScript (1017), TypeScript (729) and Python (199), covering bug fixes, feature additions, and code refactoring. Our experiments show that current agents exhibit uneven performance across languages and struggle with complex problems while showing higher performance on simpler tasks.
arXiv Detail & Related papers (2025-04-11T17:08:02Z)
- EnvBench: A Benchmark for Automated Environment Setup [76.02998475135824]
Large Language Models have enabled researchers to focus on practical repository-level tasks in the software engineering domain. Existing studies on environment setup introduce innovative agentic strategies, but their evaluation is often based on small datasets. To address this gap, we introduce EnvBench, a comprehensive environment setup benchmark.
arXiv Detail & Related papers (2025-03-18T17:19:12Z)
- EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking [58.15568681219339]
We introduce EquiBench, a new benchmark for evaluating large language models (LLMs) via equivalence checking. This task directly tests a model's ability to reason about program semantics. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline.
arXiv Detail & Related papers (2025-02-18T02:54:25Z)
- SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? [64.34184587727334]
We propose SWE-bench Multimodal to evaluate systems on their ability to fix bugs in visual, user-facing JavaScript software.
SWE-bench M features 617 task instances collected from 17 JavaScript libraries used for web interface design, diagramming, data visualization, syntax highlighting, and interactive mapping.
Our analysis finds that top-performing SWE-bench systems struggle with SWE-bench M, revealing limitations in visual problem-solving and cross-language generalization.
arXiv Detail & Related papers (2024-10-04T18:48:58Z)
- SWE-bench-java: A GitHub Issue Resolving Benchmark for Java [27.226354754864783]
SWE-bench has been released to evaluate the issue-resolving capabilities of large language models (LLMs).
As a first step toward multilingual support, we have developed a Java version of SWE-bench, called SWE-bench-java.
To verify the reliability of SWE-bench-java, we implement a classic method, SWE-agent, and test several powerful LLMs on it.
arXiv Detail & Related papers (2024-08-26T15:30:05Z)
- RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems [43.797002322559834]
RepoBench is a benchmark for evaluating code auto-completion systems.
It consists of three evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline).
arXiv Detail & Related papers (2023-06-05T17:59:41Z)