Related papers: SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

URL: http://arxiv.org/abs/2511.05459v3
Date: Wed, 12 Nov 2025 01:58:57 GMT
Title: SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models
Authors: Jingxuan Xu, Ken Deng, Weihao Li, Songwei Yu, Huaixi Tang, Haoyang Huang, Zhiyi Lai, Zizheng Zhan, Yanan Wu, Chenchen Zhang, Kepeng Lei, Yifan Yao, Xinping Lei, Wenqiang Zhu, Zongxian Feng, Han Li, Junqi Xiong, Dailin Li, Zuchen Gao, Kun Wu, Wen Xiang, Ziqi Zhan, Yuanxing Zhang, Wuxuan Gong, Ziyuan Gao, Guanxiang Wang, Yirong Xue, Mengtong Li, Mengfei Xie, Xiaojiang Zhang, Jinghui Wang, Wenhao Zhuang, Zheng Lin, Huiming Wang, Zhaoxiang Zhang, Yuqun Zhang, Haotian Zhang, Bin Chen, Jiaheng Liu,
Abstract summary: evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer.<n>We introduce SWE-1, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework.<n>SWE- spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests.
Score: 59.90381306452982
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic problems or Python-centric bug fixing, leaving critical dimensions of software engineering underexplored. To address these gaps, we introduce SWE-Compass1, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests and refined through systematic filtering and validation. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types, languages, and scenarios. Moreover, by aligning evaluation with real-world developer practices, SWE-Compass provides a rigorous and reproducible foundation for diagnosing and advancing agentic coding capabilities in large language models.

Related papers

Code Fingerprints: Disentangled Attribution of LLM-Generated Code [7.515488307576106]
We study the problem of model-level code attribution, which aims to determine the source LLM responsible for generated code.<n>We propose the Disentangled Code Attribution Network (DCAN), which separates Source-Agnostic semantic information from Source-Specific stylistic representations.<n>We construct the first large-scale benchmark dataset comprising code generated by four widely used Large Language Models (LLMs) across four programming languages.
arXiv Detail & Related papers (2026-03-04T15:58:36Z)
Multi-CoLoR: Context-Aware Localization and Reasoning across Multi-Language Codebases [1.4216413758677147]
We present Multi-CoLoR, a framework for Context-aware localization and reasoning across Multi-Languages.<n>It integrates organizational knowledge retrieval with graph-based reasoning to traverse complex software ecosystems.
arXiv Detail & Related papers (2026-02-23T00:54:59Z)
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence [150.3696990310269]
Large language models (LLMs) have transformed automated software development by enabling direct translation of natural language descriptions into functional code.<n>We provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs.<n>We analyze the code capability of the general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder)
arXiv Detail & Related papers (2025-11-23T17:09:34Z)
LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering [90.84806758077536]
We introduce textbfLoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess large language models (LLMs) agents in realistic, long-context software engineering.<n>Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations.<n>Our framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens.
arXiv Detail & Related papers (2025-11-17T23:57:24Z)
A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System [56.40989626804489]
This survey provides the first holistic analysis of Large Language Models-powered software engineering.<n>We review over 150 recent papers and propose a taxonomy along two key dimensions: (1) Solutions, categorized into prompt-based, fine-tuning-based, and agent-based paradigms, and (2) Benchmarks, including tasks such as code generation, translation, and repair.
arXiv Detail & Related papers (2025-10-10T06:56:50Z)
Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes [33.80591142965565]
We present CODE2BENCH, a pipeline for dynamically constructing robust and contamination-resistant benchmarks from real-world GitHub repositories.<n>Specifically, CODE2BENCH introduces three key innovations: (1) Automated Dynamism, achieved through periodic ingestion of recent code to minimize training data contamination; (2) Scope Graph-based dependency analysis, which enables structured classification of functions into benchmark instances with controlled dependency levels; and (3) Property-Based Testing (PBT) for the automated synthesis of rigorous test suites.
arXiv Detail & Related papers (2025-08-10T05:06:36Z)
IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.<n>The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks [56.34018316319873]
We propose MERA Code, a benchmark for evaluating code for the latest code generation LLMs in Russian.<n>This benchmark includes 11 evaluation tasks that span 8 programming languages.<n>We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages.
arXiv Detail & Related papers (2025-07-16T14:31:33Z)
Evaluating Large Language Models on Non-Code Software Engineering Tasks [4.381476817430934]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code understanding and generation.<n>We present the first comprehensive benchmark, which we name Software Engineering Language Understanding' (SELU)<n>SELU covers classification, regression, Named Entity Recognition (NER) and Masked Language Modeling (MLM) targets, with data drawn from diverse sources.
arXiv Detail & Related papers (2025-06-12T15:52:32Z)
SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation [5.880496520248658]
SIMCOPILOT is a benchmark that simulates the role of large language models (LLMs) as interactive, "copilot"-style coding assistants.<n>The benchmark comprises dedicated sub-benchmarks for Java (SIMCOPILOTJ) and Python.
arXiv Detail & Related papers (2025-05-21T04:59:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.