Fugu-MT Paper Translation (Abstract): SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

Paper overview: SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

arxiv url: http://arxiv.org/abs/2511.05459v1
Date: Fri, 07 Nov 2025 18:01:32 GMT
Status: Translation complete
System update date: 2025-11-10 21:00:44.853422
Title: SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models
Title (reference translation): SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models
Authors: Jingxuan Xu, Ken Deng, Weihao Li, Songwei Yu, Huaixi Tang, Haoyang Huang, Zhiyi Lai, Zizheng Zhan, Yanan Wu, Chenchen Zhang, Kepeng Lei, Yifan Yao, Xinping Lei, Wenqiang Zhu, Zongxian Feng, Han Li, Junqi Xiong, Dailin Li, Zuchen Gao, Kun Wu, Wen Xiang, Ziqi Zhan, Yuanxing Zhang, Wuxuan Gong, Ziyuan Gao, Guanxiang Wang, Yirong Xue, Xiaojiang Zhang, Jinghui Wang, Huiming Wang, Wenhao Zhuang, Zhaoxiang Zhang, Yuqun Zhang, Haotian Zhang, Bin Chen, Jiaheng Liu
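Abstract summary: Evaluation of large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. SWE-Compass is a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured, production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages.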
Reference score (independently computed attention score): 56.67819005872266
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic problems or Python-centric bug fixing, leaving critical dimensions of software engineering underexplored. To address these gaps, we introduce SWE-Compass, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests and refined through systematic filtering and validation. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types, languages, and scenarios. Moreover, by aligning evaluation with real-world developer practices, SWE-Compass provides a rigorous and reproducible foundation for diagnosing and advancing agentic coding capabilities in large language models.
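Abstract (reference translation): Evaluation of large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks focus on algorithmic problems or Python-centric bug fixing, leaving important aspects of software engineering underexplored. SWE-Compass is a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured, production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types, languages, and scenarios. Furthermore, by aligning evaluation with real-world developer practices, SWE-Compass provides a rigorous and reproducible foundation for diagnosing and advancing agentic coding capabilities in large language models.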


Related papers