Fugu-MT Paper Translation (Abstract): OmniCode: A Benchmark for Evaluating Software Engineering Agents

Paper overview: OmniCode: A Benchmark for Evaluating Software Engineering Agents

arxiv url: http://arxiv.org/abs/2602.02262v1
Date: Mon, 02 Feb 2026 16:04:10 GMT
Status: Translation complete
Last updated in system: 2026-02-03 19:28:34.276791
Title: OmniCode: A Benchmark for Evaluating Software Engineering Agents
Authors: Atharv Sonwane, Eng-Shen Tu, Wei-Chung Lu, Claas Beger, Carter Larsen, Debjit Dhar, Rachel Chen, Ronit Pattanayak, Tuan Anh Dang, Guohao Chen, Gloria Geng, Kevin Ellis, Saikat Dutta
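Abstract summary: OmniCode is a novel software engineering benchmark for real-world software development. It contains 1794 tasks spanning three programming languages (Python, Java, and C++) and four key categories: bug fixing, test generation, code review fixing, and style fixing. The authors evaluate OmniCode with popular agent frameworks such as SWE-Agent and show that while these agents perform well on bug fixing for Python, they fall short on tasks such as test generation and in languages such as C++ and Java.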
Reference score (independently calculated attention level): 12.695937079588402
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: LLM-powered coding agents are redefining how real-world software is developed. To drive the research towards better coding agents, we require challenging benchmarks that can rigorously evaluate the ability of such agents to perform various software engineering tasks. However, popular coding benchmarks such as HumanEval and SWE-Bench focus on narrowly scoped tasks such as competition programming and patch generation. In reality, software engineers have to handle a broader set of tasks for real-world software development. To address this gap, we propose OmniCode, a novel software engineering benchmark that contains a broader and more diverse set of task categories beyond code or patch generation. Overall, OmniCode contains 1794 tasks spanning three programming languages (Python, Java, and C++) and four key categories: bug fixing, test generation, code review fixing, and style fixing. In contrast to prior software engineering benchmarks, the tasks in OmniCode are (1) manually validated to eliminate ill-defined problems, and (2) synthetically crafted or recently curated to avoid data leakage issues, presenting a new framework for synthetically generating diverse software tasks from limited real-world data. We evaluate OmniCode with popular agent frameworks such as SWE-Agent and show that while they may perform well on bug fixing for Python, they fall short on tasks such as Test Generation and in languages such as C++ and Java. For instance, SWE-Agent achieves a maximum of 20.9% with DeepSeek-V3.1 on Java Test Generation tasks. OmniCode aims to serve as a robust benchmark and spur the development of agents that can perform well across different aspects of software development. Code and data are available at https://github.com/seal-research/OmniCode.
