Fugu-MT Paper Translation (Abstract): Code Review Agent Benchmark

Paper overview: Code Review Agent Benchmark

arxiv url: http://arxiv.org/abs/2603.23448v2
Date: Mon, 30 Mar 2026 14:02:50 GMT
Status: Translation complete
Last updated in system: 2026-03-31 13:48:18.797004
Title: Code Review Agent Benchmark
Title (reference translation): Code Review Agent Benchmark
Authors: Yuntong Zhang, Zhiyuan Pan, Imam Nur Bani Yusuf, Haifeng Ruan, Ridwan Shariffdeen, Abhik Roychoudhury
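Abstract summary: We curate a code review dataset for AI agents to work with. The evaluation framework can assess the reviewing capability of code review agents. What this will mean for future collaboration among code generation agents, test generation agents, and code review agents remains to be investigated.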
Reference score (independently computed attention score): 11.281773600529212
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Software engineering agents have shown significant promise in writing code. As AI agents permeate code writing and automatically generate huge volumes of code, code quality comes front and centre. As automatically generated code is integrated into huge codebases, code review, and quality assurance more broadly, become important. In this paper, we take a fresh look at the problem and curate a code review dataset for AI agents to work with. Our dataset, called c-CRAB (pronounced see-crab), can evaluate agents on code review tasks. Specifically, given a pull request (which could come from code generation agents or humans), if a code review agent produces a review, our evaluation framework can assess the reviewing capability of that agent. We use the framework to evaluate the state of the art today: the open-source PR-agent, as well as commercial code review agents from Devin, Claude Code, and Codex. Our c-CRAB dataset is systematically constructed from human reviews: given a human review of a pull request instance, we generate corresponding tests to evaluate the reviews generated by code review agents. Such a benchmark construction gives us several insights. Firstly, the existing review agents taken together can solve only around 40% of the c-CRAB tasks, indicating a gap that future research could close. Secondly, we observe that the agent reviews often consider different aspects from the human reviews, indicating the potential for human-agent collaboration in code review that could be deployed in future software teams. Last but not least, the agent-generated tests from our dataset act as a held-out test suite and hence a quality gate for agent-generated reviews. What this will mean for future collaboration among code generation agents, test generation agents, and code review agents remains to be investigated.
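Abstract (reference translation): Software engineering agents have shown great promise in writing code. As AI agents spread through code writing and generate large volumes of code automatically, the question of code quality moves to the foreground. As automatically generated code is merged into very large codebases, code review and quality assurance in general become important. In this paper, we re-examine the problem and curate a code review dataset for AI agents to work with. The dataset, called c-CRAB (pronounced see-crab), can evaluate agents on code review tasks. In particular, given a pull request (which may come from a code generation agent or a human), when a code review agent produces a review, our evaluation framework can assess that agent's reviewing capability. We use the framework to evaluate the current state of the art: the open-source PR-agent and the commercial code review agents from Devin, Claude Code, and Codex. The c-CRAB dataset is constructed systematically from human reviews: given a human review of a pull request instance, we generate corresponding tests to evaluate the reviews produced by code review agents. This benchmark construction yields several insights. First, the existing review agents taken together can solve only about 40% of the c-CRAB tasks, which suggests a gap that future research could close. Second, agent reviews often consider different aspects from human reviews, which points to the potential for human-agent collaboration in code review in future software teams. Finally, the agent-generated tests in our dataset act as a held-out test suite, and therefore as a quality gate for agent-generated reviews. What this will mean for future collaboration among code generation agents, test generation agents, and code review agents remains to be investigated.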
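
The "taken together" figure in the abstract can be read as a union over agents: a task counts as solved if at least one agent solves it. The Python sketch below illustrates that reading only. It assumes that a task is "solved" when the tests generated from the human review pass for an agent's review; the abstract does not spell out the exact criterion. The function names, agent names, and task IDs are all hypothetical, and the data is a toy example, not results from the paper.

```python
from typing import Dict, Set


def solve_rate(solved: Set[str], all_tasks: Set[str]) -> float:
    """Fraction of benchmark tasks in `all_tasks` that appear in `solved`."""
    if not all_tasks:
        raise ValueError("all_tasks must not be empty")
    return len(solved & all_tasks) / len(all_tasks)


def combined_solve_rate(per_agent: Dict[str, Set[str]], all_tasks: Set[str]) -> float:
    """Solve rate of the agents taken together: a task counts if any agent solves it."""
    union: Set[str] = set()
    for solved in per_agent.values():
        union |= solved
    return solve_rate(union, all_tasks)


# Toy data only (hypothetical task IDs and agent names); not results from the paper.
tasks = {"t1", "t2", "t3", "t4", "t5"}
results = {
    "agent_a": {"t1"},
    "agent_b": {"t1", "t2"},
}

print(combined_solve_rate(results, tasks))  # 0.4
```

In this toy data the two agents overlap on `t1`, so the combined rate (2 of 5 tasks) is lower than the sum of the individual rates. That is why a union, rather than a sum, is the natural reading of "taken together".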

Related papers