Fugu-MT Paper Translation (Abstract): Understanding the Limits of Automated Evaluation for Code Review Bots in Practice

Paper overview: Understanding the Limits of Automated Evaluation for Code Review Bots in Practice

arxiv url: http://arxiv.org/abs/2604.24525v1
Date: Mon, 27 Apr 2026 14:25:35 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:08.082242
Title: Understanding the Limits of Automated Evaluation for Code Review Bots in Practice
Title (reference translation): Understanding the Limits of Automated Evaluation for Code Review Bots
Authors: Veli Karakaya, Utku Boran Torun, Baykal Mehmet Uçar, Eray Tüzün
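Abstract summary: We analyze an industrial dataset of 2,604 bot-generated PR comments, each labeled by software engineers as fixed/wontFix. Two automated evaluation approaches, G-Eval and an LLM-as-a-Judge pipeline, are applied using both binary decisions and a 0-4 Likert-scale formulation. Both evaluation strategies achieve only moderate alignment with human labels.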
Reference score (attention metric computed independently by this site): 1.3241176321860364
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Automated code review (ACR) bots are increasingly used in industrial software development to assist developers during pull request (PR) review. As adoption grows, a key challenge is how to evaluate the usefulness of bot-generated comments reliably and at scale. In practice, such evaluation often relies on developer actions and annotations that are shaped by contextual and organizational factors, complicating their use as objective ground truth. We examine the feasibility and limitations of automating the evaluation of LLM-powered ACR bots in an industrial setting. We analyze an industrial dataset from Beko comprising 2,604 bot-generated PR comments, each labeled by software engineers as fixed/wontFix. Two automated evaluation approaches, G-Eval and an LLM-as-a-Judge pipeline, are applied using both binary decisions and a 0-4 Likert-scale formulation, enabling a controlled comparison against developer-provided labels. Across Gemini-2.5-pro, GPT-4.1-mini, and GPT-5.2, both evaluation strategies achieve only moderate alignment with human labels. Agreement ratios range from approximately 0.44 to 0.62, with noticeable variation across models and between binary and Likert-scale formulations, indicating sensitivity to both model choice and evaluation design. Our findings highlight practical limitations in fully automating the evaluation of ACR bot comments in industrial contexts. Developer actions such as resolving or ignoring comments reflect not only comment quality, but also contextual constraints, prioritization decisions, and workflow dynamics that are difficult to capture through static artifacts. Insights from a follow-up interview with a software engineering director further corroborate that developer labeling behavior is strongly influenced by workflow pressures and organizational constraints, reinforcing the challenges of treating such signals as objective ground truth.
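Abstract (reference translation): Automated code review (ACR) bots are increasingly used in industrial software development to assist developers during pull request (PR) review. As adoption grows, reliably evaluating the usefulness of bot-generated comments at scale becomes a key challenge. In practice, such evaluation often relies on developer actions and annotations shaped by contextual and organizational factors, which complicates their use as objective ground truth. We examine the feasibility and limitations of automating the evaluation of LLM-powered ACR bots in an industrial setting. We analyze an industrial dataset from Beko comprising 2,604 bot-generated PR comments, each labeled by software engineers as fixed/wontFix. Two automated evaluation approaches, G-Eval and an LLM-as-a-Judge pipeline, are applied using both binary decisions and a 0-4 Likert-scale formulation, enabling a controlled comparison against developer-provided labels. Across Gemini-2.5-pro, GPT-4.1-mini, and GPT-5.2, both evaluation strategies achieve only moderate alignment with human labels. Agreement ratios range from approximately 0.44 to 0.62, with noticeable variation across models and between the binary and Likert-scale formulations, indicating sensitivity to both model choice and evaluation design. Our findings highlight practical limitations in fully automating the evaluation of ACR bot comments in industrial contexts. Developer actions such as resolving or ignoring comments reflect not only comment quality but also contextual constraints, prioritization decisions, and workflow dynamics that are difficult to capture through static artifacts. Insights from a follow-up interview with a software engineering director further corroborate that developer labeling behavior is strongly influenced by workflow pressures and organizational constraints, reinforcing the challenges of treating such signals as objective ground truth.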
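The agreement ratios quoted in the abstract can be read as the fraction of comments on which an automated judge and the developer label coincide. The sketch below illustrates that reading only; it is not the authors' pipeline. The function names, the toy data, and in particular the Likert threshold used to turn a 0-4 score into fixed/wontFix are all assumptions made for illustration, since the abstract does not say how Likert scores are compared with the binary labels.

```python
from typing import Sequence


def agreement_ratio(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of comments where the judge's label equals the developer's label."""
    if len(human) != len(judge):
        raise ValueError("label sequences must have the same length")
    if not human:
        raise ValueError("no labels to compare")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)


def likert_to_label(score: int, threshold: int = 2) -> str:
    """Map a 0-4 Likert score to fixed/wontFix.

    The threshold is an assumption for this sketch, not a value from the paper.
    """
    if not 0 <= score <= 4:
        raise ValueError("score must be in the range 0..4")
    return "fixed" if score >= threshold else "wontFix"


# Toy data invented for illustration; not taken from the Beko dataset.
human_labels = ["fixed", "wontFix", "fixed", "wontFix", "fixed"]
judge_binary = ["fixed", "fixed", "fixed", "wontFix", "wontFix"]
judge_likert = [4, 3, 1, 0, 2]

binary_agreement = agreement_ratio(human_labels, judge_binary)
likert_agreement = agreement_ratio(
    human_labels, [likert_to_label(s) for s in judge_likert]
)
print(f"binary: {binary_agreement:.2f}, likert: {likert_agreement:.2f}")
```

Changing the assumed threshold changes which Likert scores count as agreement, which is one way an evaluation design choice can shift the reported ratio.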


Related papers