Fugu-MT Paper Translation (Abstract): Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice

Paper overview: Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice

arxiv url: http://arxiv.org/abs/2511.07017v1
Date: Mon, 10 Nov 2025 12:06:35 GMT
Status: Translation complete
Last updated in system: 2025-11-11 21:18:45.233212
Title: Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice
Title (reference translation): Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice
Authors: Ruida Hu, Xinchen Wang, Xin-Cheng Wen, Zhao Zhang, Bo Jiang, Pengfei Gao, Chao Peng, Cuiyun Gao
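Abstract summary: ContextCRBench is a benchmark for fine-grained LLM evaluation in code review. Its construction pipeline collects 153.7K issues and pull requests from top-tier repositories. It supports three evaluation scenarios aligned with the review workflow.
Reference score (independently computed attention measure): 18.222990693059756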
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Code review is a cornerstone of software quality assurance, and recent advances in Large Language Models (LLMs) have shown promise in automating this process. However, existing benchmarks for LLM-based code review face three major limitations. (1) Lack of semantic context: most benchmarks provide only code diffs without textual information such as issue descriptions, which are crucial for understanding developer intent. (2) Data quality issues: without rigorous validation, many samples are noisy (e.g., reviews on outdated or irrelevant code), reducing evaluation reliability. (3) Coarse granularity: most benchmarks operate at the file or commit level, overlooking the fine-grained, line-level reasoning essential for precise review. We introduce ContextCRBench, a high-quality, context-rich benchmark for fine-grained LLM evaluation in code review. Our construction pipeline comprises: (1) Raw Data Crawling, collecting 153.7K issues and pull requests from top-tier repositories; (2) Comprehensive Context Extraction, linking issue-PR pairs for textual context and extracting the full surrounding function or class for code context; and (3) Multi-stage Data Filtering, combining rule-based and LLM-based validation to remove outdated, malformed, or low-value samples, resulting in 67,910 context-enriched entries. ContextCRBench supports three evaluation scenarios aligned with the review workflow: (1) hunk-level quality assessment, (2) line-level defect localization, and (3) line-level comment generation. Evaluating eight leading LLMs (four closed-source and four open-source) reveals that textual context yields greater performance gains than code context alone, while current LLMs remain far from human-level review ability. Deployed at ByteDance, ContextCRBench drives a self-evolving code review system, improving performance by 61.98% and demonstrating its robustness and industrial utility.
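Abstract (reference translation): Code review is a cornerstone of software quality assurance, and recent advances in Large Language Models (LLMs) show promise for automating the process. However, existing benchmarks for LLM-based code review face three major limitations. (1) Lack of semantic context: most benchmarks provide only code diffs, without textual information such as issue descriptions that is essential for understanding developer intent. (2) Data quality issues: without rigorous validation, many samples are noisy (for example, reviews of outdated or irrelevant code), which reduces evaluation reliability. (3) Coarse granularity: most benchmarks operate at the file or commit level and overlook the fine-grained, line-level reasoning that precise review requires. We introduce ContextCRBench, a high-quality, context-rich benchmark for fine-grained LLM evaluation in code review. The construction pipeline comprises (1) raw data crawling, which collects 153.7K issues and pull requests from top-tier repositories; (2) comprehensive context extraction, which links issue-PR pairs for textual context and extracts the full surrounding function or class for code context; and (3) multi-stage data filtering, which combines rule-based and LLM-based validation to remove outdated, malformed, or low-value samples, yielding 67,910 context-enriched entries. ContextCRBench supports three evaluation scenarios aligned with the review workflow: (1) hunk-level quality assessment, (2) line-level defect localization, and (3) line-level comment generation. An evaluation of eight leading LLMs (four closed-source and four open-source) shows that textual context yields greater performance gains than code context alone, while current LLMs remain far from human-level review ability. Deployed at ByteDance, ContextCRBench drives a self-evolving code review system, improving performance by 61.98% and demonstrating its robustness and industrial utility.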
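
Illustrative sketch (not from the paper): the abstract describes a filtering stage that removes outdated, malformed, or low-value samples using rule-based and LLM-based validation. The Python below shows only what a rule-based pass of that kind might look like. The record type `ReviewSample`, its field names, the function `rule_based_filter`, and the `min_comment_words` threshold are all hypothetical and chosen for illustration; the paper's actual rules and data schema are not given in the abstract, and the LLM-based validation step is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReviewSample:
    """Hypothetical record; field names are illustrative, not the paper's schema."""
    diff_hunk: str        # changed code under review
    review_comment: str   # reviewer's comment on that hunk
    issue_text: str       # textual context from the linked issue
    code_context: str     # surrounding function or class
    is_outdated: bool     # True if the comment targets code that later changed


def rule_based_filter(samples: List[ReviewSample],
                      min_comment_words: int = 3) -> List[ReviewSample]:
    """Drop outdated, malformed, or low-value samples (assumed rules)."""
    kept = []
    for s in samples:
        if s.is_outdated:
            continue  # outdated: the review no longer matches the code
        if not s.diff_hunk.strip() or not s.review_comment.strip():
            continue  # malformed: missing hunk or missing comment
        if len(s.review_comment.split()) < min_comment_words:
            continue  # low-value: too short to carry review content
        kept.append(s)
    return kept


samples = [
    ReviewSample("- return a\n+ return a + b",
                 "This changes the return value; is the caller updated?",
                 "Sum is wrong", "def add(a, b): ...", False),
    ReviewSample("- x = 1\n+ x = 2", "Why was this constant changed here?",
                 "", "", True),                      # outdated
    ReviewSample("- y = 1\n+ y = 2", "LGTM", "", "", False),  # low-value
    ReviewSample("   ", "Please add a unit test for this.", "", "", False),  # malformed
]

kept = rule_based_filter(samples)
print(len(kept), "of", len(samples), "samples kept")
```

In this toy input only the first sample passes all three checks. A word-count threshold is a crude stand-in for "low-value"; the abstract indicates that LLM-based validation is combined with rules for this purpose.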


Related papers