Fugu-MT Paper Translation (Abstract): Automated Classification of Human Code Review Comments with Large Language Models

Paper overview: Automated Classification of Human Code Review Comments with Large Language Models

arxiv url: http://arxiv.org/abs/2604.23667v1
Date: Sun, 26 Apr 2026 12:07:54 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:07.490929
Title: Automated Classification of Human Code Review Comments with Large Language Models
Authors: Semih Çağlar, Şükrü Eren Gökırmak, Eray Tüzün
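Abstract summary: This work aims to design and evaluate an automated system for classifying code review comments according to specific categories of issues. We introduced a nine-label taxonomy for code review comments, covering six review comment smells and three useful intents. We benchmarked zero-shot and one-shot single-label classification over each comment and its associated unified diff hunk, comparing GPT-5-mini, LLaMA-3.3, and DeepSeek-R1.
Reference score (independently computed attention score): 1.4465033892011254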
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Context: Code reviews are essential for maintaining software quality, yet many human review comments suffer from issues such as redundancy, vagueness, or lack of constructiveness. These types of comments may slow down feedback and obscure important insights. Prior work on code review comments mostly explores the detection and categorization of useful comments, while fine-grained categorization of comment issues remains underexplored. Objective: This work aims to design and evaluate an automated system for classifying code review comments according to specific categories of issues. Methodology: We introduced a nine-label taxonomy for code review comments, covering six review comment smells and three common useful intents, and manually labeled 448 comments from a publicly available dataset. We benchmarked zero-shot and one-shot single-label classification over each comment and its associated unified diff hunk, comparing GPT-5-mini, LLaMA-3.3, and DeepSeek-R1. We reported macro-F1 as the primary metric. Results: Zero-shot performance was moderate under class imbalance (macro-F1 0.360 to 0.374). One-shot exemplar conditioning had model-dependent effects: the macro-F1 scores of GPT-5-mini and DeepSeek-R1 improved, while that of LLaMA-3.3 decreased slightly. Exemplars most consistently helped intent-boundary labels, whereas classification of evidence-sensitive labels remains challenging. Conclusion: Our results indicate that comment-diff evidence is sufficient for some labels but limited for evidence-sensitive smells. Future work includes adding thread context, improving intent-preserving rewrites, and validating robustness across platforms.
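The abstract reports macro-F1 as the primary metric and notes that performance was moderate under class imbalance. The following is a minimal sketch of how macro-F1 is generally computed, not the authors' evaluation code. The label names (`vague`, `redundant`) and the toy predictions are hypothetical and do not come from the paper's taxonomy or dataset.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels=None):
    """Return the unweighted mean of per-label F1 scores."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # A label with no true or predicted instances scores 0.0 here.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, imbalanced toy data: 8 "vague" comments, 2 "redundant" ones.
y_true = ["vague"] * 8 + ["redundant"] * 2
# A classifier that always predicts the majority label.
y_pred = ["vague"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_f1={score:.3f}")
```

In this toy case the majority-label classifier reaches 0.8 accuracy, but its macro-F1 is only about 0.444 because the minority label contributes an F1 of zero with equal weight. This illustrates why macro-F1 is a stricter choice than accuracy when labels are imbalanced.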


Related papers