Fugu-MT Paper Translation (Abstract): Evaluating Large Language Models for Code Review

Paper summary: Evaluating Large Language Models for Code Review

arxiv url: http://arxiv.org/abs/2505.20206v1
Date: Mon, 26 May 2025 16:47:29 GMT
Status: Translation complete
Last updated in system: 2025-05-27 19:27:27.017759
Title: Evaluating Large Language Models for Code Review
Authors: Umut Cihan, Arda İçöz, Vahid Haratian, Eray Tüzün
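Abstract summary: GPT4o and Gemini 2.0 Flash were tested on 492 AI-generated code blocks. With problem descriptions, GPT4o and Gemini 2.0 Flash correctly classified code correctness 68.50% and 63.89% of the time, respectively, and corrected the code 67.83% and 54.26% of the time.
Reference score (independently computed attention score): 2.0261749670612637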
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Context: Code reviews are crucial for software quality. Recent AI advances have allowed large language models (LLMs) to review and fix code; now, there are tools that perform these reviews. However, their reliability and accuracy have not yet been systematically evaluated. Objective: This study compares different LLMs' performance in detecting code correctness and suggesting improvements. Method: We tested GPT4o and Gemini 2.0 Flash on 492 AI-generated code blocks of varying correctness, along with 164 canonical code blocks from the HumanEval benchmark. To simulate the code review task objectively, we expected LLMs to assess code correctness and improve the code if needed. We ran experiments with different configurations and reported on the results. Results: With problem descriptions, GPT4o and Gemini 2.0 Flash correctly classified code correctness 68.50% and 63.89% of the time, respectively, and corrected the code 67.83% and 54.26% of the time for the 492 code blocks of varying correctness. Without problem descriptions, performance declined. The results for the 164 canonical code blocks differed, suggesting that performance depends on the type of code. Conclusion: LLM code reviews can help suggest improvements and assess correctness, but there is a risk of faulty outputs. We propose a process that involves humans, called "Human in the loop LLM Code Review", to promote knowledge sharing while mitigating the risk of faulty outputs.
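The two headline figures in the abstract (how often a model classifies code correctness correctly, and how often it corrects the code) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ReviewRecord` fields, the function names, and in particular the definition of the correction rate as the share of initially incorrect code blocks whose revised version passes. The paper may define and compute these metrics differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReviewRecord:
    """One reviewed code block (hypothetical schema, not taken from the paper)."""
    is_correct: bool               # ground truth: the original code passes its tests
    judged_correct: bool           # the reviewing model's verdict on the original code
    fixed_passes: Optional[bool]   # revised code passes its tests; None if not evaluated


def classification_accuracy(records: List[ReviewRecord]) -> float:
    """Percentage of records where the model's verdict matches the ground truth."""
    if not records:
        raise ValueError("no records to evaluate")
    hits = sum(r.is_correct == r.judged_correct for r in records)
    return 100.0 * hits / len(records)


def correction_rate(records: List[ReviewRecord]) -> float:
    """Percentage of initially incorrect blocks whose revised version passes.

    Assumed definition for illustration only.
    """
    incorrect = [r for r in records if not r.is_correct]
    if not incorrect:
        raise ValueError("no incorrect records to evaluate")
    fixed = sum(r.fixed_passes is True for r in incorrect)
    return 100.0 * fixed / len(incorrect)


# Made-up example data, not results from the study.
sample = [
    ReviewRecord(is_correct=True, judged_correct=True, fixed_passes=None),
    ReviewRecord(is_correct=False, judged_correct=True, fixed_passes=False),
    ReviewRecord(is_correct=False, judged_correct=False, fixed_passes=True),
    ReviewRecord(is_correct=True, judged_correct=False, fixed_passes=None),
]
print(classification_accuracy(sample))
print(correction_rate(sample))
```

Keeping the verdict and the fix outcome as separate fields makes it possible to report the two metrics independently, which matches how the abstract presents them.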

Related papers