Fugu-MT Paper Translation (Abstract): FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration

Paper summary: FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration

arxiv url: http://arxiv.org/abs/2510.04852v1
Date: Mon, 06 Oct 2025 14:39:58 GMT
Status: Translation complete
System update date: 2025-10-07 16:52:59.905223
Title: FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
Title (reference translation): FreshBrew: An Evaluation Benchmark for AI Agents on Java Code Migration
Authors: Victor May, Diganta Misra, Yanqi Luo, Anjali Sridhar, Justine Gehring, Silvio Soares Ribeiro Junior,
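Abstract summary: We introduce FreshBrew, a new benchmark for evaluating AI agents on project-level Java migrations. We benchmark several state-of-the-art LLMs and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 52.3% of projects to JDK 17.
Reference score (independently computed attention score): 2.981397088242044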
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative, but their effectiveness has not been systematically evaluated. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI agents on project-level Java migrations, with a specific focus on measuring an agent's ability to preserve program semantics and avoid reward hacking, which we argue requires projects with high test coverage for a rigorous and reliable evaluation. We benchmark several state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 52.3 percent of projects to JDK 17. Our empirical analysis reveals novel insights into the critical strengths and limitations of current agentic approaches, offering actionable insights into their real-world applicability. Our empirical study reveals failure modes of current AI agents in realistic Java modernization tasks, providing a foundation for evaluating trustworthy code-migration systems. By releasing FreshBrew, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization.
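Abstract (reference translation): AI coding assistants are rapidly becoming indispensable to modern software development. A key challenge in this area is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative, but their effectiveness has not been systematically evaluated. In this paper, we introduce FreshBrew, a new benchmark for evaluating AI agents on project-level Java migrations, with a focus on measuring an agent's ability to preserve program semantics and avoid reward hacking. We benchmark several state-of-the-art LLMs and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, successfully migrated 52.3% of projects to JDK 17. Our empirical analysis reveals new insights into the strengths and limitations of current agentic approaches and offers actionable insights into their real-world applicability. Our empirical study reveals the failure modes of current AI agents in realistic Java modernization tasks, providing a foundation for evaluating trustworthy code-migration systems. By releasing FreshBrew, we aim to facilitate rigorous, reproducible evaluation and to accelerate progress in AI-driven codebase modernization.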

Related papers