Fugu-MT Paper Translation (Abstract): BloomAPR: A Bloom's Taxonomy-based Framework for Assessing the Capabilities of LLM-Powered APR Solutions

Paper overview: BloomAPR: A Bloom's Taxonomy-based Framework for Assessing the Capabilities of LLM-Powered APR Solutions

arxiv url: http://arxiv.org/abs/2509.25465v1
Date: Mon, 29 Sep 2025 20:16:11 GMT
Status: Translation complete
Last updated in system: 2025-10-01 17:09:04.307968
Title: BloomAPR: A Bloom's Taxonomy-based Framework for Assessing the Capabilities of LLM-Powered APR Solutions
Title (reference translation): BloomAPR: A Bloom's Taxonomy-based framework for evaluating the capabilities of LLM-driven APR solutions
Authors: Yinghang Ma, Jiho Shin, Leuson Da Silva, Zhen Ming Jiang, Song Wang, Foutse Khomh, Shin Hwei Tan
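Abstract summary: We introduce BloomAPR, a new dynamic evaluation framework grounded in Bloom's Taxonomy. Our framework provides a structured approach to assessing the cognitive capabilities of LLM-powered APR solutions across progressively more complex reasoning levels. The results indicate that these solutions exhibit basic reasoning skills and that their performance increases on synthetically generated bugs.
Reference score (independently computed attention score): 19.682278660857584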
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in large language models (LLMs) have accelerated the development of AI-driven automated program repair (APR) solutions. However, these solutions are typically evaluated using static benchmarks such as Defects4J and SWE-bench, which suffer from two key limitations: (1) the risk of data contamination, potentially inflating evaluation results due to overlap with LLM training data, and (2) limited ability to assess APR capabilities in dynamic and diverse contexts. In this paper, we introduce BloomAPR, a novel dynamic evaluation framework grounded in Bloom's Taxonomy. Our framework offers a structured approach to assess the cognitive capabilities of LLM-powered APR solutions across progressively complex reasoning levels. Using Defects4J as a case study, we evaluated two state-of-the-art LLM-powered APR solutions, ChatRepair and CigaR, under three different LLMs: GPT-3.5-Turbo, Llama-3.1, and StarCoder-2. Our findings show that while these solutions exhibit basic reasoning skills and effectively memorize bug-fixing patterns (fixing up to 81.57% of bugs at the Remember layer), their performance increases with synthetically generated bugs (up to 60.66% increase at the Understand layer). However, they perform worse on minor syntactic changes (fixing up to 43.32% at the Apply layer), and they struggle to repair similar bugs when injected into real-world projects (solving only 13.46% to 41.34% of bugs at the Analyze layer). These results underscore the urgent need for evolving benchmarks and provide a foundation for more trustworthy evaluation of LLM-powered software engineering solutions.
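Abstract (reference translation): Recent advances in large language models (LLMs) have accelerated the development of AI-driven automated program repair (APR) solutions. However, these solutions are typically evaluated on static benchmarks such as Defects4J and SWE-bench, which have two main limitations: (1) the risk of data contamination, which can inflate evaluation results through overlap with LLM training data, and (2) limited ability to assess APR capabilities in dynamic and diverse contexts. In this paper, we introduce BloomAPR, a new dynamic evaluation framework based on Bloom's Taxonomy. Our framework provides a structured approach to assessing the cognitive capabilities of LLM-powered APR solutions across progressively more complex reasoning levels. Using Defects4J as a case study, we evaluated two LLM-powered APR solutions, ChatRepair and CigaR, under three LLMs: GPT-3.5-Turbo, Llama-3.1, and StarCoder-2. The results show that these solutions exhibit basic reasoning skills and effectively memorize bug-fixing patterns (fixing up to 81.57% of bugs at the Remember layer), and that their performance increases on synthetically generated bugs (an increase of up to 60.66% at the Understand layer). However, they perform worse on minor syntactic changes (fixing up to 43.32% at the Apply layer) and struggle to repair similar bugs injected into real-world projects (solving only 13.46% to 41.34% of bugs at the Analyze layer). These results underscore the urgent need for evolving benchmarks and provide a foundation for more trustworthy evaluation of LLM-powered software engineering solutions.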

Related papers