Fugu-MT paper translation (abstract): Empirical Analysis and Detection of Hallucinations in LLM-Generated Bug Report Summaries

Paper overview: Empirical Analysis and Detection of Hallucinations in LLM-Generated Bug Report Summaries

arxiv url: http://arxiv.org/abs/2605.24137v1
Date: Fri, 22 May 2026 18:55:46 GMT
Status: translation complete
System update date: 2026-05-26 19:50:17.644845
Title: Empirical Analysis and Detection of Hallucinations in LLM-Generated Bug Report Summaries
Title (reference translation): Empirical Analysis and Detection of Hallucinations in LLM-Generated Bug Report Summaries
Authors: Hinduja Nirujan, Shreyas Patil, Abdallah Ayoub, Ahmad Abdel Latif, Gouri Ginde
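Abstract summary: Large language models (LLMs) are increasingly used to generate summaries of software bug reports. However, they frequently produce hallucinations that are convincing but unsupported by the source report, which can mislead developers and reduce trust in automated maintenance tools.
Reference score (independently computed attention score): 0.7666240799116113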
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are increasingly used to generate summaries of software bug reports, including sections such as Steps-to-Reproduce (S2R), Actual Behavior (AB), and Expected Behavior (EB). However, these models frequently produce hallucinations that can be convincing but unsupported by the source report. This can mislead developers and reduce trust in automated maintenance tools. Existing hallucination detection approaches typically evaluate outputs at the full-response level and do not consider the structure of technical documents. An initial exploratory study on 80 structured bug report summaries found that approximately 47.9% contained missing information, while 12.3% included fabricated content, highlighting the need for systematic hallucination analysis in bug report summarization. In this work, we empirically investigate hallucinations in LLM-generated bug report summaries from a section-aware perspective. Using the BugsRepo dataset, derived from Mozilla OSS projects, we introduce controlled synthetic hallucination injection to construct a benchmark for training and evaluation. We propose a section-aware hallucination detection approach that jointly predicts whether a summary contains hallucinated content, identifies affected sections, and classifies hallucination types. Experimental results across multiple pretrained language models show that the proposed approach achieves strong performance across all tasks, with the best model obtaining 0.89 report-level Macro-F1, 0.83 section-level Macro-F1, and 0.84 hallucination-type Macro-F1. We further analyze common hallucination patterns and model failure modes to better understand limitations of current LLM-generated bug report summaries. The findings highlight the importance of section-aware hallucination analysis for improving the reliability of LLM-assisted bug report summarization in software maintenance workflows.
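Abstract (reference translation): Large language models (LLMs) are increasingly used to generate summaries of software bug reports, including sections such as Steps-to-Reproduce (S2R), Actual Behavior (AB), and Expected Behavior (EB). However, these models often produce hallucinations that are convincing but not supported by the source report, which can mislead developers and reduce trust in automated maintenance tools. Existing hallucination detection methods typically evaluate outputs at the full-response level and do not take the structure of technical documents into account. An initial exploratory study of 80 structured bug report summaries found that approximately 47.9% contained missing information and 12.3% contained fabricated content, underscoring the need for systematic hallucination analysis in bug report summarization. This work empirically examines hallucinations in LLM-generated bug report summaries from a section-aware perspective. Using the BugsRepo dataset, derived from Mozilla OSS projects, the authors introduce controlled synthetic hallucination injection to build a benchmark for training and evaluation. They propose a section-aware hallucination detection approach that jointly predicts whether a summary contains hallucinated content, identifies the affected sections, and classifies the hallucination type. Experiments across multiple pretrained language models show strong performance on all tasks, with the best model reaching 0.89 report-level Macro-F1, 0.83 section-level Macro-F1, and 0.84 hallucination-type Macro-F1. The authors further analyze common hallucination patterns and model failure modes to better understand the limitations of current LLM-generated bug report summaries. The findings highlight the importance of section-aware hallucination analysis for improving the reliability of LLM-assisted bug report summarization in software maintenance workflows.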

Related papers