Fugu-MT Paper Translation (Abstract): LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

Paper overview: LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

arxiv url: http://arxiv.org/abs/2604.25665v1
Date: Tue, 28 Apr 2026 14:00:09 GMT
Status: Translation complete
Last updated in system: 2026-04-29 16:49:17.890822
Title: LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation
Authors: Huyen Nguyen, Haoxuan Zhang, Yang Zhang, Junhua Ding, Haihua Chen
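Abstract summary: The authors conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets. The results show that traditional lexical overlap metrics (e.g., ROUGE, BLEU) exhibit weak or negative correlation with human judgments. They propose LLM-ReSum, a self-reflective summarization framework that integrates LLM-based evaluation and generation in a closed feedback loop without model finetuning.
Reference score (attention score computed independently by the site): 5.106530060248491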
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains, covering documents from short news articles to long scientific, governmental, and legal texts (2K-27K words) with over 1,500 human-annotated summaries. Our results show that traditional lexical overlap metrics (e.g., ROUGE, BLEU) exhibit weak or negative correlation with human judgments, while task-specific neural metrics and LLM-based evaluators achieve substantially higher alignment, especially for linguistic quality assessment. Leveraging these findings, we propose LLM-ReSum, a self-reflective summarization framework that integrates LLM-based evaluation and generation in a closed feedback loop without model finetuning. Across three domains, LLM-ReSum improves low-quality summaries by up to 33% in factual accuracy and 39% in coverage, with human evaluators preferring refined summaries in 89% of cases. We additionally introduce PatentSumEval, a new human-annotated benchmark for legal document summarization comprising 180 expert-evaluated summaries. All code and datasets will be released on GitHub.
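The abstract describes the closed feedback loop only at a high level, so the following is a minimal sketch of what an evaluate-then-revise loop without finetuning might look like, not the authors' implementation. The function `reflective_summarize`, its `threshold` and `max_rounds` parameters, and the three callables `generate`, `evaluate`, and `revise` are all hypothetical names introduced here; in practice each callable would wrap a prompted LLM call, while the toy versions below are simple string functions so that the loop can run on its own.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]


def reflective_summarize(
    document: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], Tuple[Scores, str]],
    revise: Callable[[str, str, str], str],
    threshold: float = 0.8,  # hypothetical acceptance threshold
    max_rounds: int = 3,     # hypothetical cap on revision rounds
) -> Tuple[str, Scores]:
    """Generate a summary, then alternate evaluation and revision.

    The three callables are placeholders for prompted LLM calls
    (an assumption of this sketch, not an API from the paper).
    """
    summary = generate(document)
    scores, feedback = evaluate(document, summary)
    for _ in range(max_rounds):
        # Stop once the weakest quality dimension is good enough.
        if min(scores.values()) >= threshold:
            break
        candidate = revise(document, summary, feedback)
        cand_scores, cand_feedback = evaluate(document, candidate)
        # Stop if the revision did not improve the weakest dimension.
        if min(cand_scores.values()) <= min(scores.values()):
            break
        summary, scores, feedback = candidate, cand_scores, cand_feedback
    return summary, scores


# Toy stand-ins so the loop can be exercised without any model.
def toy_generate(document: str) -> str:
    # Use the first sentence as the initial summary.
    return document.split(".")[0] + "."


def toy_evaluate(document: str, summary: str) -> Tuple[Scores, str]:
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    missing = [s for s in sentences if s not in summary]
    coverage = (len(sentences) - len(missing)) / len(sentences)
    # The toy summary only copies source sentences, so accuracy is fixed at 1.0.
    scores = {"coverage": coverage, "factual_accuracy": 1.0}
    feedback = missing[0] if missing else ""
    return scores, feedback


def toy_revise(document: str, summary: str, feedback: str) -> str:
    # Append the sentence that the evaluator reported as missing.
    return f"{summary} {feedback}." if feedback else summary


toy_document = "A b. C d. E f."
final_summary, final_scores = reflective_summarize(
    toy_document, toy_generate, toy_evaluate, toy_revise
)
```

The loop keeps a revision only when it raises the weakest score, which is one simple way to keep the refinement from regressing; other acceptance rules are equally plausible.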
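A meta-evaluation of the kind the abstract describes compares each metric's scores with human ratings of the same summaries. The abstract does not say which correlation statistic is used, so the sketch below assumes Spearman rank correlation as one common choice. The helper names and the ratings are invented for illustration and are not data from the paper.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented toy ratings for five summaries (not from the paper).
human = [1, 2, 3, 4, 5]
metric_aligned = [0.1, 0.2, 0.3, 0.4, 0.5]   # rises with the human ratings
metric_inverted = [0.5, 0.4, 0.3, 0.2, 0.1]  # falls as the human ratings rise

rho_aligned = spearman(human, metric_aligned)
rho_inverted = spearman(human, metric_inverted)
```

A value near 1 means the metric orders summaries the way human annotators do, and a negative value means it tends to order them the opposite way, which is the sense in which the abstract reports weak or negative correlation for lexical overlap metrics.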

Related papers