Fugu-MT Paper Translation (Abstract): Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models

Paper abstract: Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models

arxiv url: http://arxiv.org/abs/2508.10192v1
Date: Wed, 13 Aug 2025 20:55:26 GMT
Status: Translation complete
Last updated in system: 2025-08-15 22:24:48.118515
Title: Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models
Authors: Igor Halperin,
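Abstract summary: This paper introduces Semantic Divergence Metrics (SDM), a novel framework for detecting Faithfulness Hallucinations. A heatmap of topic co-occurrences between prompts and responses can be viewed as a quantified two-dimensional visualization of the user-machine dialogue.
Reference score (independently computed attention score): 0.0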
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The proliferation of Large Language Models (LLMs) is challenged by hallucinations, critical failure modes where models generate non-factual, nonsensical or unfaithful text. This paper introduces Semantic Divergence Metrics (SDM), a novel lightweight framework for detecting Faithfulness Hallucinations: events of severe deviation of LLM responses from input contexts. We focus on a specific implementation of these LLM errors, confabulations, defined as responses that are arbitrary and semantically misaligned with the user's query. Existing methods like Semantic Entropy test for arbitrariness by measuring the diversity of answers to a single, fixed prompt. Our SDM framework improves upon this by being more prompt-aware: we test for a deeper form of arbitrariness by measuring response consistency not only across multiple answers but also across multiple, semantically equivalent paraphrases of the original prompt. Methodologically, our approach uses joint clustering on sentence embeddings to create a shared topic space for prompts and answers. A heatmap of topic co-occurrences between prompts and responses can be viewed as a quantified two-dimensional visualization of the user-machine dialogue. We then compute a suite of information-theoretic metrics to measure the semantic divergence between prompts and responses. Our practical score, $\mathcal{S}_H$, combines the Jensen-Shannon divergence and Wasserstein distance to quantify this divergence, with a high score indicating a Faithfulness Hallucination. Furthermore, we identify the KL divergence KL(Answer || Prompt) as a powerful indicator of Semantic Exploration, a key signal for distinguishing different generative behaviors. These metrics are further combined into the Semantic Box, a diagnostic framework for classifying LLM response types, including the dangerous, confident confabulation.
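
Illustrative sketch (not from the paper): the abstract describes comparing topic distributions of prompts and answers with the Jensen-Shannon divergence and KL(Answer || Prompt). The Python below shows one way those two quantities could be computed once each prompt and answer sentence has a cluster label in a shared topic space. The cluster labels, the smoothing constant `eps`, and all function names are assumptions made for illustration. The Wasserstein term and the exact formula for $\mathcal{S}_H$ are left out, because the abstract does not specify them.

```python
import math
from collections import Counter


def topic_distribution(labels, num_topics, eps=1e-9):
    """Turn cluster labels into a probability vector over topics.

    eps is an assumed smoothing constant so that KL stays finite
    when a topic is absent from one side.
    """
    counts = Counter(labels)
    raw = [counts.get(k, 0) + eps for k in range(num_topics)]
    total = sum(raw)
    return [v / total for v in raw]


def kl_divergence(p, q):
    """KL(p || q) in bits; q must have no zero entries."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (bounded between 0 and 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic labels for sentences from prompt paraphrases and answers.
prompt_labels = [0, 0, 1, 1, 0, 1]
aligned_answer_labels = [0, 1, 0, 1, 1, 0]   # same topics as the prompts
drifting_answer_labels = [2, 2, 3, 2, 3, 3]  # topics the prompts never touch

NUM_TOPICS = 4
P = topic_distribution(prompt_labels, NUM_TOPICS)
A_aligned = topic_distribution(aligned_answer_labels, NUM_TOPICS)
A_drift = topic_distribution(drifting_answer_labels, NUM_TOPICS)

jsd_aligned = js_divergence(P, A_aligned)
jsd_drift = js_divergence(P, A_drift)
kl_drift = kl_divergence(A_drift, P)  # KL(Answer || Prompt)

print(f"JSD aligned:  {jsd_aligned:.4f} bits")
print(f"JSD drifting: {jsd_drift:.4f} bits")
print(f"KL(Answer || Prompt), drifting: {kl_drift:.2f} bits")
```

In this toy setup the aligned answers share the prompts' topic distribution, so their divergence is near zero, while the drifting answers occupy disjoint topics and push the Jensen-Shannon divergence toward its upper bound of 1 bit. The size of the KL value in the drifting case depends heavily on the assumed smoothing constant, so it should be read as a qualitative signal only.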

Related papers