Fugu-MT Paper Translation (Abstract): Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation

Paper summary: Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation

arxiv url: http://arxiv.org/abs/2510.18439v1
Date: Tue, 21 Oct 2025 09:13:46 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:13.232501
Title: Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation
Title (reference translation): Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation
Authors: Yasser Hamidullah, Koel Dutta Chowdury, Yusser Al-Ghussin, Shakib Yazdani, Cennet Oguz, Josef van Genabith, Cristina España-Bonet
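Abstract summary: Hallucination is a major flaw in vision-language models and is particularly critical in sign language translation. This paper proposes a token-level reliability measure that quantifies how much the decoder uses visual information. The results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations.
Reference score (independently computed attention score): 13.03365340564181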
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are especially vulnerable because they map continuous signer movements directly into natural language without intermediate gloss supervision that serves as alignment. We argue that hallucinations arise when models rely on language priors rather than visual input. To capture this, we propose a token-level reliability measure that quantifies how much the decoder uses visual information. Our method combines feature-based sensitivity, which measures internal changes when video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into a sentence-level reliability score, providing a compact and interpretable measure of visual grounding. We evaluate the proposed measure on two SLT benchmarks (PHOENIX-2014T and CSL-Daily) with both gloss-based and gloss-free models. Our results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations. Beyond these quantitative trends, we also find that reliability distinguishes grounded tokens from guessed ones, allowing risk estimation without references; when combined with text-based signals (confidence, perplexity, or entropy), it further improves hallucination risk estimation. Qualitative analysis highlights why gloss-free models are more susceptible to hallucinations. Taken together, our findings establish reliability as a practical and reusable tool for diagnosing hallucinations in SLT, and lay the groundwork for more robust hallucination detection in multimodal generation.
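Abstract (reference translation): Hallucination, in which a model generates fluent text that is not supported by visual evidence, is a major flaw of vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in the video, and gloss-free models are especially vulnerable because they map continuous signer movements directly to natural language without the intermediate gloss supervision that serves as alignment. We argue that hallucinations arise when models rely on language priors rather than on visual input. To capture this, we propose a token-level reliability measure that quantifies how much the decoder uses visual information. The method combines feature-based sensitivity, which measures internal changes when the video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into a sentence-level reliability score, giving a compact and interpretable measure of visual grounding. We evaluate the measure on two SLT benchmarks (PHOENIX-2014T and CSL-Daily) with both gloss-based and gloss-free models. The results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations. Beyond these quantitative trends, reliability distinguishes grounded tokens from guessed ones, which allows risk estimation without references; combined with text-based signals (confidence, perplexity, or entropy), it further improves hallucination risk estimation. Qualitative analysis highlights why gloss-free models are more susceptible to hallucinations. Taken together, the findings establish reliability as a practical and reusable tool for diagnosing hallucinations in SLT and lay the groundwork for more robust hallucination detection in multimodal generation.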
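
The reliability measure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas, so the cosine-distance sensitivity, the clipped probability drop, the mixing weight `alpha`, the mean aggregation, and all function names below are assumptions chosen only to show how the two signals might be combined.

```python
import math
from typing import Sequence


def _clip01(x: float) -> float:
    """Clamp a value to the interval [0, 1]."""
    return min(1.0, max(0.0, x))


def feature_sensitivity(h_clean: Sequence[float], h_masked: Sequence[float]) -> float:
    """Assumed sensitivity: 1 - cosine similarity between the decoder state
    computed with the clean video and with the masked video.
    A value near 0 means masking the video barely changed the state."""
    dot = sum(a * b for a, b in zip(h_clean, h_masked))
    norm = math.sqrt(sum(a * a for a in h_clean)) * math.sqrt(sum(b * b for b in h_masked))
    if norm == 0.0:
        return 0.0  # degenerate states carry no usable signal
    return _clip01(1.0 - dot / norm)


def counterfactual_signal(p_clean: float, p_altered: float) -> float:
    """Assumed counterfactual signal: how much the probability of the emitted
    token drops when the video input is altered (negative drops count as 0)."""
    return _clip01(p_clean - p_altered)


def token_reliability(sensitivity: float, counterfactual: float, alpha: float = 0.5) -> float:
    """Hypothetical combination: convex mix of the two token-level signals."""
    return alpha * sensitivity + (1.0 - alpha) * counterfactual


def sentence_reliability(token_scores: Sequence[float]) -> float:
    """Hypothetical aggregation: mean of the token-level scores."""
    if not token_scores:
        return 0.0
    return sum(token_scores) / len(token_scores)


if __name__ == "__main__":
    # Toy values only; a real system would read these from the SLT decoder.
    grounded = token_reliability(
        feature_sensitivity([1.0, 0.0], [0.0, 1.0]),  # state changes a lot
        counterfactual_signal(0.9, 0.2),              # probability drops a lot
    )
    guessed = token_reliability(
        feature_sensitivity([1.0, 0.0], [1.0, 0.0]),  # state unchanged
        counterfactual_signal(0.8, 0.8),              # probability unchanged
    )
    print(sentence_reliability([grounded, guessed]))
```

Under these assumptions, a token whose decoder state and probability are unaffected by masking or altering the video receives a score near 0, which corresponds to the "guessed" case in the abstract, while a token that depends strongly on the video scores closer to 1.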

Related papers