Fugu-MT paper translation (abstract): Mitigating Hallucinations in Video Large Language Models via Spatiotemporal-Semantic Contrastive Decoding

Paper abstract: Mitigating Hallucinations in Video Large Language Models via Spatiotemporal-Semantic Contrastive Decoding

arxiv url: http://arxiv.org/abs/2601.22574v1
Date: Fri, 30 Jan 2026 05:16:12 GMT
Status: translation complete
System update date: 2026-02-02 18:28:15.237815
Title: Mitigating Hallucinations in Video Large Language Models via Spatiotemporal-Semantic Contrastive Decoding
Title (reference translation): Mitigating Hallucinations in Video Large Language Models via Spatiotemporal Contrastive Decoding
Authors: Yuansheng Gao, Jinman Zhao, Tong Zhang, Xingguo Xu, Han Bao, Zonghui Wang, Wenzhi Chen
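Abstract summary: This paper proposes a decoding strategy called Spatiotemporal-Semantic Contrastive Decoding. The strategy constructs negative features by deliberately disrupting the spatiotemporal consistency and semantic associations of video features. The method not only effectively mitigates hallucinations but also preserves general video understanding and reasoning capabilities.
Reference score (independently computed attention level): 23.767895980891264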
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Although Video Large Language Models perform remarkably well across tasks such as video understanding, question answering, and reasoning, they still suffer from the problem of hallucination, which refers to generating outputs that are inconsistent with explicit video content or factual evidence. However, existing decoding methods for mitigating video hallucinations, while considering the spatiotemporal characteristics of videos, mostly rely on heuristic designs. As a result, they fail to precisely capture the root causes of hallucinations and their fine-grained temporal and semantic correlations, leading to limited robustness and generalization in complex scenarios. To more effectively mitigate video hallucinations, we propose a novel decoding strategy termed Spatiotemporal-Semantic Contrastive Decoding. This strategy constructs negative features by deliberately disrupting the spatiotemporal consistency and semantic associations of video features, and suppresses video hallucinations through contrastive decoding against the original video features during inference. Extensive experiments demonstrate that our method not only effectively mitigates the occurrence of hallucinations, but also preserves the general video understanding and reasoning capabilities of the model.
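Abstract (reference translation): Video Large Language Models perform remarkably well on tasks such as video understanding, question answering, and reasoning, yet they suffer from hallucination: generating outputs that contradict explicit video content or factual evidence. Existing decoding methods for mitigating video hallucinations take the spatiotemporal characteristics of video into account but rely mainly on heuristic designs. As a result, they cannot precisely capture the root causes of hallucinations or their fine-grained temporal and semantic correlations, which limits robustness and generalization in complex scenarios. To mitigate video hallucinations more effectively, we propose a new decoding method called Spatiotemporal-Semantic Contrastive Decoding. This strategy constructs negative features by deliberately disrupting the spatiotemporal consistency and semantic associations of video features, and suppresses video hallucinations by decoding contrastively against the original video features during inference. Extensive experiments show that the method not only effectively reduces hallucinations but also preserves general video understanding and reasoning capabilities.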
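
The abstract describes the mechanism only at a high level: build negative features by disrupting the video features, then decode contrastively against the original features. The Python sketch below illustrates that general idea under stated assumptions and is not the authors' implementation. The scoring rule is the generic contrastive-decoding form (amplify the original branch, subtract the negative branch, and mask tokens that are implausible under the original branch). Frame shuffling is only a stand-in for the paper's disruption of spatiotemporal consistency, and the names `shuffle_frames`, `contrastive_scores`, `alpha`, and `beta` are hypothetical.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def shuffle_frames(frame_features, seed=0):
    """Assumed negative-feature construction: permute the frame order.

    This breaks temporal consistency while keeping every per-frame
    feature vector unchanged. The paper's actual disruption of
    spatiotemporal and semantic structure is not specified in the abstract.
    """
    rng = random.Random(seed)
    shuffled = list(frame_features)
    rng.shuffle(shuffled)
    return shuffled


def contrastive_scores(logits_orig, logits_neg, alpha=1.0, beta=0.1):
    """Generic contrastive decoding over next-token logits.

    score = (1 + alpha) * log p_orig - alpha * log p_neg
    Tokens whose probability under the original branch falls below
    beta * max(p_orig) are masked to -inf (a plausibility constraint).
    alpha and beta are illustrative hyperparameters.
    """
    logp_orig = log_softmax(logits_orig)
    logp_neg = log_softmax(logits_neg)
    cutoff = math.log(beta) + max(logp_orig)
    return [
        (1 + alpha) * lo - alpha * ln if lo >= cutoff else -math.inf
        for lo, ln in zip(logp_orig, logp_neg)
    ]


# Toy next-token logits over a 3-token vocabulary (made-up numbers).
# In practice the second list would come from running the model on
# disrupted features, e.g. the output of shuffle_frames.
logits_orig = [2.0, 1.5, -5.0]   # model conditioned on original features
logits_neg = [3.0, 0.0, -5.0]    # model conditioned on negative features
scores = contrastive_scores(logits_orig, logits_neg)
best = max(range(len(scores)), key=scores.__getitem__)
```

In this toy case, token 0 is the greedy choice under the original logits, but the negative branch favors it even more strongly, so the contrastive score penalizes it and token 1 is selected. Token 2 is masked by the plausibility constraint, which keeps the subtraction from promoting tokens the original branch considers very unlikely.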
