Fugu-MT Paper Translation (Abstract): SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding

Paper summary: SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding

arxiv url: http://arxiv.org/abs/2512.04643v1
Date: Thu, 04 Dec 2025 10:17:20 GMT
Status: Translation complete
System update date: 2025-12-05 21:11:46.110886
Title: SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
Authors: Chang-Hsun Wu, Kai-Po Chang, Yu-Yang Sheng, Hung-Kai Chung, Kuei-Chun Wang, Yu-Chiang Frank Wang
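Abstract summary: This paper proposes a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. SEASON outperforms existing training-free hallucination mitigation approaches on three hallucination examination benchmarks.
Reference score (independently computed attention metric): 30.820850789099932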
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. As a result, they often generate descriptions of events that are temporally inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g., object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improving VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
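The abstract does not give SEASON's formulas, so the following is only a minimal sketch of generic contrastive decoding with two negatives, not the authors' method. It assumes the next-token logits are already available for the original video and for hypothetical "temporal negative" and "spatial negative" inputs, and it takes the per-token weights `alpha_t` and `alpha_s` as arguments because the abstract does not specify how the self-diagnosis step computes them. All function and parameter names are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_logits(orig, temporal_neg, spatial_neg, alpha_t, alpha_s):
    """Combine log-probabilities so that tokens favored by a negative are penalized.

    orig, temporal_neg, spatial_neg: next-token logits (same vocabulary order).
    alpha_t, alpha_s: non-negative per-token weights (assumed to come from some
    diagnosis step that is NOT specified here).
    """
    lo = log_softmax(orig)
    lt = log_softmax(temporal_neg)
    ls = log_softmax(spatial_neg)
    return [
        (1.0 + alpha_t + alpha_s) * o - alpha_t * t - alpha_s * s
        for o, t, s in zip(lo, lt, ls)
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])
```

With both weights set to zero the function returns the ordinary log-probabilities; raising `alpha_t` shifts the choice away from tokens that the temporal negative also predicts strongly, which is the general idea behind contrasting against a negative.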


Related papers