Fugu-MT Paper Translation (Summary): ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding

Paper summary: ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding

arxiv url: http://arxiv.org/abs/2508.21496v2
Date: Tue, 02 Sep 2025 17:14:38 GMT
Status: Translation complete
Last updated in system: 2025-09-03 14:24:52.716985
Title: ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
Authors: Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu
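Abstract summary: ELV-Halluc is the first benchmark dedicated to long-video hallucination. Models are more prone to Semantic Aggregation Hallucination (SAH) when semantics change rapidly. The authors also achieve improvements on both ELV-Halluc and Video-MME.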
Reference score (independently computed attention score): 61.526407756322264
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination, that is, producing content that is inconsistent with or unrelated to the video input. Previous video hallucination benchmarks primarily focus on short videos. They attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes do account for most hallucinations in short videos, they still oversimplify the causes of hallucination. Sometimes models generate incorrect outputs even though the frame-level semantics are correct. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises during the process of aggregating frame-level semantics into event-level semantic groups. Because SAH becomes particularly critical in long videos, where semantic complexity increases across multiple events, it is essential to separate and thoroughly investigate the causes of this type of hallucination. To address these issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. Additionally, we find that models are more prone to SAH on rapidly changing semantics. Moreover, we discuss potential approaches to mitigate SAH. We demonstrate that the positional encoding strategy contributes to alleviating SAH, and we further adopt a DPO strategy to enhance the model's ability to distinguish semantics within and across events. To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in the SAH ratio.
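Note on the DPO step: the abstract states only that a DPO strategy is applied to 8K adversarial data pairs; it does not give the training details. As a minimal sketch, the standard Direct Preference Optimization loss for a single preference pair can be written as below. The function name, the argument names, and the default `beta` are illustrative assumptions, not taken from the paper, and the inputs are assumed to be summed sequence log-probabilities under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All names and the default beta are hypothetical; the paper's actual
    hyperparameters are not given in the abstract.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    z = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(z)) = log(1 + exp(-z)), computed in a stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Usage: a policy that prefers the chosen response more than the reference
# does gets a loss below log(2), the value at zero preference margin.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
better = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(neutral, better)
```

In this form the loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, which is the generic mechanism a preference pair exploits; how the paper constructs its adversarial pairs is not specified here.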


Related papers