Fugu-MT Paper Translation (Abstract): LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding

Paper Summary: LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding

arxiv url: http://arxiv.org/abs/2510.17305v2
Date: Tue, 21 Oct 2025 10:16:53 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:12.030477
Title: LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
Title (reference translation): LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
Authors: ZhaoYang Han, Qihan Lin, Hao Liang, Bowen Chen, Zhou Liu, Wentao Zhang
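Abstract summary: LongInsightBench is the first benchmark designed to evaluate models' ability to understand long videos. The benchmark excels in three key areas: (a) long-duration, information-dense videos, (b) diverse and challenging task scenarios, and (c) rigorous and comprehensive quality assurance pipelines.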
Reference score (independently calculated attention score): 19.03169157546538
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce \textbf{LongInsightBench}, the first benchmark designed to assess models' ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating \textbf{visual, audio, and text} modalities. Our benchmark excels in three key areas: \textbf{a) Long-Duration, Information-Dense Videos:} We carefully select approximately 1,000 videos from the open-source dataset FineVideo based on duration limits and the information density of both visual and audio modalities, focusing on content such as lectures, interviews, and vlogs, which contain rich language elements. \textbf{b) Diverse and Challenging Task Scenarios:} We have designed six challenging task scenarios, including both Intra-Event and Inter-Event Tasks. \textbf{c) Rigorous and Comprehensive Quality Assurance Pipelines:} We have developed a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answer options. Based on LongInsightBench, we designed a series of experiments. Experimental results show that omni-modal models (OLMs) still face challenges in tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal information loss and processing bias in the multi-modal fusion of OLMs. Our dataset and code are available at https://anonymous.4open.science/r/LongInsightBench-910F/.
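Abstract (reference translation): This is the first benchmark designed to evaluate models' ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements. The benchmark excels in three key areas: \textbf{a) Long-Duration, Information-Dense Videos:} Approximately 1,000 videos are carefully selected from an open-source dataset. \textbf{b) Diverse and Challenging Task Scenarios:} We designed six challenging task scenarios, including both intra-event and inter-event tasks. \textbf{c) Rigorous and Comprehensive Quality Assurance Pipelines:} We developed a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answers. Based on LongInsightBench, we designed a series of experiments. The experimental results show that omni-modal models (OLMs) still face challenges in tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal information loss and processing bias in the multi-modal fusion of OLMs. Our dataset and code are available at https://anonymous.4open.science/r/LongInsightBench-910F/.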

Related Papers