Fugu-MT Paper Translation (Abstract): TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

Paper summary: TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

arxiv url: http://arxiv.org/abs/2605.09904v2
Date: Tue, 12 May 2026 03:09:23 GMT
Status: Translation complete
Last updated in system: 2026-05-13 18:21:07.042686
Title: TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
Authors: Junzhe Chen, Siyuan Meng, Yuxi Chen, Man Zhao, Wenyao Gui, Xiaojie Guo,
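Abstract summary: Video large language models (Video-LLMs) have made strong progress in general video understanding, but their ability to maintain temporal object consistency remains underexplored. We introduce TOC-Bench, a diagnostic benchmark for evaluating temporal object consistency in Video-LLMs.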
Reference score (independently computed attention metric): 9.648992690108086
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video large language models (Video-LLMs) have made strong progress in general video understanding, but their ability to maintain temporal object consistency remains underexplored. Existing benchmarks often emphasize event recognition, action understanding, or coarse temporal reasoning, while rarely testing whether models can preserve the identity, state, and continuity of the same object across occlusion, disappearance, reappearance, state transitions, and cross-object interactions. We introduce TOC-Bench, a diagnostic benchmark for evaluating temporal object consistency in Video-LLMs. TOC-Bench is object-track grounded: each queried subject is linked to a per-frame trajectory and a structured temporal event timeline. To ensure that questions require temporally ordered visual evidence rather than language priors, single-frame shortcuts, or unordered frame cues, we design a three-layer temporal-necessity filtering protocol, which removes 60.7% of candidate QA pairs and retains 17,900 temporally dependent items across 10 diagnostic dimensions. From this pool, we construct a human-verified benchmark with 2,323 high-quality QA pairs over 1,951 videos. Experiments on representative Video-LLMs show that temporal object consistency remains a major unsolved challenge, with notable weaknesses in event counting, event ordering, identity-sensitive reasoning, and hallucination-aware verification, even when models perform well on general video understanding benchmarks. These results suggest that object-centric temporal coherence is a key bottleneck for current Video-LLMs, and that TOC-Bench provides a focused platform for diagnosing and improving object-aware temporal reasoning. The resource is available at https://github.com/cjzcjz666/toc_bench.git.
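The abstract gives the filter's removal rate and the retained count but not the size of the candidate pool. The following is a minimal sketch of the implied arithmetic, assuming the 60.7% figure is a rounded share of all candidate QA pairs; the function name is illustrative and not taken from the paper or its repository.

```python
# Figures quoted in the abstract.
RETAINED_ITEMS = 17_900       # temporally dependent items kept by the filter
REMOVED_FRACTION = 0.607      # share of candidate QA pairs removed (rounded)
VERIFIED_QA_PAIRS = 2_323     # human-verified benchmark size
VIDEOS = 1_951                # videos covered by the verified benchmark


def implied_candidate_pool(retained: int, removed_fraction: float) -> float:
    """Estimate the pre-filter pool size from the retained count and removal rate."""
    if not 0.0 <= removed_fraction < 1.0:
        raise ValueError("removed_fraction must be in [0, 1)")
    # retained = pool * (1 - removed_fraction)  =>  pool = retained / (1 - removed_fraction)
    return retained / (1.0 - removed_fraction)


pool = implied_candidate_pool(RETAINED_ITEMS, REMOVED_FRACTION)
verified_share = VERIFIED_QA_PAIRS / RETAINED_ITEMS   # fraction of the filtered pool that was human-verified
qa_per_video = VERIFIED_QA_PAIRS / VIDEOS             # average verified QA pairs per video

print(f"implied candidate pool: about {pool:,.0f} QA pairs")
print(f"human-verified share of filtered pool: {verified_share:.1%}")
print(f"verified QA pairs per video: {qa_per_video:.2f}")
```

Because the removal rate is reported to one decimal place, the pool size is an estimate (roughly 45,500 candidates), not a figure stated by the authors.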

Related papers