Fugu-MT Paper Translation (Abstract): PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

Paper overview: PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

arxiv url: http://arxiv.org/abs/2603.26653v1
Date: Fri, 27 Mar 2026 17:54:36 GMT
Status: Translation complete
Last updated in system: 2026-03-30 21:49:48.628228
Title: PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
Title (reference translation): PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
Authors: Shaoxuan Li, Zhixuan Zhao, Hanze Deng, Zirun Ma, Shulin Tian, Zuyan Liu, Yushi Hu, Haoning Wu, Yuhao Dong, Benlin Liu, Ziwei Liu, Ranjay Krishna
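Abstract summary: This paper introduces PerceptionComp, a benchmark for perception-centric video reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps.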
Reference score (attention score computed independently by this site): 63.52215283384644
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment is sufficient: answering each question requires multiple temporally separated pieces of visual evidence and compositional constraints under conjunctive and sequential logic, spanning perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and requiring skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains including city walk tours, indoor villa tours, video games, and extreme outdoor sports, with 100% manual annotation. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed. State-of-the-art MLLMs also perform substantially worse on PerceptionComp than on existing benchmarks: the best model in our evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric long-horizon video reasoning remains a major bottleneck, and we hope PerceptionComp will help drive progress in perceptual reasoning.
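Abstract (reference translation): We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. In PerceptionComp, answering each question requires multiple temporally separated pieces of visual evidence and compositional constraints, covering perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and skills such as semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains, including city walk tours, indoor villa tours, video games, and extreme outdoor sports. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed. The best model in the evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric long-horizon video reasoning remains a major bottleneck, and the authors hope PerceptionComp will help drive progress in perceptual reasoning.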

Related papers