Fugu-MT Paper Translation (Abstract): GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models

Paper abstract: GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models

arxiv url: http://arxiv.org/abs/2510.07791v1
Date: Thu, 09 Oct 2025 05:09:27 GMT
Status: Translation complete
Last updated in system: 2025-10-10 17:54:14.876009
Title: GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models
Title (reference translation): GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models
Authors: Qinghongbing Xie, Zhaoyuan Xia, Feng Zhu, Lijun Gong, Ziyue Li, Rui Zhao, Long Zeng
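Abstract summary: GTR-Bench (Geo-Temporal Reasoning benchmark) is a novel challenge for geographic temporal reasoning of moving targets in a large-scale camera network. Evaluations of more than 10 popular Visual-Language Models (VLMs) on GTR-Bench show that even the best proprietary model, Gemini-2.5-Pro, lags significantly behind human performance (78.61%) on geo-temporal reasoning. GTR-Bench offers valuable insights and opens up new opportunities for research and applications in spatial-temporal intelligence.
Reference score (independently computed attention score): 12.634203010453282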
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, the spatial-temporal intelligence of Visual-Language Models (VLMs) has attracted much attention due to its importance for Autonomous Driving, Embodied AI and General Artificial Intelligence. Existing spatial-temporal benchmarks mainly focus on egocentric perspective reasoning with images/video context, or geographic perspective reasoning with graphics context (e.g., a map), and thus fail to assess VLMs' geographic spatial-temporal intelligence with both images/video and graphics context, which is important for areas like traffic management and emergency response. To address these gaps, we introduce the Geo-Temporal Reasoning benchmark (GTR-Bench), a novel challenge for geographic temporal reasoning of moving targets in a large-scale camera network. GTR-Bench is more challenging as it requires multiple perspective switches between maps and videos, joint reasoning across multiple videos with non-overlapping fields of view, and inference over spatial-temporal regions that are unobserved by any video context. Evaluations of more than 10 popular VLMs on GTR-Bench demonstrate that even the best proprietary model, Gemini-2.5-Pro (34.9%), significantly lags behind human performance (78.61%) on geo-temporal reasoning. Moreover, our comprehensive analysis on GTR-Bench reveals three primary deficiencies of current models for geo-temporal reasoning. (1) VLMs' reasoning is impaired by an imbalanced utilization of spatial-temporal context. (2) VLMs are weak in temporal forecasting, which leads to worse performance on temporal-emphasized tasks than on spatial-emphasized tasks. (3) VLMs lack the proficiency to comprehend or align the map data with multi-view video inputs. We believe GTR-Bench offers valuable insights and opens up new opportunities for research and applications in spatial-temporal intelligence. Benchmark and code will be released at https://github.com/X-Luffy/GTR-Bench.
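Abstract (reference translation): In recent years, the spatial-temporal intelligence of Visual-Language Models (VLMs) has attracted attention because of its importance for autonomous driving, embodied AI, and general artificial intelligence. Existing spatial-temporal benchmarks focus mainly on egocentric-perspective reasoning with image/video context, or on geographic-perspective reasoning with graphics context (e.g., a map). As a result, they cannot assess the geographic spatial-temporal intelligence of VLMs with both image/video and graphics context, a capability that matters in areas such as traffic management and emergency response. To address this gap, we introduce the Geo-Temporal Reasoning benchmark (GTR-Bench), a new challenge for geographic temporal reasoning of moving targets in a large-scale camera network. GTR-Bench is more difficult because it requires multiple perspective switches between maps and videos, joint reasoning across multiple videos with non-overlapping fields of view, and inference over spatial-temporal regions that no video context observes. Evaluations of more than 10 popular VLMs on GTR-Bench show that even the best proprietary model, Gemini-2.5-Pro (34.9%), lags significantly behind human performance (78.61%) on geo-temporal reasoning. In addition, a comprehensive analysis on GTR-Bench reveals three main deficiencies of current models in geo-temporal reasoning. (1) VLM reasoning is impaired by imbalanced use of spatial-temporal context. (2) VLMs are weak at temporal forecasting, so they perform worse on temporal-emphasized tasks than on spatial-emphasized tasks. (3) VLMs lack the ability to comprehend map data or align it with multi-view video inputs. We believe GTR-Bench offers valuable insights and opens up new opportunities for research and applications in spatial-temporal intelligence. Benchmark and code will be released at https://github.com/X-Luffy/GTR-Bench.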


Related papers