Fugu-MT Paper Translation (Abstract): TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

Paper overview: TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

arxiv url: http://arxiv.org/abs/2603.06687v1
Date: Wed, 04 Mar 2026 07:27:35 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:12.797688
Title: TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings
Title (reference translation): TimeSpot: A Benchmark for the Geo-Temporal Understanding of Vision-Language Models in Real-World Settings
Authors: Azmine Toushik Wasi, Shahriyar Zaman Ridoy, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan Parvez
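Abstract summary: We introduce TimeSpot, a benchmark for evaluating real-world geo-temporal reasoning in vision-language models. TimeSpot comprises 1,455 ground-level images from 80 countries and requires structured prediction of temporal and geographic attributes directly from visual evidence. It also includes spatial-temporal reasoning tasks that test physical plausibility under real-world uncertainty.
Reference score (independently computed attention metric): 10.091610297997613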
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Geo-temporal understanding, the ability to infer location, time, and contextual properties from visual input alone, underpins applications such as disaster management, traffic planning, embodied navigation, world modeling, and geography education. Although recent vision-language models (VLMs) have advanced image geo-localization using cues like landmarks and road signs, their ability to reason about temporal signals and physically grounded spatial cues remains limited. To address this gap, we introduce TimeSpot, a benchmark for evaluating real-world geo-temporal reasoning in VLMs. TimeSpot comprises 1,455 ground-level images from 80 countries and requires structured prediction of temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude-longitude) directly from visual evidence. It also includes spatial-temporal reasoning tasks that test physical plausibility under real-world uncertainty. Evaluations of state-of-the-art open- and closed-source VLMs show low performance, particularly for temporal inference. While supervised fine-tuning yields improvements, results remain insufficient, highlighting the need for new methods to achieve robust, physically grounded geo-temporal understanding. TimeSpot is available at: https://TimeSpot-GT.github.io.
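Abstract (reference translation): Geo-temporal understanding, the ability to infer location, time, and contextual properties from visual input alone, supports applications such as disaster management, traffic planning, embodied navigation, world modeling, and geography education. Recent vision-language models (VLMs) have advanced image geo-localization using cues such as landmarks and road signs, but their ability to reason about temporal signals and physically grounded spatial cues remains limited. To address this gap, we introduce TimeSpot, a benchmark for evaluating real-world geo-temporal reasoning in VLMs. TimeSpot comprises 1,455 ground-level images from 80 countries and requires structured prediction of temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude-longitude) directly from visual evidence. It also includes spatial-temporal reasoning tasks that test physical plausibility under real-world uncertainty. Evaluations of state-of-the-art open-source and closed-source VLMs show low performance, particularly on temporal inference. Supervised fine-tuning brings improvements, but the results remain insufficient, underscoring the need for new methods to achieve robust, physically grounded geo-temporal understanding. TimeSpot is available at https://TimeSpot-GT.github.io.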


Related papers