Fugu-MT Paper Translation (Abstract): GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

Paper overview: GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

arxiv url: http://arxiv.org/abs/2511.15705v1
Date: Wed, 19 Nov 2025 18:59:22 GMT
Status: Translation complete
System update date: 2025-11-20 15:51:28.953054
Title: GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
Title (reference translation): GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
Authors: Yikun Wang, Zuyan Liu, Ziyi Wang, Pengfei Liu, Han Hu, Yongming Rao
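Abstract summary: Recent work on agentic visual reasoning enables deep multimodal understanding but focuses mainly on image manipulation tools. This work revisits the geolocalization task, which requires not only visual grounding but also web search to verify or revise hypotheses. Because existing geolocalization benchmarks neither meet the need for high-resolution imagery nor pose a localization challenge suited to deep agentic reasoning, the authors curate GeoBench. They propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information.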
Reference score (independently calculated attention score): 53.080882980294795
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities, to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it, including a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward to leverage multi-level geographical information and improve overall geolocalization performance. Experimental results show that GeoVista substantially surpasses other open-source agentic models on the geolocalization task and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.
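
The abstract mentions a hierarchical reward over multi-level geographical information but does not give its form. The following is a minimal sketch of one way such a reward could be structured, not the authors' implementation: the three levels (country, region, city), the `Place` type, the `hierarchical_reward` function, and the reward values 0.2 / 0.5 / 1.0 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Place:
    """A location described at three administrative levels (hypothetical schema)."""
    country: str
    region: Optional[str] = None
    city: Optional[str] = None


# Hypothetical reward granted when the prediction is correct down to each level.
# These values are illustrative assumptions, not taken from the paper.
LEVEL_REWARDS = (("country", 0.2), ("region", 0.5), ("city", 1.0))


def _normalize(name: Optional[str]) -> Optional[str]:
    """Case- and whitespace-insensitive form of a place name, or None if absent."""
    return name.strip().casefold() if name else None


def hierarchical_reward(pred: Place, truth: Place) -> float:
    """Return the reward for the finest level at which pred still matches truth.

    Levels are checked from coarse to fine; the first mismatch (or missing
    value) stops the walk, so a correct city under a wrong region earns
    nothing beyond the country-level reward.
    """
    reward = 0.0
    for level, value in LEVEL_REWARDS:
        p = _normalize(getattr(pred, level))
        t = _normalize(getattr(truth, level))
        if p is None or t is None or p != t:
            break
        reward = value
    return reward
```

Under these assumptions, a prediction that gets only the country right still receives partial credit, which gives a learning signal even when the exact city is wrong.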

Related papers