Fugu-MT Paper Translation (Abstract): When Vision Misleads, Let Location Speak: A Worldwide Image Geo-Localization Method via Location Attention Mechanism and Large Multimodal Models

Paper overview: When Vision Misleads, Let Location Speak: A Worldwide Image Geo-Localization Method via Location Attention Mechanism and Large Multimodal Models

arxiv url: http://arxiv.org/abs/2606.08918v1
Date: Mon, 08 Jun 2026 01:49:44 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:06.562589
Title: When Vision Misleads, Let Location Speak: A Worldwide Image Geo-Localization Method via Location Attention Mechanism and Large Multimodal Models
Title (reference translation): When Vision Errs: A Worldwide Image Localization Method Using a Location Attention Mechanism and Large Multimodal Models
Authors: Junchao Cui, Wenqi Shi, Xuanzi Ma, Nan Wu, Shaoyong Du, Xiangyang Luo
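Abstract summary: Existing methods often mislocalize images by matching them to visually similar scenes from different geographic regions. The authors propose TransGeoCLIP, a new retrieval-based framework that integrates a location attention mechanism with large multimodal models. The study shows that TransGeoCLIP significantly improves localization performance for visually similar images.
Reference score (independently computed attention score): 23.448145400461513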
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Worldwide image geo-localization aims to determine the capture location of an image on a global scale. Existing methods often mislocalize images by matching them to visually similar scenes from different geographic regions, which limits reliability in practical applications. To address this issue, we propose TransGeoCLIP, a novel retrieval-based framework that integrates a location attention mechanism and large multimodal models (LMMs). Using a Transformer encoder with location attention to encode GPS coordinates, TransGeoCLIP can effectively distinguish geographic features among visually similar images. The framework consists of two stages: 1) Retrieval database construction, which employs Transformers equipped with location attention mechanisms to encode labeled GPS coordinates and enhance location semantics, and subsequently enables joint image-text-GPS embedding through CLIP; 2) Retrieval-augmented inference, which leverages LMMs to infer the final image location prediction from retrieved database results. Extensive experimental results on diverse datasets, including IM2GPS, IM2GPS3k, YFCC4k, and YFCC26k, demonstrate that TransGeoCLIP significantly enhances localization performance for visually similar images. In particular, street-level localization accuracy (within 1 km error) is substantially improved, surpassing state-of-the-art methods by 1.5%, 1.07%, 7.18%, and 9.75% on these benchmarks, respectively.
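The retrieval step that the abstract describes can be illustrated with a minimal sketch. It assumes that query and database embeddings already exist as plain lists of floats and that candidates are ranked by cosine similarity. The function names, the toy vectors, and the choice of cosine similarity are assumptions made for illustration, not details confirmed by the paper. The sketch stops before the LMM inference stage.

```python
import math
from typing import List, Sequence, Tuple

GPS = Tuple[float, float]  # (latitude, longitude) in degrees


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Sequence[float],
    database: List[Tuple[Sequence[float], GPS]],
    k: int = 3,
) -> List[Tuple[float, GPS]]:
    """Return the k database entries most similar to the query, best first.

    Each database entry is a hypothetical (embedding, gps) pair.
    """
    scored = [(cosine_similarity(query, emb), gps) for emb, gps in database]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy database: made-up 2-D embeddings with GPS labels (illustrative values only).
toy_database = [
    ([1.0, 0.0], (48.8584, 2.2945)),
    ([0.0, 1.0], (40.6892, -74.0445)),
    ([0.9, 0.1], (48.8606, 2.3376)),
]
top_two = retrieve_top_k([1.0, 0.05], toy_database, k=2)
```

In a retrieval-augmented setup such as the one the abstract outlines, the GPS labels of the retrieved candidates would then be handed to a later stage that makes the final prediction.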
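Abstract (reference translation): Worldwide image geo-localization aims to determine where an image was captured, at global scale. Existing methods often mislocalize images by matching them to visually similar scenes from different geographic regions, which limits their reliability in practical applications. To address this problem, we propose TransGeoCLIP, a new retrieval-based framework that integrates a location attention mechanism with large multimodal models (LMMs). By encoding GPS coordinates with a Transformer encoder that uses location attention, TransGeoCLIP can effectively distinguish the geographic features of visually similar images. The framework consists of two stages: 1) retrieval database construction, in which Transformers equipped with location attention mechanisms encode the labeled GPS coordinates and strengthen location semantics, enabling a joint image-text-GPS embedding through CLIP; 2) retrieval-augmented inference, in which LMMs infer the final location prediction from the retrieved database results. Extensive experiments on diverse datasets, including IM2GPS, IM2GPS3k, YFCC4k, and YFCC26k, show that TransGeoCLIP significantly improves localization performance for visually similar images. In particular, street-level localization accuracy (error within 1 km) improves substantially, exceeding state-of-the-art methods by 1.5%, 1.07%, 7.18%, and 9.75% on these benchmarks, respectively.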
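The street-level figure quoted in the abstract is the share of predictions that land within 1 km of the true location. A minimal sketch of such a threshold accuracy follows, assuming that error is measured as great-circle (haversine) distance on a spherical Earth with a mean radius of 6371 km. The distance formula, the function names, and the sample coordinates are assumptions for illustration. The paper's exact evaluation code is not shown in this entry.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation (assumption)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_within(
    predictions: List[Tuple[float, float]],
    ground_truth: List[Tuple[float, float]],
    threshold_km: float = 1.0,
) -> float:
    """Fraction of predictions whose error is at most threshold_km."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        1
        for (p_lat, p_lon), (t_lat, t_lon) in zip(predictions, ground_truth)
        if haversine_km(p_lat, p_lon, t_lat, t_lon) <= threshold_km
    )
    return hits / len(predictions)


# Illustrative coordinates: the first prediction is well under 1 km off,
# the second is one degree of latitude (over 100 km) off.
sample_predictions = [(48.8584, 2.2945), (40.0, -74.0)]
sample_truth = [(48.8590, 2.2950), (41.0, -74.0)]
street_level_accuracy = accuracy_within(sample_predictions, sample_truth, threshold_km=1.0)
```

Passing a larger `threshold_km` gives the same metric at a coarser scale.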

