Fugu-MT Paper Translation (Summary): BGG: Bridging the Geometric Gap between Cross-View images by Vision Foundation Model Adaptation for Geo-Localization

Paper summary: BGG: Bridging the Geometric Gap between Cross-View images by Vision Foundation Model Adaptation for Geo-Localization

arxiv url: http://arxiv.org/abs/2605.10345v1
Date: Mon, 11 May 2026 10:46:33 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:50.743085
Title: BGG: Bridging the Geometric Gap between Cross-View images by Vision Foundation Model Adaptation for Geo-Localization
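Title (reference translation): BGG: Bridging the geometric gap between cross-view images through vision foundation model adaptation for geo-localization
Authors: Wei Wang, Dou Quan, Ning Huyan, Shuang Wang, Yi Li, Pei He, Licheng Jiao
Abstract summary: Cross-View Geo-Localization (CVGL) obtains the location of an image by image retrieval. This paper proposes a parameter-efficient framework, based on a vision foundation model (VFM), that bridges the geometric gap between images. It mainly contains a Multi-granularity Feature Enhancement Adapter (MFEA) and a Frequency-Aware Structural Aggregation (FASA) module.
Reference score (independently computed attention score): 50.74663742490919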
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Geometric differences between cross-view images, such as drone and satellite views, significantly increase the challenge of Cross-View Geo-Localization (CVGL), which aims to acquire the geolocation of images by image retrieval. To further enhance the CVGL performance, this paper proposes a parameter-efficient adaptation framework for bridging the geometric gap across images based on the vision foundation model (VFM) (e.g., DINOv3), termed BGG. BGG not only effectively leverages the general visual representations of VFM and captures the robust and consistent features from cross-view images, but also utilizes the generalization capabilities of the VFM, significantly improving the CVGL performance. It mainly contains a Multi-granularity Feature Enhancement Adapter (MFEA) and a Frequency-Aware Structural Aggregation (FASA) module. Specifically, MFEA enhances the scale adaptability and viewpoint robustness of features by multi-level dilated convolutions, effectively bridging the cross-view geometric gap with small training costs. Additionally, considering the [CLS] token lacks spatial details for precise image retrieval and localization, the FASA module modulates patch tokens in the frequency domain and performs adaptive aggregation for local structural feature enhancement. Finally, BGG fuses the enhanced local features with the [CLS] token for more accurate CVGL. Extensive experiments on University-1652 and SUES-200 datasets demonstrate that BGG has significant advantages over other methods and achieves state-of-the-art localization performance with low training costs.
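Abstract (reference translation): Geometric differences between cross-view images, such as drone and satellite views, significantly increase the difficulty of Cross-View Geo-Localization (CVGL), which aims to obtain the location of an image through image retrieval. To further improve CVGL performance, we propose BGG, a parameter-efficient adaptation framework that bridges the geometric gap between images on top of a vision foundation model (for example, DINOv3). BGG not only makes effective use of the VFM's general visual representations and captures robust, consistent features from cross-view images, but also exploits the VFM's generalization ability, greatly improving CVGL performance. It mainly contains a Multi-granularity Feature Enhancement Adapter (MFEA) and a Frequency-Aware Structural Aggregation (FASA) module. Specifically, MFEA uses multi-level dilated convolutions to improve the scale adaptability and viewpoint robustness of features, effectively bridging the cross-view geometric gap at a small training cost. In addition, because the [CLS] token lacks the spatial detail needed for precise image retrieval and localization, the FASA module modulates patch tokens in the frequency domain and performs adaptive aggregation to enhance local structural features. Finally, BGG fuses the enhanced local features with the [CLS] token for more accurate CVGL. Extensive experiments on the University-1652 and SUES-200 datasets show that BGG has significant advantages over other methods and achieves state-of-the-art localization performance at low training cost.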
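As a rough illustration of the retrieval step the abstract describes, the sketch below fuses a [CLS] token with pooled patch tokens into one descriptor and ranks satellite tiles by cosine similarity to a drone query. It is not the authors' implementation: the abstract does not say how the fusion, the adaptive aggregation, or the similarity measure are defined, so concatenation, weighted averaging, and cosine similarity are assumptions here, and the names `fuse`, `retrieve`, `tile_A`, and `tile_B` are hypothetical. The MFEA adapter and the frequency-domain modulation in FASA are not modelled.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(cls_token, patch_tokens, weights):
    """Concatenate the [CLS] token with a weighted average of patch tokens.

    Assumption: concatenation and weighted averaging stand in for the
    fusion and adaptive aggregation, which the abstract does not specify.
    """
    total = sum(weights)
    dim = len(patch_tokens[0])
    pooled = [
        sum(w * tok[d] for w, tok in zip(weights, patch_tokens)) / total
        for d in range(dim)
    ]
    # Normalize each part first so neither dominates the fused descriptor.
    return l2_normalize(l2_normalize(cls_token) + l2_normalize(pooled))


def retrieve(query, gallery):
    """Rank gallery entries by cosine similarity to the query descriptor.

    Descriptors are unit length, so the dot product equals the cosine.
    """
    scores = {
        name: sum(q * g for q, g in zip(query, desc))
        for name, desc in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: 2-D tokens, two satellite tiles, one drone query.
gallery = {
    "tile_A": fuse([1.0, 0.0], [[1.0, 0.0], [0.8, 0.2]], [1.0, 1.0]),
    "tile_B": fuse([0.0, 1.0], [[0.1, 0.9], [0.0, 1.0]], [1.0, 1.0]),
}
query = fuse([0.9, 0.1], [[0.9, 0.1], [0.7, 0.3]], [2.0, 1.0])
ranking = retrieve(query, gallery)
print(ranking)  # best match first
```

In a real pipeline the tokens would come from the adapted VFM backbone, and the top-ranked satellite tile's known coordinates would be taken as the location estimate for the query image.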


Related papers