Fugu-MT Paper Translation (Summary): DistillMatch: Leveraging Knowledge Distillation from Vision Foundation Model for Multimodal Image Matching

Paper summary: DistillMatch: Leveraging Knowledge Distillation from Vision Foundation Model for Multimodal Image Matching

arxiv url: http://arxiv.org/abs/2509.16017v1
Date: Fri, 19 Sep 2025 14:26:25 GMT
Status: Translation complete
Last updated in system: 2025-09-22 18:18:11.199097
Title: DistillMatch: Leveraging Knowledge Distillation from Vision Foundation Model for Multimodal Image Matching
Title (reference translation): DistillMatch: Leveraging Knowledge Distillation from a Vision Foundation Model for Multimodal Image Matching
Authors: Meng Yang, Fan Fan, Zizhuo Li, Songchu Deng, Yong Ma, Jiayi Ma
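Abstract summary: Multimodal image matching seeks pixel-level correspondences between images of different modalities. Existing deep learning methods that extract modality-common features for matching lack adaptability to diverse scenarios. This work proposes DistillMatch, a multimodal image matching method that uses knowledge distillation from a Vision Foundation Model.
Reference score (independently computed attention score): 43.83196498370696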
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal image matching seeks pixel-level correspondences between images of different modalities, crucial for cross-modal perception, fusion and analysis. However, the significant appearance differences between modalities make this task challenging. Due to the scarcity of high-quality annotated datasets, existing deep learning methods that extract modality-common features for matching perform poorly and lack adaptability to diverse scenarios. Vision Foundation Model (VFM), trained on large-scale data, yields generalizable and robust feature representations adapted to data and tasks of various modalities, including multimodal matching. Thus, we propose DistillMatch, a multimodal image matching method using knowledge distillation from VFM. DistillMatch employs knowledge distillation to build a lightweight student model that extracts high-level semantic features from VFM (including DINOv2 and DINOv3) to assist matching across modalities. To retain modality-specific information, it extracts and injects modality category information into the other modality's features, which enhances the model's understanding of cross-modal correlations. Furthermore, we design V2I-GAN to boost the model's generalization by translating visible to pseudo-infrared images for data augmentation. Experiments show that DistillMatch outperforms existing algorithms on public datasets.
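Abstract (reference translation): Multimodal image matching seeks pixel-level correspondences between images of different modalities and is essential for cross-modal perception, fusion, and analysis. However, the pronounced appearance differences between modalities make this task difficult. Because high-quality annotated datasets are scarce, existing deep learning methods that extract modality-common features for matching perform poorly and lack adaptability to diverse scenarios. A Vision Foundation Model (VFM), trained on large-scale data, yields generalizable and robust feature representations suited to data and tasks of various modalities, including multimodal matching. We therefore propose DistillMatch, a multimodal image matching method that uses knowledge distillation from a VFM. DistillMatch uses knowledge distillation to build a lightweight student model that extracts high-level semantic features from a VFM (including DINOv2 and DINOv3) to assist matching across modalities. To retain modality-specific information, it extracts modality category information and injects it into the other modality's features, which deepens the model's understanding of cross-modal correlations. In addition, V2I-GAN is designed to improve the model's generalization by translating visible images into pseudo-infrared images for data augmentation. Experiments show that DistillMatch outperforms existing algorithms on public datasets.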
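
The abstract does not state which distillation objective DistillMatch uses. As a rough illustration of feature-level knowledge distillation in general, the sketch below scores how closely a student's feature vectors follow a frozen teacher's by averaging one minus the cosine similarity over paired vectors. The function names, the toy feature values, and the choice of a cosine-based loss are all assumptions made for this example and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def feature_distillation_loss(student_feats, teacher_feats):
    """Mean of (1 - cosine similarity) over paired feature vectors.

    Hypothetical helper: each list holds one feature vector per image
    location, with the teacher's vectors treated as fixed targets.
    """
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must have the same number of vectors")
    if not student_feats:
        raise ValueError("at least one feature vector is required")
    total = sum(
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_feats, teacher_feats)
    )
    return total / len(student_feats)


# Toy features (made up): two locations, two channels each.
teacher = [[1.0, 0.0], [0.0, 2.0]]
aligned_student = [[2.0, 0.0], [0.0, 1.0]]     # same directions as the teacher
misaligned_student = [[0.0, 1.0], [1.0, 0.0]]  # orthogonal to the teacher

print(feature_distillation_loss(aligned_student, teacher))
print(feature_distillation_loss(misaligned_student, teacher))
```

A cosine-based loss ignores feature magnitude, so the student is only asked to match the direction of the teacher's features; an L2 or KL-based objective would be an equally plausible choice for a sketch like this.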

List of related papers