MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching
- URL: http://arxiv.org/abs/2501.11299v1
- Date: Mon, 20 Jan 2025 06:56:30 GMT
- Title: MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching
- Authors: Yepeng Liu, Zhichao Sun, Baosheng Yu, Yitian Zhao, Bo Du, Yongchao Xu, Jun Cheng,
- Abstract summary: Keypoint detection and description methods often struggle with multimodal data.
We propose a modality-invariant feature learning network (MIFNet) to compute modality-invariant features for keypoint descriptions in multimodal image matching.
- Score: 54.740256498985026
- License:
- Abstract: Many keypoint detection and description methods have been proposed for image matching or registration. While these methods demonstrate promising performance for single-modality image matching, they often struggle with multimodal data because the descriptors trained on single-modality data tend to lack robustness against the non-linear variations present in multimodal data. Extending such methods to multimodal image matching often requires well-aligned multimodal data to learn modality-invariant descriptors. However, acquiring such data is often costly and impractical in many real-world scenarios. To address this challenge, we propose a modality-invariant feature learning network (MIFNet) to compute modality-invariant features for keypoint descriptions in multimodal image matching using only single-modality training data. Specifically, we propose a novel latent feature aggregation module and a cumulative hybrid aggregation module to enhance the base keypoint descriptors trained on single-modality data by leveraging pre-trained features from Stable Diffusion models. We validate our method with recent keypoint detection and description methods in three multimodal retinal image datasets (CF-FA, CF-OCT, EMA-OCTA) and two remote sensing datasets (Optical-SAR and Optical-NIR). Extensive experiments demonstrate that the proposed MIFNet is able to learn modality-invariant feature for multimodal image matching without accessing the targeted modality and has good zero-shot generalization ability. The source code will be made publicly available.
Related papers
- SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal datasets and Multi-Task Object Detection (M2Det) for remote sensing.
It is designed to accurately detect horizontal or oriented objects from any sensor modality.
This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z) - MINIMA: Modality Invariant Image Matching [52.505282811925454]
We present MINIMA, a unified image matching framework for multiple cross-modal cases.
We scale up the modalities from cheap but rich RGB-only matching data, by means of generative models.
With MD-syn, we can directly train any advanced matching pipeline on randomly selected modality pairs to obtain cross-modal ability.
arXiv Detail & Related papers (2024-12-27T02:39:50Z) - Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal Seg (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z) - MMP: Towards Robust Multi-Modal Learning with Masked Modality Projection [10.909746391230206]
Multimodal learning seeks to combine data from multiple input sources to enhance the performance of downstream tasks.
Existing methods that can handle missing modalities involve custom training or adaptation steps for each input modality combination.
We propose Masked Modality Projection (MMP), a method designed to train a single model that is robust to any missing modality scenario.
arXiv Detail & Related papers (2024-10-03T21:41:12Z) - Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance [15.435695491233982]
We propose a novel framework to explore and exploit the powerful feature representation and zero-shot generalization ability of the Segment Anything Model (SAM) for multi-modal salient object detection (SOD)
We develop underlineSAM with seunderlinemantic funderlineeature fuunderlinesion guidancunderlinee (Sammese)
In the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information. Specifically, in the mask decoder, a semantic-geometric
arXiv Detail & Related papers (2024-08-27T13:47:31Z) - CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal
Models with Multiple Image Inputs [48.269363759989915]
The research focuses on two aspects: first, image-to-image matching, and second, multi-image-to-text matching.
We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL.
arXiv Detail & Related papers (2024-01-05T00:26:07Z) - Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z) - FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing [88.6654909354382]
We present a pure transformer-based framework, dubbed the Flexible Modal Vision Transformer (FM-ViT) for face anti-spoofing.
FM-ViT can flexibly target any single-modal (i.e., RGB) attack scenarios with the help of available multi-modal data.
Experiments demonstrate that the single model trained based on FM-ViT can not only flexibly evaluate different modal samples, but also outperforms existing single-modal frameworks by a large margin.
arXiv Detail & Related papers (2023-05-05T04:28:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.