SimMAT: Exploring Transferability from Vision Foundation Models to Any Image Modality
- URL: http://arxiv.org/abs/2409.08083v1
- Date: Thu, 12 Sep 2024 14:38:21 GMT
- Title: SimMAT: Exploring Transferability from Vision Foundation Models to Any Image Modality
- Authors: Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Ziwei Liu, Qifeng Chen, Zhaoxiang Zhang
- Abstract summary: Foundation models such as ChatGPT and Sora, trained on huge amounts of data, have had a revolutionary social impact.
However, for sensors in many other fields, it is extremely challenging to collect data at a scale comparable to natural images for training strong foundation models.
This work presents SimMAT, a simple and effective framework for studying an open problem: the transferability of vision foundation models trained on natural RGB images to other image modalities with different physical properties.
- Score: 136.82569085134554
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundation models such as ChatGPT and Sora, trained on huge amounts of data, have had a revolutionary social impact. However, for sensors in many other fields, it is extremely challenging to collect data at a scale comparable to natural images for training strong foundation models. To this end, this work presents SimMAT, a simple and effective framework for studying an open problem: the transferability of vision foundation models trained on natural RGB images to other image modalities with different physical properties (e.g., polarization). SimMAT consists of a modality-agnostic transfer layer (MAT) and a pretrained foundation model. We apply SimMAT to a representative vision foundation model, the Segment Anything Model (SAM), to support any evaluated new image modality. Given the absence of relevant benchmarks, we construct a new benchmark to evaluate transfer learning performance. Our experiments confirm the intriguing potential of transferring vision foundation models to enhance other sensors' performance. Specifically, SimMAT improves segmentation performance (mIoU) from 22.15% to 53.88% on average across the evaluated modalities and consistently outperforms other baselines. We hope that SimMAT can raise awareness of cross-modal transfer learning and benefit various fields in achieving better results with vision foundation models.
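For intuition, below is a minimal PyTorch sketch of the transfer idea described in the abstract: a small, trainable modality-agnostic layer maps an input with an arbitrary number of channels (e.g., a 4-channel polarization image) into the 3-channel input space a frozen RGB-pretrained encoder expects, and only the new layer and a task head are tuned. The class names, layer design, and the stand-in encoder/head here are illustrative assumptions, not the paper's exact MAT architecture or SAM integration.

```python
import torch
import torch.nn as nn


class ModalityAgnosticTransferLayer(nn.Module):
    """Illustrative sketch: project an input with an arbitrary number of
    channels into the 3-channel space an RGB-pretrained encoder expects.
    The actual MAT design in SimMAT may differ."""

    def __init__(self, in_channels: int, out_channels: int = 3, hidden: int = 16):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(hidden, out_channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


class SimMATLikeModel(nn.Module):
    """Trainable transfer layer in front of a frozen pretrained encoder
    (e.g., SAM's image encoder); only the transfer layer and task head
    receive gradients in this sketch."""

    def __init__(self, mat: nn.Module, pretrained_encoder: nn.Module, head: nn.Module):
        super().__init__()
        self.mat = mat
        self.encoder = pretrained_encoder
        for p in self.encoder.parameters():  # keep the foundation model frozen
            p.requires_grad = False
        self.head = head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.encoder(self.mat(x))
        return self.head(features)


# Hypothetical usage with a 4-channel polarization image and stand-in modules.
if __name__ == "__main__":
    encoder = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # stand-in for a pretrained encoder
    head = nn.Conv2d(64, 1, kernel_size=1)                # stand-in segmentation head
    model = SimMATLikeModel(ModalityAgnosticTransferLayer(in_channels=4), encoder, head)
    out = model(torch.randn(2, 4, 128, 128))
    print(out.shape)  # torch.Size([2, 1, 128, 128])
```

The design question the paper studies is precisely this interface: how to feed a non-RGB modality into an RGB-pretrained backbone while reusing as much of its learned knowledge as possible.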
Related papers
- Synthetic Vision: Training Vision-Language Models to Understand Physics [9.474337395173388]
We propose two methods to enhance Vision-Language Models' physical reasoning capabilities using simulated data.
First, we fine-tune a pre-trained VLM using question-answer pairs generated from simulations relevant to physical reasoning tasks.
Second, we introduce Physics Context Builders (PCBs) to create scene descriptions enriched with physical properties and processes.
arXiv Detail & Related papers (2024-12-11T18:40:16Z)
- SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality [116.54152244934775]
Foundation models such as ChatGPT and Sora, trained on huge amounts of data, have had a revolutionary social impact.
However, for sensors in many other fields, it is extremely challenging to collect data at a scale comparable to natural images for training strong foundation models.
This work presents SimCMF, a simple and effective framework for studying an important problem: cross-modal fine-tuning from vision foundation models trained on natural RGB images to other imaging modalities with different physical properties.
arXiv Detail & Related papers (2024-11-27T16:35:58Z)
- MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
However, transferring pretrained models to downstream tasks may encounter a task discrepancy, because pretraining is formulated as image classification or object discrimination.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z)
- Diversity is Definitely Needed: Improving Model-Agnostic Zero-shot Classification via Stable Diffusion [22.237426507711362]
Model-Agnostic Zero-Shot Classification (MA-ZSC) refers to training non-specific classification architectures to classify real images without using any real images during training.
Recent research has demonstrated that generating synthetic training images using diffusion models provides a potential solution to address MA-ZSC.
We propose modifications to the text-to-image generation process using a pre-trained diffusion model to enhance diversity.
arXiv Detail & Related papers (2023-02-07T07:13:53Z)
- sim2real: Cardiac MR Image Simulation-to-Real Translation via Unsupervised GANs [0.4433315630787158]
We provide image simulation on virtual XCAT subjects with varying anatomies.
We propose a sim2real translation network to improve image realism.
Our usability experiments suggest that sim2real data exhibits a good potential to augment training data and boost the performance of a segmentation algorithm.
arXiv Detail & Related papers (2022-08-09T16:06:06Z)
- Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
arXiv Detail & Related papers (2022-08-08T09:08:40Z)
- Task2Sim: Towards Effective Pre-training and Transfer from Synthetic Data [74.66568380558172]
We study the transferability of pre-trained models based on synthetic data generated by graphics simulators to downstream tasks.
We introduce Task2Sim, a unified model mapping downstream task representations to optimal simulation parameters.
It learns this mapping by training to find the set of best parameters on a set of "seen" tasks.
Once trained, it can then be used to predict best simulation parameters for novel "unseen" tasks in one shot.
arXiv Detail & Related papers (2021-11-30T19:25:27Z)
- SimAug: Learning Robust Representations from Simulation for Trajectory Prediction [78.91518036949918]
We propose a novel approach to learn robust representation through augmenting the simulation training data.
We show that SimAug achieves promising results on three real-world benchmarks using zero real training data.
arXiv Detail & Related papers (2020-04-04T21:22:01Z)