C-SAW: Self-Supervised Prompt Learning for Image Generalization in
Remote Sensing
- URL: http://arxiv.org/abs/2311.15812v1
- Date: Mon, 27 Nov 2023 13:35:20 GMT
- Title: C-SAW: Self-Supervised Prompt Learning for Image Generalization in
Remote Sensing
- Authors: Avigyan Bhattacharya, Mainak Singha, Ankit Jha, Biplab Banerjee
- Abstract summary: We focus on domain and class generalization problems in analyzing optical remote sensing images, using the large-scale pre-trained vision-language model (VLM), CLIP.
Existing prompt learning techniques overlook the importance of incorporating domain and content information into the prompts.
We propose a solution that ensures domain-invariant prompt learning while enhancing the expressiveness of visual features.
- Score: 12.930814370829893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We focus on domain and class generalization problems in analyzing optical
remote sensing images, using the large-scale pre-trained vision-language model
(VLM), CLIP. While contrastively trained VLMs show impressive zero-shot
generalization performance, their effectiveness is limited when dealing with
diverse domains during training and testing. Existing prompt learning
techniques overlook the importance of incorporating domain and content
information into the prompts, which results in a performance drop when dealing
with such multi-domain data. To address these challenges, we propose a
solution that ensures domain-invariant prompt learning while enhancing the
expressiveness of visual features. We observe that CLIP's vision encoder
struggles to identify contextual image information, particularly when image
patches are jumbled up. This issue is especially severe in optical remote
sensing images, where land-cover classes exhibit well-defined contextual
appearances. To this end, we introduce C-SAW, a method that complements CLIP
with a self-supervised loss in the visual space and a novel prompt learning
technique that emphasizes both visual domain and content-specific features. We
keep the CLIP backbone frozen and introduce a small set of projectors for both
the CLIP encoders to train C-SAW contrastively. Experimental results
demonstrate the superiority of C-SAW across multiple remote sensing benchmarks
and different generalization tasks.
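The abstract sketches a concrete training recipe: both CLIP encoders stay frozen, a small set of trainable projectors sits on top of them, and the projectors are trained with a contrastive image-prompt objective plus a self-supervised loss that makes the visual representation sensitive to jumbled patches. A minimal PyTorch-style sketch of one such training step is given below; `Projector`, `patch_jumble`, `csaw_style_step`, `ssl_weight`, and the cosine-consistency form of the self-supervised loss are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative sketch only: a frozen CLIP backbone, small trainable projectors
# on both encoders, a contrastive image-prompt loss, and a self-supervised
# consistency loss between clean and patch-jumbled views of the same image.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Projector(nn.Module):
    """Small trainable head on top of a frozen CLIP encoder (assumed design)."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)


def patch_jumble(images: torch.Tensor, grid: int = 4) -> torch.Tensor:
    """Randomly permute non-overlapping patches of each image (pretext view)."""
    b, c, h, w = images.shape
    ph, pw = h // grid, w // grid
    patches = images.unfold(2, ph, ph).unfold(3, pw, pw)       # B,C,g,g,ph,pw
    patches = patches.contiguous().view(b, c, grid * grid, ph, pw)
    perm = torch.randperm(grid * grid, device=images.device)
    patches = patches[:, :, perm]                               # shuffle blocks
    patches = patches.view(b, c, grid, grid, ph, pw)
    return patches.permute(0, 1, 2, 4, 3, 5).contiguous().view(b, c, h, w)


def csaw_style_step(clip_model, img_proj, txt_proj, images, tokenized_prompts,
                    ssl_weight: float = 0.5, temperature: float = 0.07):
    """One training step; `clip_model` is assumed to expose `encode_image` and
    `encode_text` (as open-source CLIP implementations do) and stays frozen."""
    with torch.no_grad():                                       # frozen backbone
        f_img = clip_model.encode_image(images).float()
        f_jum = clip_model.encode_image(patch_jumble(images)).float()
        f_txt = clip_model.encode_text(tokenized_prompts).float()

    z_img, z_jum, z_txt = img_proj(f_img), img_proj(f_jum), txt_proj(f_txt)

    # Contrastive alignment, assuming the i-th prompt describes the i-th image.
    logits = z_img @ z_txt.t() / temperature
    targets = torch.arange(len(images), device=images.device)
    contrastive = F.cross_entropy(logits, targets)

    # Self-supervised loss in visual space: the projected jumbled view should
    # stay close to the clean view (an illustrative choice of objective).
    ssl = 1.0 - F.cosine_similarity(z_img, z_jum, dim=-1).mean()

    return contrastive + ssl_weight * ssl
```

Under this reading, only the projector parameters receive gradients, which is consistent with the abstract's statement that the CLIP backbone remains frozen while C-SAW is trained contrastively.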
Related papers
- MLIP: Efficient Multi-Perspective Language-Image Pretraining with Exhaustive Data Utilization [25.53345417279545]
Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success, leading to rapid advancements in multimodal studies.
CLIP relies on a single contrastive supervision for each image-text pair during representation learning.
We propose Multi-Perspective Language-Image Pretraining (MLIP) to address this limitation.
arXiv Detail & Related papers (2024-06-03T15:49:11Z)
- MouSi: Poly-Visual-Expert Vision-Language Models [132.58949014605477]
This paper proposes an ensemble-of-experts technique to synergize the capabilities of individual visual encoders.
This technique introduces a fusion network to unify the processing of outputs from different visual experts.
In our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1.
arXiv Detail & Related papers (2024-01-30T18:09:11Z)
- Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z)
- Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment [61.769441954135246]
We introduce a method to train vision-language models for remote-sensing images without using any textual annotations.
Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language.
arXiv Detail & Related papers (2023-12-12T03:39:07Z)
- Domain-Controlled Prompt Learning [49.45309818782329]
Existing prompt learning methods often lack domain-awareness or domain-transfer mechanisms.
We propose Domain-Controlled Prompt Learning for such specific domains.
Our method achieves state-of-the-art performance in specific domain image recognition datasets.
arXiv Detail & Related papers (2023-09-30T02:59:49Z)
- GOPro: Generate and Optimize Prompts in CLIP using Self-Supervised Learning [14.532939492926406]
We propose a prompt learning-based model called GOPro to overcome the challenges posed by CLIP's contrastive loss and SSL losses.
GOPro is trained end-to-end on all three loss objectives, combining the strengths of CLIP and SSL in a principled manner.
arXiv Detail & Related papers (2023-08-22T17:53:26Z)
- APPLeNet: Visual Attention Parameterized Prompt Learning for Few-Shot Remote Sensing Image Generalization using CLIP [12.73827827842155]
We propose a novel image-conditioned prompt learning strategy called the Visual Attention conditioned Prompts Learning Network (APPLeNet).
APPLeNet emphasizes the importance of multi-scale feature learning in RS scene classification and disentangles visual style and content primitives for domain generalization tasks.
Our results consistently outperform the relevant literature and code is available at https://github.com/mainaksingha01/APPLeNet.
arXiv Detail & Related papers (2023-04-12T17:20:37Z)
- StyLIP: Multi-Scale Style-Conditioned Prompt Learning for CLIP-based Domain Generalization [26.08922351077744]
StyLIP is a novel approach for Domain Generalization that enhances CLIP's classification performance across domains.
Our method focuses on a domain-agnostic prompt learning strategy, aiming to disentangle the visual style and content information embedded in CLIP's pre-trained vision encoder.
arXiv Detail & Related papers (2023-02-18T07:36:16Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
arXiv Detail & Related papers (2022-07-25T17:58:16Z)