Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning
- URL: http://arxiv.org/abs/2510.24321v1
- Date: Tue, 28 Oct 2025 11:39:22 GMT
- Title: Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning
- Authors: Ivica Dimitrovski, Vlatko Spasev, Ivan Kitanovski
- Abstract summary: We explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Remote sensing applications increasingly rely on deep learning for scene classification. However, their performance is often constrained by the scarcity of labeled data and the high cost of annotation across diverse geographic and sensor domains. While recent vision-language models like CLIP have shown promise, learning transferable representations at scale by aligning visual and textual modalities, their direct application to remote sensing remains suboptimal due to significant domain gaps and the need for task-specific semantic adaptation. To address this challenge, we systematically explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We evaluate several representative methods, including Context Optimization, Conditional Context Optimization, Multi-modal Prompt Learning, and Prompting with Self-Regulating Constraints. These approaches reflect complementary design philosophies: static context optimization, conditional prompts for enhanced generalization, multi-modal prompts for joint vision-language adaptation, and semantically regularized prompts for stable learning without forgetting. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Through extensive experiments on multiple benchmark remote sensing datasets, including cross-dataset generalization tests, we demonstrate that prompt learning consistently outperforms both baselines in few-shot scenarios. Notably, Prompting with Self-Regulating Constraints achieves the most robust cross-domain performance. Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery, providing a strong foundation for future research in this field.
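To make the adaptation mechanism concrete, below is a minimal sketch of the core idea behind Context Optimization (commonly abbreviated CoOp): a handful of learnable context vectors are prepended to each class-name embedding and trained on few-shot samples while both CLIP encoders stay frozen. This is a hedged sketch assuming PyTorch and OpenAI's `clip` package; the scene classes, context length, learning rate, and initialization are illustrative placeholders, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()                 # avoid fp16 issues during training
for p in model.parameters():
    p.requires_grad_(False)           # CLIP stays frozen throughout

class PromptLearner(nn.Module):
    """Learnable context vectors shared across classes (CoOp-style)."""
    def __init__(self, classnames, n_ctx=4):
        super().__init__()
        dim = model.token_embedding.embedding_dim
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Tokenize "X X X X <class>." so the context slots line up.
        prompts = [" ".join(["X"] * n_ctx) + " " + c + "." for c in classnames]
        tokens = clip.tokenize(prompts).to(device)
        with torch.no_grad():
            embeds = model.token_embedding(tokens)          # (C, 77, dim)
        self.register_buffer("prefix", embeds[:, :1])             # SOS
        self.register_buffer("suffix", embeds[:, 1 + n_ctx:])     # class, EOS
        self.register_buffer("tokens", tokens)

    def forward(self):
        ctx = self.ctx.unsqueeze(0).expand(self.prefix.size(0), -1, -1)
        return torch.cat([self.prefix, ctx, self.suffix], dim=1)

def encode_text(embeds, tokens):
    """CLIP's text forward pass, starting from (modified) token embeddings."""
    x = embeds + model.positional_embedding
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x)
    # Features at the EOS position (highest token id), projected and normalized.
    x = x[torch.arange(x.size(0)), tokens.argmax(dim=-1)] @ model.text_projection
    return x / x.norm(dim=-1, keepdim=True)

# Few-shot training step; class names are hypothetical remote sensing scenes.
learner = PromptLearner(["airport", "forest", "river"]).to(device)
optim = torch.optim.SGD(learner.parameters(), lr=2e-3)

def step(images, labels):  # images: preprocessed batch on `device`
    text = encode_text(learner(), learner.tokens)           # (C, D)
    img = model.encode_image(images)
    img = img / img.norm(dim=-1, keepdim=True)
    loss = nn.functional.cross_entropy(model.logit_scale.exp() * img @ text.t(), labels)
    optim.zero_grad(); loss.backward(); optim.step()
    return loss.item()
```

The zero-shot baseline falls out of the same pieces: skip the learner, encode hand-crafted prompts such as "a satellite photo of a {class}" with the frozen text encoder, and take the argmax of the same cosine logits.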
Related papers
- AttriPrompt: Dynamic Prompt Composition Learning for CLIP [41.37140060183439]
AttriPrompt is a novel framework that enhances and refines textual semantic representations. We introduce a Self-Regularization mechanism by applying explicit regularization constraints between the prompted and non-prompted text features. Experiments demonstrate AttriPrompt's superiority over state-of-the-art methods, achieving up to 7.37% improvement in the base-to-novel setting.
arXiv Detail & Related papers (2025-09-07T07:07:59Z)
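The self-regularization idea above is easy to state in code: keep the learned-prompt text features close to the frozen, hand-crafted-prompt features so the model does not drift from CLIP's pretrained semantics. Below is a hedged sketch; the cosine-distance form and the weight `lam` are assumptions, not AttriPrompt's exact formulation.

```python
import torch
import torch.nn.functional as F

def self_regularization(prompted_feats: torch.Tensor,
                        frozen_feats: torch.Tensor,
                        lam: float = 1.0) -> torch.Tensor:
    """Penalize drift between prompted and non-prompted (frozen) text features.

    Both inputs are (num_classes, dim) and assumed L2-normalized; the
    cosine-distance form and weight `lam` are illustrative choices.
    """
    return lam * (1.0 - F.cosine_similarity(prompted_feats, frozen_feats, dim=-1)).mean()
```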
- UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement [25.139037597606233]
Zero-shot domain adaptation (ZSDA) presents substantial challenges due to the lack of images in the target domain. Previous approaches leverage Vision-Language Models (VLMs) to tackle this challenge. We propose the unified prompt and representation enhancement (UPRE) framework, which jointly optimizes both textual prompts and visual representations.
arXiv Detail & Related papers (2025-07-01T13:00:41Z)
- ChordPrompt: Orchestrating Cross-Modal Prompt Synergy for Multi-Domain Incremental Learning in CLIP [12.05848395374439]
Continual learning empowers pre-trained vision-language models to adapt effectively to novel or previously underrepresented data distributions. ChordPrompt introduces cross-modal prompts to leverage interactions between visual and textual information. ChordPrompt outperforms state-of-the-art methods in zero-shot generalization and downstream task performance.
arXiv Detail & Related papers (2025-06-24T13:22:06Z)
- An Empirical Study of Federated Prompt Learning for Vision Language Model [89.2963764404892]
This paper systematically investigates the behavioral differences between text prompt learning and visual prompt learning (VPT) for vision-language models (VLMs) in federated settings. We evaluate the impact of various federated learning (FL) and prompt configurations, such as client scale, aggregation strategies, and prompt length, to assess the robustness of Federated Prompt Learning (FPL).
arXiv Detail & Related papers (2025-05-29T03:09:15Z)
- DiSa: Directional Saliency-Aware Prompt Learning for Generalizable Vision-Language Models [5.027492394254859]
DiSa is a Directional Saliency-Aware Prompt Learning framework. It integrates two complementary regularization strategies to enhance generalization. It consistently outperforms state-of-the-art prompt learning methods across various settings.
arXiv Detail & Related papers (2025-05-26T00:14:52Z)
- Underlying Semantic Diffusion for Effective and Efficient In-Context Learning [113.4003355229632]
Underlying Semantic Diffusion (US-Diffusion) is an enhanced diffusion model that boosts underlying semantics learning, computational efficiency, and in-context learning capabilities. We present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details. We also propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high noise levels.
arXiv Detail & Related papers (2025-03-06T03:06:22Z)
- GlocalCLIP: Object-agnostic Global-Local Prompt Learning for Zero-shot Anomaly Detection [5.530212768657544]
We introduce glocal contrastive learning to improve the complementary learning of global and local prompts. The generalization performance of GlocalCLIP in ZSAD was demonstrated on 15 real-world datasets.
arXiv Detail & Related papers (2024-11-09T05:22:13Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL). GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval. Our method achieves performance comparable to the state of the art while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z)
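For intuition, here is a generic residual-adapter sketch in the spirit of the description above: a small bottleneck MLP whose output is blended back into the frozen CLIP feature. The bottleneck shape and blend ratio are assumptions, and the actual Meta-Adapter additionally conditions on the few-shot support features rather than being a plain MLP.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Bottleneck MLP whose output is blended back into the frozen feature."""
    def __init__(self, dim: int = 512, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
        )
        self.ratio = ratio  # illustrative blend between adapted and original

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        out = self.ratio * self.net(feats) + (1.0 - self.ratio) * feats
        return out / out.norm(dim=-1, keepdim=True)  # renormalize for cosine logits
```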
- Learning Domain Invariant Prompt for Vision-Language Models [31.581652862478965]
We propose a novel prompt learning paradigm, called MetaPrompt, that directly generates a domain-invariant prompt that can be generalized to unseen domains.
Our method consistently and significantly outperforms existing methods.
arXiv Detail & Related papers (2022-12-08T11:23:24Z)
- LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models [67.19124099815645]
We propose a novel Language-Aware Soft Prompting (LASP) learning method to alleviate base class overfitting.
LASP is inherently amenable to including, during training, virtual classes, i.e. class names for which no visual samples are available.
LASP matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets.
arXiv Detail & Related papers (2022-10-03T17:56:35Z)
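The text-to-text idea lends itself to a compact sketch: learned-prompt text features are trained to match their hand-crafted-prompt counterparts via a cross-entropy over classes, which is also what lets image-free "virtual" class names join training. The temperature and the exact loss form below are assumptions, not LASP's verbatim objective.

```python
import torch
import torch.nn.functional as F

def text_to_text_loss(learned_feats: torch.Tensor,
                      handcrafted_feats: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Cross-entropy between learned- and hand-crafted-prompt text features.

    Both inputs are (num_classes, dim), L2-normalized. Row i of the
    similarity matrix should peak at column i, i.e. each learned prompt
    is classified as its own class by the hand-crafted text features.
    """
    logits = learned_feats @ handcrafted_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```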
- Text Recognition in Real Scenarios with a Few Labeled Samples [55.07859517380136]
Scene text recognition (STR) is still a hot research topic in the computer vision field. This paper proposes a few-shot adversarial sequence domain adaptation (FASDA) approach to achieve sequence-level domain adaptation. Our approach can maximize the character-level confusion between the source domain and the target domain.
arXiv Detail & Related papers (2020-06-22T13:03:01Z)