APPLeNet: Visual Attention Parameterized Prompt Learning for Few-Shot
Remote Sensing Image Generalization using CLIP
- URL: http://arxiv.org/abs/2304.05995v1
- Date: Wed, 12 Apr 2023 17:20:37 GMT
- Title: APPLeNet: Visual Attention Parameterized Prompt Learning for Few-Shot
Remote Sensing Image Generalization using CLIP
- Authors: Mainak Singha, Ankit Jha, Bhupendra Solanki, Shirsha Bose and Biplab
Banerjee
- Abstract summary: We propose a novel image-conditioned prompt learning strategy called the Visual Attention conditioned Prompts Learning Network (APPLeNet)
APPLeNet emphasizes the importance of multi-scale feature learning in RS scene classification and disentangles visual style and content primitives for domain generalization tasks.
Our results consistently outperform the relevant literature and code is available at https://github.com/mainaksingha01/APPLeNet.
- Score: 12.73827827842155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the success of large-scale vision-language models (VLMs)
such as CLIP has led to their increased usage in various computer vision tasks.
These models enable zero-shot inference through carefully crafted instructional
text prompts without task-specific supervision. However, the potential of VLMs
for generalization tasks in remote sensing (RS) has not been fully realized. To
address this research gap, we propose a novel image-conditioned prompt learning
strategy called the Visual Attention Parameterized Prompts Learning Network
(APPLeNet). APPLeNet emphasizes the importance of multi-scale feature learning
in RS scene classification and disentangles visual style and content primitives
for domain generalization tasks. To achieve this, APPLeNet combines visual
content features obtained from different layers of the vision encoder and style
properties obtained from feature statistics of domain-specific batches. An
attention-driven injection module is further introduced to generate visual
tokens from this information. We also introduce an anti-correlation regularizer
to ensure discrimination among the token embeddings, as this visual information
is combined with the textual tokens. To validate APPLeNet, we curated four
available RS benchmarks and introduced experimental protocols and datasets for
three domain generalization tasks. Our results consistently outperform the
relevant literature and code is available at
https://github.com/mainaksingha01/APPLeNet
Related papers
- EarthMarker: Visual Prompt Learning for Region-level and Point-level Remote Sensing Imagery Comprehension [12.9701635989222]
The first visual prompting model named EarthMarker is proposed, which excels in image-level, region-level, and point-level RS imagery interpretation.
To endow the EarthMarker with versatile multi-granularity visual perception abilities, the cross-domain phased learning strategy is developed.
To tackle the lack of RS visual prompting data, a dataset named RSVP featuring multi-modal fine-grained visual prompting instruction is constructed.
arXiv Detail & Related papers (2024-07-18T15:35:00Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - Enhancing Visual Document Understanding with Contrastive Learning in
Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo)
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs)
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
arXiv Detail & Related papers (2024-02-29T10:17:27Z) - C-SAW: Self-Supervised Prompt Learning for Image Generalization in
Remote Sensing [12.930814370829893]
We focus on domain and class generalization problems in analyzing optical remote sensing images, using the large-scale pre-trained vision-language model (VLM), CLIP.
Existing prompt learning techniques overlook the importance of incorporating domain and content information into the prompts.
We propose a solution that ensures domain-invariant prompt learning while enhancing the expressiveness of visual features.
arXiv Detail & Related papers (2023-11-27T13:35:20Z) - StyLIP: Multi-Scale Style-Conditioned Prompt Learning for CLIP-based
Domain Generalization [26.08922351077744]
StyLIP is a novel approach for Domain Generalization that enhances CLIP's classification performance across domains.
Our method focuses on a domain-agnostic prompt learning strategy, aiming to disentangle the visual style and content information embedded in CLIP's pre-trained vision encoder.
arXiv Detail & Related papers (2023-02-18T07:36:16Z) - SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for
Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP)
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - Learning Visual Representations with Caption Annotations [19.24013129952071]
We propose a proxy task to learn visual representations over image-caption pairs.
ICMLM consists in predicting masked words in captions by relying on visual cues.
Our experiments confirm that image captions can be leveraged to inject global and localized semantic information into visual representations.
arXiv Detail & Related papers (2020-08-04T08:04:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.