FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding
- URL: http://arxiv.org/abs/2511.14901v1
- Date: Tue, 18 Nov 2025 20:39:15 GMT
- Title: FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding
- Authors: Zhenshi Li, Weikang Yu, Dilxat Muhtar, Xueliang Zhang, Pengfeng Xiao, Pedram Ghamisi, Xiao Xiang Zhu,
- Abstract summary: Current RS image-text datasets generate global captions from object-level labels, leaving object-level supervision underutilized, and existing region-text alignment methods degrade when applied directly to RS data. We propose FarSLIP, a Fine-grained Aligned RS Language-Image Pretraining framework.
- Score: 27.61517650555534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As CLIP's global alignment limits its ability to capture fine-grained details, recent efforts have focused on enhancing its region-text alignment. However, current remote sensing (RS)-specific CLIP variants still inherit this limited spatial awareness. We identify two key limitations behind this: (1) current RS image-text datasets generate global captions from object-level labels, leaving the original object-level supervision underutilized; (2) despite the success of region-text alignment methods in the general domain, their direct application to RS data often leads to performance degradation. To address these, we construct the first multi-granularity RS image-text dataset, MGRS-200k, featuring rich object-level textual supervision for RS region-category alignment. We further investigate existing fine-grained CLIP tuning strategies and find that current explicit region-text alignment methods, whether direct or indirect, underperform due to severe degradation of CLIP's semantic coherence. Building on these findings, we propose FarSLIP, a Fine-grained Aligned RS Language-Image Pretraining framework. Rather than the commonly used patch-to-CLS self-distillation, FarSLIP employs patch-to-patch distillation to align local and global visual cues, which improves feature discriminability while preserving semantic coherence. Additionally, to effectively utilize region-text supervision, it employs simple CLS token-based region-category alignment rather than explicit patch-level alignment, further enhancing spatial awareness. FarSLIP features improved fine-grained vision-language alignment in the RS domain and sets a new state of the art not only on RS open-vocabulary semantic segmentation, but also on image-level tasks such as zero-shot classification and image-text retrieval. Our dataset, code, and models are available at https://github.com/NJU-LHRS/FarSLIP.
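To make the two training signals above concrete, the sketch below gives a minimal PyTorch rendering of (i) a patch-to-patch distillation loss between a tuned student encoder and a frozen teacher, and (ii) a CLS token-based region-category alignment loss against category-name text embeddings. All function names, shapes, pooling choices, and temperatures are illustrative assumptions rather than the released FarSLIP implementation; see the linked repository for the actual code.

```python
# Minimal sketch of the two training signals named in the abstract (illustrative
# only; shapes, pooling, and temperature are assumptions, not the FarSLIP release).
import torch
import torch.nn.functional as F


def patch_to_patch_distillation(student_patches, teacher_patches):
    """One plausible form of patch-to-patch distillation: pull each tuned
    (student) patch token toward the corresponding frozen (teacher) patch token,
    rather than toward the global CLS token as in patch-to-CLS self-distillation.

    student_patches, teacher_patches: (B, N, D) patch embeddings.
    """
    s = F.normalize(student_patches, dim=-1)
    t = F.normalize(teacher_patches, dim=-1).detach()
    # 1 - cosine similarity between corresponding patches, averaged over all patches.
    return (1.0 - (s * t).sum(dim=-1)).mean()


def region_category_alignment(region_cls, category_text_emb, labels, tau=0.07):
    """CLS token-based region-category alignment: classify each region's CLS
    embedding against the text embeddings of the category names.

    region_cls: (R, D) CLS embeddings of object regions.
    category_text_emb: (C, D) text embeddings of category names.
    labels: (R,) integer category index per region.
    """
    v = F.normalize(region_cls, dim=-1)
    t = F.normalize(category_text_emb, dim=-1)
    logits = v @ t.T / tau  # (R, C) region-to-category similarities
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    B, N, D, R, C = 2, 196, 512, 8, 20
    loss = patch_to_patch_distillation(torch.randn(B, N, D), torch.randn(B, N, D))
    loss = loss + region_category_alignment(
        torch.randn(R, D), torch.randn(C, D), torch.randint(0, C, (R,))
    )
    print(float(loss))
```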
Related papers
- SAPL: Semantic-Agnostic Prompt Learning in CLIP for Weakly Supervised Image Manipulation Localization [45.19935082419337]
Malicious image manipulation threatens public safety and requires efficient localization methods. Existing weakly supervised methods rely on image-level binary labels and focus on global classification. We propose Semantic-Agnostic Prompt Learning (SAPL) in CLIP, which learns text prompts that intentionally encode non-semantic, boundary-centric cues.
arXiv Detail & Related papers (2026-01-09T07:25:55Z) - CoPatch: Zero-Shot Referring Image Segmentation by Leveraging Untapped Spatial Knowledge in CLIP [26.827036116024914]
CoPatch is a zero-shot RIS framework that enhances spatial representations in both text and image modalities. We show that CoPatch significantly improves spatial grounding in zero-shot RIS across RefCOCO, RefCOCO+, RefCOCOg, and PhraseCut (+2-7 mIoU) without requiring any additional training.
arXiv Detail & Related papers (2025-09-27T04:12:10Z) - DGL-RSIS: Decoupling Global Spatial Context and Local Class Semantics for Training-Free Remote Sensing Image Segmentation [23.33389872430993]
We propose a training-free framework, DGL-RSIS, that decouples visual and textual inputs. The framework performs vision-language alignment at both the local semantic and global contextual levels. By matching the enhanced text features with mask-guided visual features, we enable mask classification.
arXiv Detail & Related papers (2025-08-30T19:45:25Z) - Exploring CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation [19.26516470653798]
Weakly Supervised Semantic Segmentation (WSSS) with image-level labels aims to achieve pixel-level predictions using Class Activation Maps (CAMs). Recent methods primarily focus on image-text alignment for CAM generation, while CLIP's potential in patch-text alignment remains unexplored. We propose ExCEL to explore CLIP's dense knowledge via a novel patch-text alignment paradigm for WSSS.
arXiv Detail & Related papers (2025-03-26T02:00:49Z) - Contrastive Localized Language-Image Pre-Training [60.4967533101887]
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations. We propose Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules. CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks.
arXiv Detail & Related papers (2024-10-03T17:56:09Z) - DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503]
Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation.
We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity.
Our approach, in particular, allows for a more efficient end-to-end process as a single-stage method.
arXiv Detail & Related papers (2024-09-24T06:51:49Z) - SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition [71.90536979421093]
We propose a Split-and-Synthesize Prompting with Gated Alignments (SSPA) framework to amplify the potential of Vision-Language Models (VLMs).
We develop an in-context learning approach to associate the inherent knowledge from LLMs.
Then we propose a novel Split-and-Synthesize Prompting (SSP) strategy to first model the generic knowledge and downstream label semantics individually.
arXiv Detail & Related papers (2024-07-30T15:58:25Z) - Selective Vision-Language Subspace Projection for Few-shot CLIP [55.361337202198925]
We introduce a method called Selective Vision-Language Subspace Projection (SSP).
SSP incorporates local image features and utilizes them as a bridge to enhance the alignment between image-text pairs.
Our approach entails only training-free matrix calculations and can be seamlessly integrated into advanced CLIP-based few-shot learning frameworks.
arXiv Detail & Related papers (2024-07-24T03:45:35Z) - Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment [4.682326604942316]
We focus on Contrastive Language-Image Pre-training (CLIP), a Vision-Language foundation model that achieves high accuracy across various image classification tasks. There are still domains where zero-shot CLIP performance is far from optimal, such as Remote Sensing (RS) and medical imagery. We propose a methodology to align distinct RS image modalities with the visual and textual modalities of CLIP.
arXiv Detail & Related papers (2024-02-15T09:31:07Z) - Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition [77.93678598476149]
We establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR).
By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow.
Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks.
arXiv Detail & Related papers (2023-10-08T04:00:20Z) - [CLS] Token is All You Need for Zero-Shot Semantic Segmentation [60.06653755695356]
We propose an embarrassingly simple yet highly effective zero-shot semantic segmentation (ZS3) method, based on the pre-trained vision-language model CLIP.
Specifically, we use the [CLS] token output from the text branch, as an auxiliary semantic prompt, to replace the navigation [CLS] token in shallow layers of the ViT-based visual encoder (a toy sketch of this token-swap idea appears after this list).
Our proposed ZS3 method achieves SOTA performance and is even comparable with few-shot semantic segmentation methods.
arXiv Detail & Related papers (2023-04-13T01:35:07Z) - RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets.
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
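As referenced in the [CLS]-token entry above, here is a toy sketch of the token-swap idea: the text branch's [CLS] output overwrites the visual [CLS] slot in the first few transformer blocks, steering the patch tokens toward the queried category. This is a schematic illustration only; the class name TextGuidedViT, the text_proj projection, and the swap_depth of 3 are assumptions, not the authors' released implementation.

```python
# Schematic sketch (assumed shapes/names, not the ZS3 release): a ViT-style
# encoder whose visual [CLS] slot is overwritten by a projected text [CLS]
# embedding in the first `swap_depth` blocks.
import torch
import torch.nn as nn


class TextGuidedViT(nn.Module):
    def __init__(self, dim=512, depth=12, heads=8, swap_depth=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)
        )
        self.swap_depth = swap_depth
        self.text_proj = nn.Linear(dim, dim)  # maps the text [CLS] into visual space

    def forward(self, patch_tokens, cls_token, text_cls):
        # patch_tokens: (B, N, D), cls_token: (B, 1, D), text_cls: (B, D)
        x = torch.cat([cls_token, patch_tokens], dim=1)
        guide = self.text_proj(text_cls).unsqueeze(1)  # (B, 1, D) semantic prompt
        for i, blk in enumerate(self.blocks):
            if i < self.swap_depth:
                # Replace the navigation [CLS] token with the text-derived prompt.
                x = torch.cat([guide, x[:, 1:]], dim=1)
            x = blk(x)
        return x[:, 1:]  # patch tokens, e.g. for dense segmentation logits


if __name__ == "__main__":
    vit = TextGuidedViT()
    patches = vit(torch.randn(2, 196, 512), torch.randn(2, 1, 512), torch.randn(2, 512))
    print(patches.shape)  # torch.Size([2, 196, 512])
```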
This list is automatically generated from the titles and abstracts of the papers on this site.