Multi-Perspective Subimage CLIP with Keyword Guidance for Remote Sensing Image-Text Retrieval
- URL: http://arxiv.org/abs/2601.18190v1
- Date: Mon, 26 Jan 2026 06:16:53 GMT
- Title: Multi-Perspective Subimage CLIP with Keyword Guidance for Remote Sensing Image-Text Retrieval
- Authors: Yifan Li, Shiying Wang, Jianqiang Huang
- Abstract summary: MPS-CLIP is a parameter-efficient framework designed to shift the retrieval paradigm from global matching to keyword-guided fine-grained alignment. Experiments on the RSICD and RSITMD benchmarks demonstrate that MPS-CLIP achieves state-of-the-art performance with 35.18% and 48.40% mean Recall, respectively.
- Score: 18.55080473948215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Pre-training (VLP) models like CLIP have significantly advanced Remote Sensing Image-Text Retrieval (RSITR). However, existing methods predominantly rely on coarse-grained global alignment, which often overlooks the dense, multi-scale semantics inherent in overhead imagery. Moreover, adapting these heavy models via full fine-tuning incurs prohibitive computational costs and risks catastrophic forgetting. To address these challenges, we propose MPS-CLIP, a parameter-efficient framework designed to shift the retrieval paradigm from global matching to keyword-guided fine-grained alignment. Specifically, we leverage a Large Language Model (LLM) to extract core semantic keywords, guiding the Segment Anything Model (SamGeo) to generate semantically relevant sub-perspectives. To efficiently adapt the frozen backbone, we introduce a Gated Global Attention (G^2A) adapter, which captures global context and long-range dependencies with minimal overhead. Furthermore, a Multi-Perspective Representation (MPR) module aggregates these local cues into robust multi-perspective embeddings. The framework is optimized via a hybrid objective combining multi-perspective contrastive and weighted triplet losses, which dynamically selects maximum-response perspectives to suppress noise and enforce precise semantic matching. Extensive experiments on the RSICD and RSITMD benchmarks demonstrate that MPS-CLIP achieves state-of-the-art performance with 35.18% and 48.40% mean Recall (mR), respectively, significantly outperforming full fine-tuning baselines and recent competitive methods. Code is available at https://github.com/Lcrucial1f/MPS-CLIP.
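The abstract names the hybrid objective (multi-perspective contrastive plus weighted triplet losses, with maximum-response perspective selection) but does not spell it out. The PyTorch-style sketch below is only one plausible reading of that description; the tensor shapes, the `max_response_similarity` helper, and all hyperparameters are assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the exact MPS-CLIP loss is not given in the abstract.
# Assumed shapes: img_persp (B, P, D) -- P sub-perspective embeddings per image
#                 txt       (B, D)    -- one text embedding per caption
import torch
import torch.nn.functional as F

def max_response_similarity(img_persp, txt):
    """For each image-text pair, keep only the sub-perspective with the highest response."""
    img_persp = F.normalize(img_persp, dim=-1)            # (B, P, D)
    txt = F.normalize(txt, dim=-1)                         # (B, D)
    sims = torch.einsum('ipd,jd->ijp', img_persp, txt)     # (B, B, P): all pairs, all perspectives
    return sims.max(dim=-1).values                         # (B, B): max-response similarity per pair

def hybrid_loss(img_persp, txt, temperature=0.07, margin=0.2, w_triplet=0.5):
    sim = max_response_similarity(img_persp, txt)          # (B, B)
    labels = torch.arange(sim.size(0), device=sim.device)

    # Multi-perspective contrastive term: symmetric InfoNCE over max-response similarities.
    logits = sim / temperature
    l_con = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

    # Triplet term: positives on the diagonal, hardest in-batch negative per row/column.
    eye = torch.eye(sim.size(0), device=sim.device)
    pos = sim.diag()
    neg_i2t = (sim - 2 * eye).max(dim=1).values            # hardest caption for each image
    neg_t2i = (sim - 2 * eye).max(dim=0).values            # hardest image for each caption
    l_tri = (F.relu(margin + neg_i2t - pos) + F.relu(margin + neg_t2i - pos)).mean()

    return l_con + w_triplet * l_tri
```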
Related papers
- Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models [67.45032003041399]
We propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. MPCO adaptively balances the importance of different paradigm representations and guides the global optimisation. Our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs.
arXiv Detail & Related papers (2026-03-05T06:01:26Z)
- LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation [12.192429756057132]
Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories. LoGoSeg integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that optimally combines local structural information with global semantic context.
arXiv Detail & Related papers (2026-02-05T12:03:11Z)
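LoGoSeg's object existence prior is described only at a high level; a minimal sketch of how a global image-text similarity could gate per-category segmentation logits might look as follows. The function name, variable names, and softmax gating are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn.functional as F

def existence_prior_weighting(pixel_logits, global_img_emb, text_embs, tau=0.07):
    """Hypothetical sketch: down-weight categories whose text embedding is
    dissimilar to the global image embedding (reduces 'hallucinated' classes).

    pixel_logits:   (C, H, W) per-category segmentation logits
    global_img_emb: (D,)      pooled image embedding
    text_embs:      (C, D)    text embeddings of the category names
    """
    img = F.normalize(global_img_emb, dim=-1)
    txt = F.normalize(text_embs, dim=-1)
    existence = torch.softmax(txt @ img / tau, dim=0)   # (C,) category relevance weights
    return pixel_logits * existence[:, None, None]      # gate each category map
```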
- SupScene: Learning Overlap-Aware Global Descriptor for Unconstrained SfM [10.006619357851843]
SupScene is a novel solution that learns global descriptors tailored for finding overlapping image pairs of similar geometric nature for Structure-from-Motion (SfM). Our method achieves state-of-the-art performance, significantly outperforming NetVLAD while introducing a negligible number of additional trainable parameters.
arXiv Detail & Related papers (2026-01-17T06:28:47Z)
- DynaPURLS: Dynamic Refinement of Part-aware Representations for Skeleton-based Zero-Shot Action Recognition [51.80782323686666]
We introduce DynaPURLS, a unified framework that establishes robust, multi-scale visual-semantic correspondences. Our framework leverages a large language model to generate hierarchical textual descriptions that encompass both global movements and local body-part dynamics. Experiments on three large-scale benchmark datasets, including NTU RGB+D 60/120 and PKU-MMD, demonstrate that DynaPURLS significantly outperforms prior art.
arXiv Detail & Related papers (2025-12-12T10:39:10Z)
- Generalized Contrastive Learning for Universal Multimodal Retrieval [53.70202081784898]
Cross-modal retrieval models (e.g., CLIP) show degraded performance when retrieving keys composed of a fused image-text modality. This paper proposes Generalized Contrastive Learning (GCL), a novel loss formulation that improves multimodal retrieval performance without the need for new dataset curation.
arXiv Detail & Related papers (2025-09-30T01:25:04Z)
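The GCL formulation itself is not given in the summary; as a rough illustration of contrasting queries against fused image-text keys (the setting it targets), one might write something like the sketch below. The fusion-by-averaging and all names are assumptions, not the paper's loss.

```python
import torch
import torch.nn.functional as F

def fused_key_contrastive(query_emb, key_img_emb, key_txt_emb, temperature=0.07):
    """Hypothetical sketch: contrast queries against keys that fuse image and
    text embeddings (here by simple averaging) rather than a single modality.

    query_emb:   (B, D) query embeddings
    key_img_emb: (B, D) image part of each key
    key_txt_emb: (B, D) text part of each key
    """
    q = F.normalize(query_emb, dim=-1)
    k = F.normalize(0.5 * (key_img_emb + key_txt_emb), dim=-1)   # fused multimodal key
    logits = q @ k.t() / temperature                              # (B, B)
    labels = torch.arange(q.size(0), device=q.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```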
- RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization [50.75654397516163]
We propose RelayFormer, a unified framework that adapts to varying resolutions and modalities. RelayFormer partitions inputs into fixed-size sub-images and introduces Global-Local Relay (GLR) tokens. This enables efficient exchange of global cues, such as semantic or temporal consistency, while preserving fine-grained manipulation artifacts.
arXiv Detail & Related papers (2025-08-13T03:35:28Z)
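RelayFormer's sub-image partitioning is only named in the summary; a minimal sketch of the partition step, assuming square windows and zero-padding (both assumptions on my part), could look like:

```python
import torch
import torch.nn.functional as F

def partition_into_subimages(x, window=512):
    """Hypothetical sketch: slice an arbitrary-resolution image into fixed-size
    sub-images, zero-padding the borders so every tile has the same shape.

    x: (C, H, W) input image tensor
    returns: (N, C, window, window) stack of sub-images
    """
    c, h, w = x.shape
    pad_h = (window - h % window) % window
    pad_w = (window - w % window) % window
    x = F.pad(x, (0, pad_w, 0, pad_h))                              # pad right/bottom edges
    tiles = x.unfold(1, window, window).unfold(2, window, window)   # (C, nH, nW, win, win)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, window, window)
```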
- Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs [74.74767980885758]
We propose Context-to-Cue Direct Preference Optimization (CcDPO), a multi-level preference optimization framework. CcDPO enhances per-image perception in multi-image settings by zooming into visual clues, from sequential context to local details. Experiments show that CcDPO significantly reduces hallucinations and yields consistent performance gains.
arXiv Detail & Related papers (2025-05-28T14:24:02Z)
- Manifold-aware Representation Learning for Degradation-agnostic Image Restoration [135.90908995927194]
Image Restoration (IR) aims to recover high-quality images from degraded inputs affected by various corruptions such as noise, blur, haze, rain, and low-light conditions. We present MIRAGE, a unified framework for all-in-one IR that explicitly decomposes the input feature space into three semantically aligned parallel branches. This modular decomposition significantly improves generalization and efficiency across diverse degradations.
arXiv Detail & Related papers (2025-05-24T12:52:10Z)
- FG-CLIP: Fine-Grained Visual and Textual Alignment [3.830067625507938]
We propose Fine-Grained CLIP, which enhances fine-grained understanding through three key innovations. We leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details. We construct a comprehensive dataset, termed FineHARD, by integrating high-quality region-specific annotations with hard fine-grained negative samples.
arXiv Detail & Related papers (2025-05-08T09:06:53Z)
- Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality [69.76121008898677]
Fine-grained Selective Calibrated CLIP (FSC-CLIP) integrates local hard negative loss and selective calibrated regularization.
Our evaluations show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities.
arXiv Detail & Related papers (2024-10-07T17:16:20Z)
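The summary names a local hard negative loss without defining it; the toy sketch below only illustrates the general idea of penalizing a hard-negative caption that scores close to the true caption. The margin form, names, and shapes are assumptions rather than FSC-CLIP's actual loss.

```python
import torch
import torch.nn.functional as F

def hard_negative_margin_loss(img_emb, pos_txt_emb, hard_neg_txt_emb, margin=0.2):
    """Hypothetical sketch: the image should match its true caption better than a
    minimally-edited 'hard negative' caption, by at least a margin.

    img_emb, pos_txt_emb, hard_neg_txt_emb: (B, D)
    """
    img = F.normalize(img_emb, dim=-1)
    pos = F.normalize(pos_txt_emb, dim=-1)
    neg = F.normalize(hard_neg_txt_emb, dim=-1)
    sim_pos = (img * pos).sum(-1)                       # (B,) similarity to true caption
    sim_neg = (img * neg).sum(-1)                       # (B,) similarity to hard negative
    return F.relu(margin + sim_neg - sim_pos).mean()
```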
- MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation [26.667974865352708]
MROVSeg is a multi-resolution training framework for open-vocabulary image segmentation with a single pretrained CLIP backbone. It uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder.
arXiv Detail & Related papers (2024-08-27T04:45:53Z)
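As a rough companion to the sliding-window description, here is a hypothetical sketch of running a fixed-input-size encoder over uniform patches and stitching the per-patch feature maps back into a high-resolution grid. The `encoder` interface, window size, and grid arithmetic are assumptions, not MROVSeg's implementation.

```python
import torch

def encode_high_res(x, encoder, window=224, feat_hw=14):
    """Hypothetical sketch: encode uniform patches of a high-resolution image with a
    fixed-input-size encoder and reassemble the per-patch feature maps.

    x:       (C, H, W) with H and W assumed divisible by `window`
    encoder: maps (1, C, window, window) -> (1, D, feat_hw, feat_hw)
    returns: (D, H // window * feat_hw, W // window * feat_hw)
    """
    c, h, w = x.shape
    rows = []
    for top in range(0, h, window):
        row_feats = []
        for left in range(0, w, window):
            patch = x[:, top:top + window, left:left + window].unsqueeze(0)
            row_feats.append(encoder(patch)[0])          # (D, feat_hw, feat_hw)
        rows.append(torch.cat(row_feats, dim=-1))        # concatenate along width
    return torch.cat(rows, dim=-2)                        # concatenate along height
```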
- LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models [45.672539931681065]
We propose a multi-level interaction paradigm for training lightweight CLIP models.
An auxiliary fusion module is proposed that injects unmasked image embeddings into masked text embeddings.
arXiv Detail & Related papers (2023-12-01T15:54:55Z)
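LightCLIP's fusion module is described only as injecting unmasked image embeddings into masked text embeddings; one plausible minimal form is a single cross-attention block like the sketch below. The dimensions and the use of nn.MultiheadAttention are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class ImageToMaskedTextFusion(nn.Module):
    """Hypothetical sketch: masked text tokens attend to unmasked image tokens,
    so image evidence can help fill in the masked positions."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, masked_text_tokens, image_tokens):
        # masked_text_tokens: (B, L_t, D), image_tokens: (B, L_i, D)
        fused, _ = self.attn(query=masked_text_tokens,
                             key=image_tokens,
                             value=image_tokens)
        return self.norm(masked_text_tokens + fused)      # residual injection of image cues
```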
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.