Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval
- URL: http://arxiv.org/abs/2505.16756v1
- Date: Thu, 22 May 2025 14:59:30 GMT
- Title: Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval
- Authors: Hailong Ning, Siying Wang, Tao Lei, Xiaopeng Cao, Huanmin Dou, Bin Zhao, Asoke K. Nandi, Petia Radeva,
- Abstract summary: This study proposes a Representation Discrepancy Bridging (RDB) method for the Remote Image-Text Retrieval (RSITR) task.<n>Experiments on RSICD and RSITMD datasets show that the proposed RDB method achieves a 6%-11% improvement in mR metrics.
- Score: 15.503629941274621
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Remote Sensing Image-Text Retrieval (RSITR) plays a critical role in geographic information interpretation, disaster monitoring, and urban planning by establishing semantic associations between image and textual descriptions. Existing Parameter-Efficient Fine-Tuning (PEFT) methods for Vision-and-Language Pre-training (VLP) models typically adopt symmetric adapter structures for exploring cross-modal correlations. However, the strong discriminative nature of text modality may dominate the optimization process and inhibits image representation learning. The nonnegligible imbalanced cross-modal optimization remains a bottleneck to enhancing the model performance. To address this issue, this study proposes a Representation Discrepancy Bridging (RDB) method for the RSITR task. On the one hand, a Cross-Modal Asymmetric Adapter (CMAA) is designed to enable modality-specific optimization and improve feature alignment. The CMAA comprises a Visual Enhancement Adapter (VEA) and a Text Semantic Adapter (TSA). VEA mines fine-grained image features by Differential Attention (DA) mechanism, while TSA identifies key textual semantics through Hierarchical Attention (HA) mechanism. On the other hand, this study extends the traditional single-task retrieval framework to a dual-task optimization framework and develops a Dual-Task Consistency Loss (DTCL). The DTCL improves cross-modal alignment robustness through an adaptive weighted combination of cross-modal, classification, and exponential moving average consistency constraints. Experiments on RSICD and RSITMD datasets show that the proposed RDB method achieves a 6%-11% improvement in mR metrics compared to state-of-the-art PEFT methods and a 1.15%-2% improvement over the full fine-tuned GeoRSCLIP model.
Related papers
- Implicit Neural Representation-Based Continuous Single Image Super Resolution: An Empirical Study [50.15623093332659]
Implicit neural representation (INR) has become the standard approach for arbitrary-scale image super-resolution (ASSR)<n>We compare existing techniques across diverse settings and present aggregated performance results on multiple image quality metrics.<n>We examine a new loss function that penalizes intensity variations while preserving edges, textures, and finer details during training.
arXiv Detail & Related papers (2026-01-25T07:09:20Z) - Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism [10.17375002962432]
This study proposes a Vision-language Model (VLM) framework integrated with two key innovations.<n>The DRIS adopts a coarse-to-fine approach to adaptively allocate computational resources according to the complexity of image content.<n>The MS-VLAM constructs a three-tier alignment mechanism covering object, local-region and global levels.
arXiv Detail & Related papers (2025-12-29T06:51:20Z) - SCING:Towards More Efficient and Robust Person Re-Identification through Selective Cross-modal Prompt Tuning [17.104905795008555]
We propose a framework named Selective Cross-modal Prompt Tuning (SCING) that enhances cross-modal alignment and robustness against real-worlds.<n>Our framework eliminates heavy adapters while maintaining efficient inference, achieving an optimal trade-off between performance and computational overhead.
arXiv Detail & Related papers (2025-07-01T07:21:31Z) - DiffRIS: Enhancing Referring Remote Sensing Image Segmentation with Pre-trained Text-to-Image Diffusion Models [9.109484087832058]
DiffRIS is a novel framework that harnesses the semantic understanding capabilities of pre-trained text-to-image diffusion models for RRSIS tasks.<n>Our framework introduces two key innovations: a context perception adapter (CP-adapter) and a cross-modal reasoning decoder (PCMRD)
arXiv Detail & Related papers (2025-06-23T02:38:56Z) - Quantitative Comparison of Fine-Tuning Techniques for Pretrained Latent Diffusion Models in the Generation of Unseen SAR Images [0.0]
We adapt an open-source text-to-image foundation model to the Synthetic Aperture Radar (SAR) modality.<n>We compare full fine-tuning and parameter-efficient Low-Rank Adaptation (LoRA) across the UNet diffusion backbone, the Variational Autoencoder (VAE) and the text encoders.<n>Our results show that a hybrid strategy-full UNet tuning with LoRA on the text encoders and a learned token embedding-best preserves SAR geometry and texture.
arXiv Detail & Related papers (2025-06-16T09:48:01Z) - Optimal Transport Adapter Tuning for Bridging Modality Gaps in Few-Shot Remote Sensing Scene Classification [80.83325513157637]
Few-Shot Remote Sensing Scene Classification (FS-RSSC) presents the challenge of classifying remote sensing images with limited labeled samples.<n>We propose a novel Optimal Transport Adapter Tuning (OTAT) framework aimed at constructing an ideal Platonic representational space.
arXiv Detail & Related papers (2025-03-19T07:04:24Z) - Multi-Modality Driven LoRA for Adverse Condition Depth Estimation [61.525312117638116]
We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation.<n>It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL)<n>It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
arXiv Detail & Related papers (2024-12-28T14:23:58Z) - Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression.<n>To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM)<n>To further forster the research of RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z) - Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel SSC framework - Adrial Modality Modulation Network (AMMNet)
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - OT-Attack: Enhancing Adversarial Transferability of Vision-Language
Models via Optimal Transport Optimization [65.57380193070574]
Vision-language pre-training models are vulnerable to multi-modal adversarial examples.
Recent works have indicated that leveraging data augmentation and image-text modal interactions can enhance the transferability of adversarial examples.
We propose an Optimal Transport-based Adversarial Attack, dubbed OT-Attack.
arXiv Detail & Related papers (2023-12-07T16:16:50Z) - SCMM: Calibrating Cross-modal Representations for Text-Based Person Search [43.17325362167387]
Text-Based Person Search (TBPS) is a crucial task in the Internet of Things (IoT) domain.<n>For cross-modal TBPS tasks, it is critical to obtain well-distributed representation in the common space.<n>We present Sew embedding and Masked Modeling (SCMM) that calibrates cross-modal representations by learning compact and well-aligned embeddings.
arXiv Detail & Related papers (2023-04-05T07:50:16Z) - ResiDualGAN: Resize-Residual DualGAN for Cross-Domain Remote Sensing
Images Semantic Segmentation [15.177834801688979]
The performance of a semantic segmentation model for remote sensing (RS) images pretrained on an annotated dataset would greatly decrease when testing on another unannotated dataset because of the domain gap.
Adversarial generative methods, e.g., DualGAN, are utilized for unpaired image-to-image translation to minimize the pixel-level domain gap.
In this paper, ResiDualGAN is proposed for RS images translation, where a resizer module is used for addressing the scale discrepancy of RS datasets.
arXiv Detail & Related papers (2022-01-27T13:56:54Z) - Self-supervised Correlation Mining Network for Person Image Generation [9.505343361614928]
Person image generation aims to perform non-rigid deformation on source images.
We propose a Self-supervised Correlation Mining Network (SCM-Net) to rearrange the source images in the feature space.
For improving the fidelity of cross-scale pose transformation, we propose a graph based Body Structure Retaining Loss.
arXiv Detail & Related papers (2021-11-26T03:57:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.