Related papers: PMPGuard: Catching Pseudo-Matched Pairs in Remote Sensing Image-Text Retrieval

URL: http://arxiv.org/abs/2512.18660v1
Date: Sun, 21 Dec 2025 09:16:11 GMT
Title: PMPGuard: Catching Pseudo-Matched Pairs in Remote Sensing Image-Text Retrieval
Authors: Pengxiang Ouyang, Qing Ma, Zheng Wang, Cong Bai
Abstract summary: Remote sensing (RS) image-text retrieval faces challenges due to the presence of Pseudo-Matched Pairs (PMPs). We propose a novel retrieval framework that leverages Cross-Modal Gated Attention and a Positive-Negative Awareness Attention mechanism. Our method consistently achieves state-of-the-art performance, highlighting its robustness and effectiveness in handling real-world mismatches and PMPs in RS image-text retrieval tasks.
Score: 17.251288844354914
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Remote sensing (RS) image-text retrieval faces significant challenges in real-world datasets due to the presence of Pseudo-Matched Pairs (PMPs), semantically mismatched or weakly aligned image-text pairs, which hinder the learning of reliable cross-modal alignments. To address this issue, we propose a novel retrieval framework that leverages Cross-Modal Gated Attention and a Positive-Negative Awareness Attention mechanism to mitigate the impact of such noisy associations. The gated module dynamically regulates cross-modal information flow, while the awareness mechanism explicitly distinguishes informative (positive) cues from misleading (negative) ones during alignment learning. Extensive experiments on three benchmark RS datasets, i.e., RSICD, RSITMD, and RS5M, demonstrate that our method consistently achieves state-of-the-art performance, highlighting its robustness and effectiveness in handling real-world mismatches and PMPs in RS image-text retrieval tasks.
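The abstract says the gated module "dynamically regulates cross-modal information flow" but gives no equations. The sketch below illustrates one common way such a gate can work and is not the authors' implementation: a query vector from one modality attends over vectors from the other, and a scalar sigmoid gate decides how much of the attended context replaces the original query. The function name `gated_cross_attention`, the scalar gate, and the parameters `gate_weights` and `gate_bias` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_cross_attention(query, keys, gate_weights, gate_bias=0.0):
    """Hypothetical gated cross-modal attention (illustrative only).

    query: one feature vector, e.g. a word embedding.
    keys:  feature vectors from the other modality, e.g. image regions.
    gate_weights: vector of length 2 * len(query) that scores [query; context].
    """
    # Scaled dot-product attention from the query over the other modality.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    context = [
        sum(w * k[i] for w, k in zip(weights, keys))
        for i in range(len(query))
    ]
    # Scalar sigmoid gate computed from the concatenation [query; context].
    gate_input = dot(gate_weights, query + context) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-gate_input))
    # gate near 1 lets the cross-modal context through; near 0 keeps the query.
    fused = [gate * c + (1.0 - gate) * q for q, c in zip(query, context)]
    return fused, gate, weights
```

With a strongly negative `gate_bias` the gate closes and the output stays close to the original query, which is the behaviour one would want for a pair suspected of being pseudo-matched. In a trained model the gate parameters would be learned rather than set by hand.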

Related papers

Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations [56.816929931908824]
We pioneer the detection of semantically-coordinated manipulations in multimodal data. We propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. Our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches.
arXiv Detail & Related papers (2025-09-16T04:18:48Z)
A Cross-Modal Rumor Detection Scheme via Contrastive Learning by Exploring Text and Image internal Correlations [15.703292627605304]
This paper presents a novel cross-modal rumor detection scheme based on contrastive learning. A scale-aware fusion network is designed to integrate the highly pertinent multi-scale image features with global text features. The experimental results demonstrate that it achieves a substantial performance improvement over existing state-of-the-art approaches in rumor detection.
arXiv Detail & Related papers (2025-08-15T01:13:50Z)
Data-Efficient Generalization for Zero-shot Composed Image Retrieval [67.46975191141928]
ZS-CIR aims to retrieve the target image based on a reference image and a text description without requiring in-distribution triplets for training. One prevalent approach follows the vision-language pretraining paradigm that employs a mapping network to transfer the image embedding to a pseudo-word token in the text embedding space. We propose a Data-efficient Generalization (DeG) framework, including two novel designs, namely, Textual Supplement (TS) module and Semantic-Set (S-Set).
arXiv Detail & Related papers (2025-03-07T07:49:31Z)
Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval [16.995114000869833]
We propose CMPAGL, a cross-modal pre-aligned method leveraging global and local information. Our Gswin transformer block combines local window self-attention and global-local window cross-attention to capture multi-scale features. Experiments on four datasets, including RSICD and RSITMD, validate CMPAGL's effectiveness.
arXiv Detail & Related papers (2024-11-22T03:28:55Z)
Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs. We propose a more realistic setting in which only noisy text and its NER labels are available. We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval [43.47770490199544]
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query, which is configured with an image and a caption. We introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations. We also introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed.
arXiv Detail & Related papers (2024-05-01T15:19:54Z)
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing. Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery. We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
Toward Real Text Manipulation Detection: New Dataset and New Solution [58.557504531896704]
High costs associated with professional text manipulation limit the availability of real-world datasets. We present the Real Text Manipulation dataset, encompassing 14,250 text images. Our contributions aim to propel advancements in real-world text tampering detection.
arXiv Detail & Related papers (2023-12-12T02:10:16Z)
Noisy-Correspondence Learning for Text-to-Image Person Re-identification [50.07634676709067]
We propose a novel Robust Dual Embedding method (RDE) to learn robust visual-semantic associations even with noisy correspondences. Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on three datasets.
arXiv Detail & Related papers (2023-08-19T05:34:13Z)
PAIF: Perception-Aware Infrared-Visible Image Fusion for Attack-Tolerant Semantic Segmentation [50.556961575275345]
We propose a perception-aware fusion framework to promote segmentation robustness in adversarial scenes. We show that our scheme substantially enhances the robustness, with gains of 15.3% mIOU, compared with advanced competitors.
arXiv Detail & Related papers (2023-08-08T01:55:44Z)
Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval [21.05804942940532]
Cross-modal text-image retrieval has attracted extensive attention for its advantages of flexible input and efficient query. To cope with the problem of multi-scale scarcity and target redundancy in the RS multimodal retrieval task, we propose a novel asymmetric multimodal feature matching network (AMFMN). Our model adapts to multi-scale feature inputs, favors multi-source retrieval methods, and can dynamically filter redundant features.
arXiv Detail & Related papers (2022-04-21T03:53:19Z)
