Scale-Semantic Joint Decoupling Network for Image-text Retrieval in
Remote Sensing
- URL: http://arxiv.org/abs/2212.05752v1
- Date: Mon, 12 Dec 2022 08:02:35 GMT
- Title: Scale-Semantic Joint Decoupling Network for Image-text Retrieval in
Remote Sensing
- Authors: Chengyu Zheng, Ning Song, Ruoyu Zhang, Lei Huang, Zhiqiang Wei, Jie Nie (corresponding author)
- Abstract summary: We propose a novel Scale-Semantic Joint Decoupling Network (SSJDN) for remote sensing image-text retrieval.
Our proposed SSJDN outperforms state-of-the-art approaches in numerical experiments conducted on four benchmark remote sensing datasets.
- Score: 23.598273691455503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-text retrieval in remote sensing aims to provide flexible information
for data analysis and application. In recent years, state-of-the-art methods have been dedicated to ``scale decoupling'' and ``semantic decoupling'' strategies to further enhance the capability of representation. However, these previous approaches focus on disentangling either scale or semantics, and neglect to merge the two ideas in a unified model, which severely limits the performance of cross-modal retrieval models. To address these issues, we
propose a novel Scale-Semantic Joint Decoupling Network (SSJDN) for remote
sensing image-text retrieval. Specifically, we design the Bidirectional Scale
Decoupling (BSD) module, which exploits Salience Feature Extraction (SFE) and
Salience-Guided Suppression (SGS) units to adaptively extract potential
features and suppress cumbersome features at other scales in a bidirectional
pattern to yield different scale clues. Besides, we design the Label-supervised
Semantic Decoupling (LSD) module by leveraging the category semantic labels as
prior knowledge to supervise images and texts probing significant
semantic-related information. Finally, we design a Semantic-guided Triple Loss (STL), which adaptively generates a constant to adjust the loss function, improving the probability of matching images and texts that share the same semantics and shortening the convergence time of the retrieval model. Our proposed SSJDN
outperforms state-of-the-art approaches in numerical experiments conducted on
four benchmark remote sensing datasets.
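The abstract describes STL only at a high level. As a rough illustration of how an adaptively generated constant can adjust a triplet objective in favor of same-semantic matches, here is a minimal PyTorch sketch; the margin-shrinking rule, the function name, and all hyperparameters are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def semantic_guided_triplet_loss(img_emb, txt_emb, labels, base_margin=0.2, shrink=0.5):
    """Illustrative semantic-guided triplet loss: negatives that share the
    anchor's category label get a smaller margin, so the loss pushes harder
    on semantically unrelated pairs. The scaling rule is an assumption."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t()                       # (B, B) cosine similarity
    pos = sim.diag().unsqueeze(1)                     # matched image-text pairs
    margin = torch.full_like(sim, base_margin)
    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)
    margin[same_class] = base_margin * shrink         # softer margin, same category
    diag = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_i2t = (margin + sim - pos).clamp(min=0).masked_fill(diag, 0)
    cost_t2i = (margin + sim.t() - pos).clamp(min=0).masked_fill(diag, 0)
    return cost_i2t.max(dim=1).values.mean() + cost_t2i.max(dim=1).values.mean()
```

Shrinking the margin for same-category negatives is one plausible reading of "adaptively generates a constant"; the paper should be consulted for the exact formulation.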
Related papers
- AerOSeg: Harnessing SAM for Open-Vocabulary Segmentation in Remote Sensing Images [21.294581646546124]
AerOSeg is a novel Open-Vocabulary Segmentation (OVS) approach for remote sensing data.
We compute robust image-text correlation features using rotated versions of the input image and domain-specific prompts.
Inspired by the success of the Segment Anything Model (SAM) in diverse domains, we leverage SAM features to guide the spatial refinement of correlation features.
We enhance the refined correlation features using a multi-scale attention-aware composition to produce the final segmentation map.
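The rotation-averaging idea in the summary above can be sketched generically: compute an image-text correlation map for each rotated copy of the input and rotate the maps back before averaging. Everything here (the `encode_image` callable, shapes, and names) is assumed for illustration, not AerOSeg's actual pipeline.

```python
import torch

def rotation_averaged_correlation(encode_image, text_emb, image):
    """Hedged sketch: correlate text embeddings with image features for
    0/90/180/270-degree rotations of the input, then average the maps back
    in the original frame for a rotation-robust correlation map."""
    corrs = []
    for k in range(4):
        rotated = torch.rot90(image, k, dims=(-2, -1))      # rotate the input
        feat = encode_image(rotated)                        # (C, H, W) features
        corr = torch.einsum("chw,nc->nhw", feat, text_emb)  # per-prompt correlation
        corrs.append(torch.rot90(corr, -k, dims=(-2, -1)))  # undo the rotation
    return torch.stack(corrs).mean(dim=0)                   # averaged (N, H, W) map
```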
arXiv Detail & Related papers (2025-04-12T13:06:46Z)
- Data-Efficient Generalization for Zero-shot Composed Image Retrieval [67.46975191141928]
ZS-CIR aims to retrieve the target image based on a reference image and a text description without requiring in-distribution triplets for training.
One prevalent approach follows the vision-language pretraining paradigm that employs a mapping network to transfer the image embedding to a pseudo-word token in the text embedding space.
We propose a Data-efficient Generalization (DeG) framework with two novel designs, namely, the Textual Supplement (TS) module and the Semantic-Set (S-Set) module.
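The pseudo-word mapping mentioned above is typically realized as a small network from image-embedding space into the text token-embedding space; below is a minimal sketch under that assumption (class name, dimensions, and the placeholder convention are illustrative, not DeG's actual design).

```python
import torch
import torch.nn as nn

class PseudoWordMapper(nn.Module):
    """Hedged sketch of the ZS-CIR mapping-network idea: project a frozen
    image embedding into the text token-embedding space so it can be
    spliced into a caption as a pseudo-word. Sizes are illustrative."""
    def __init__(self, img_dim=512, tok_dim=512, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim, hidden), nn.GELU(),
            nn.Linear(hidden, tok_dim),
        )

    def forward(self, img_emb):
        # One pseudo-token per image, e.g. substituted for a "*" placeholder
        # in templates like "a photo of * that <relative caption>".
        return self.mlp(img_emb)
```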
arXiv Detail & Related papers (2025-03-07T07:49:31Z) - SpecDM: Hyperspectral Dataset Synthesis with Pixel-level Semantic Annotations [27.391859339238906]
In this paper, we explore the potential of generative diffusion models in synthesizing hyperspectral images with pixel-level annotations.
To the best of our knowledge, it is the first work to generate high-dimensional HSIs with annotations.
We select two of the most widely used dense prediction tasks: semantic segmentation and change detection, and generate datasets suitable for these tasks.
arXiv Detail & Related papers (2025-02-24T11:13:37Z) - Semantic-aware Representation Learning for Homography Estimation [28.70450397793246]
We propose SRMatcher, a detector-free feature matching method, which encourages the network to learn integrated semantic feature representations.
By reducing errors stemming from semantic inconsistencies in matching pairs, our proposed SRMatcher is able to deliver more accurate and realistic outcomes.
arXiv Detail & Related papers (2024-07-18T08:36:28Z) - Dual-stream contrastive predictive network with joint handcrafted
feature view for SAR ship classification [9.251342335645765]
We propose a novel dual-stream contrastive predictive network (DCPNet) built on two tasks.
The first task is to construct positive sample pairs, guiding the core encoder to learn more general representations.
The second task encourages adaptive capture of the correspondence between deep features and handcrafted features, achieving knowledge transfer within the model and effectively reducing the redundancy caused by feature fusion.
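As a hedged sketch of the dual-view idea, an InfoNCE-style objective that treats the deep and handcrafted features of the same sample as a positive pair might look like the following (the names and the loss form are assumptions, not DCPNet's exact tasks).

```python
import torch
import torch.nn.functional as F

def dual_view_contrastive_loss(deep_feat, handcrafted_feat, temperature=0.07):
    """Illustrative dual-stream contrastive objective: the deep feature and
    the handcrafted feature of the same SAR chip form a positive pair; all
    other chips in the batch serve as negatives (symmetric InfoNCE)."""
    z1 = F.normalize(deep_feat, dim=-1)
    z2 = F.normalize(handcrafted_feat, dim=-1)
    logits = z1 @ z2.t() / temperature            # (B, B) similarity logits
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetric loss: deep -> handcrafted and handcrafted -> deep.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```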
arXiv Detail & Related papers (2023-11-26T05:47:01Z) - Contrastive Transformer Learning with Proximity Data Generation for
Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text
Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
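Momentum contrast itself is a standard mechanism: a slowly updated copy of the encoder provides stable keys for a queue of negatives. Below is a generic sketch of that update, not USER's specific code.

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """Generic momentum-contrast update (MoCo-style): the key encoder is an
    exponential moving average of the query encoder, giving stable targets
    for contrastive learning. A sketch of the mechanism the name refers to."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```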
arXiv Detail & Related papers (2023-01-17T12:42:58Z) - ContraFeat: Contrasting Deep Features for Semantic Discovery [102.4163768995288]
StyleGAN has shown strong potential for disentangled semantic control.
Existing semantic discovery methods on StyleGAN rely on manual selection of modified latent layers to obtain satisfactory manipulation results.
We propose a model that automates this process and achieves state-of-the-art semantic discovery performance.
arXiv Detail & Related papers (2022-12-14T15:22:13Z) - Similarity-Aware Fusion Network for 3D Semantic Segmentation [87.51314162700315]
We propose a similarity-aware fusion network (SAFNet) to adaptively fuse 2D images and 3D point clouds for 3D semantic segmentation.
We employ a late fusion strategy where we first learn the geometric and contextual similarities between the input and back-projected (from 2D pixels) point clouds.
We show that SAFNet significantly outperforms existing state-of-the-art fusion-based approaches across various levels of data integrity.
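One plausible reading of similarity-aware fusion is a learned per-point gate over the 2D and 3D features; the sketch below assumes that form (the gating design and names are illustrative, not SAFNet's actual module).

```python
import torch
import torch.nn as nn

class SimilarityAwareFusion(nn.Module):
    """Hedged sketch of similarity-aware late fusion: per-point weights are
    predicted from the agreement between 3D point features and 2D features
    back-projected onto the same points."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, feat3d, feat2d):
        # feat3d, feat2d: (N_points, dim). w -> 1 trusts the 2D view;
        # w -> 0 falls back to geometry when the projection disagrees.
        w = torch.sigmoid(self.gate(torch.cat([feat3d, feat2d], dim=-1)))
        return w * feat2d + (1 - w) * feat3d
```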
arXiv Detail & Related papers (2021-07-04T09:28:18Z) - Graph Pattern Loss based Diversified Attention Network for Cross-Modal
Retrieval [10.420129873840578]
Cross-modal retrieval aims to enable a flexible retrieval experience by combining multimedia data such as image, video, text, and audio.
A core idea of unsupervised approaches is to mine the correlations among different object representations to achieve satisfactory retrieval performance without requiring expensive labels.
We propose a Graph Pattern Loss based Diversified Attention Network (GPLDAN) for unsupervised cross-modal retrieval.
arXiv Detail & Related papers (2021-06-25T10:53:07Z) - Dual Attention GANs for Semantic Image Synthesis [101.36015877815537]
We propose a novel Dual Attention GAN (DAGAN) to synthesize photo-realistic and semantically-consistent images.
We also propose two novel modules, i.e., a position-wise Spatial Attention Module (SAM) and a scale-wise Channel Attention Module (CAM).
DAGAN achieves remarkably better results than state-of-the-art methods, while using fewer model parameters.
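Position-wise spatial attention and channel attention are standard building blocks; the sketch below shows generic versions as stand-ins (these are assumptions about the general pattern, not DAGAN's exact SAM/CAM).

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excite style channel attention: a hedged stand-in for a
    scale-wise CAM. Global pooling yields per-channel reweighting."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))         # global pooling -> channel weights
        return x * w.unsqueeze(-1).unsqueeze(-1)

class SpatialAttention(nn.Module):
    """1x1-conv spatial gate as a stand-in for a position-wise SAM."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))  # per-position weights
```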
arXiv Detail & Related papers (2020-08-29T17:49:01Z)
We address weakly supervised semantic matching based on a deep network.
We explicitly estimate the foreground regions to suppress the effect of background clutter.
We develop cycle-consistent losses to enforce the predicted transformations across multiple images to be geometrically plausible and consistent.
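A cycle-consistency loss over predicted transformations can be sketched generically: composing the transforms around a loop of images should return the identity. The homogeneous-matrix formulation below is an assumption for illustration, not this paper's exact loss.

```python
import torch

def cycle_consistency_loss(T_ab, T_bc, T_ca):
    """Hedged sketch of a cycle-consistency loss on predicted geometric
    transformations: composing A->B, B->C, and C->A around a cycle should
    yield the identity. Uses batched 3x3 homogeneous matrices."""
    composed = T_ca @ T_bc @ T_ab                        # (B, 3, 3) round trip
    eye = torch.eye(3, device=composed.device).expand_as(composed)
    return (composed - eye).pow(2).mean()                # deviation from identity
```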
arXiv Detail & Related papers (2020-03-31T22:38:09Z)