Scale-Semantic Joint Decoupling Network for Image-text Retrieval in
Remote Sensing
- URL: http://arxiv.org/abs/2212.05752v1
- Date: Mon, 12 Dec 2022 08:02:35 GMT
- Title: Scale-Semantic Joint Decoupling Network for Image-text Retrieval in
Remote Sensing
- Authors: Chengyu Zheng, Ning Song, Ruoyu Zhang, Lei Huang, Zhiqiang Wei, Jie
Nie (corresponding author)
- Abstract summary: We propose a novel Scale-Semantic Joint Decoupling Network (SSJDN) for remote sensing image-text retrieval.
Our proposed SSJDN outperforms state-of-the-art approaches in numerical experiments conducted on four benchmark remote sensing datasets.
- Score: 23.598273691455503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-text retrieval in remote sensing aims to provide flexible information
for data analysis and application. In recent years, state-of-the-art methods
have pursued ``scale decoupling'' and ``semantic decoupling'' strategies to
further enhance representation capability. However, these previous approaches
focus on disentangling either scale or semantics, but neglect to merge the two
ideas in a unified model, which severely limits the performance of cross-modal
retrieval models. To address these issues, we
propose a novel Scale-Semantic Joint Decoupling Network (SSJDN) for remote
sensing image-text retrieval. Specifically, we design the Bidirectional Scale
Decoupling (BSD) module, which exploits Salience Feature Extraction (SFE) and
Salience-Guided Suppression (SGS) units to adaptively extract potential
features and suppress redundant features at other scales in a bidirectional
pattern, yielding clues at different scales. In addition, we design the
Label-supervised Semantic Decoupling (LSD) module, which leverages category
semantic labels as prior knowledge to guide images and texts in probing
significant semantic-related information. Finally, we design a Semantic-guided
Triple Loss (STL), which adaptively generates a constant that adjusts the loss
function, improving the probability of matching images and texts of the same
semantics and shortening the convergence time of the retrieval model. Our
proposed SSJDN
outperforms state-of-the-art approaches in numerical experiments conducted on
four benchmark remote sensing datasets.
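To make the scale-decoupling idea concrete, here is a minimal sketch of one direction of such a step. The sigmoid salience gate standing in for the SFE unit and the complementary (1 - s) gate standing in for the SGS unit are illustrative assumptions, not the authors' actual module designs:

```python
import torch
import torch.nn as nn

class ScaleDecoupleStep(nn.Module):
    """One direction of a bidirectional scale-decoupling step (illustrative)."""

    def __init__(self, channels: int):
        super().__init__()
        # SFE stand-in: a 1x1 conv producing a per-pixel salience map in [0, 1].
        self.salience = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, own_scale: torch.Tensor, other_scale: torch.Tensor):
        s = self.salience(own_scale)          # (B, 1, H, W) salience map
        extracted = own_scale * s             # keep salient clues at this scale
        suppressed = other_scale * (1.0 - s)  # SGS stand-in: damp the matching
                                              # regions at the other scale
        return extracted, suppressed

# Toy usage, assuming the two scales are resized to matching spatial shapes.
coarse, fine = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
extracted, suppressed = ScaleDecoupleStep(64)(coarse, fine)
# A "bidirectional pattern" would also run the symmetric fine-to-coarse step,
# so that each scale ends up carrying its own distinct clues.
```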
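The label-supervised idea can likewise be read as a classification-style constraint: the same category labels supervise both modalities, pushing each embedding toward its semantic-related content. The shared linear classifier below is an assumed stand-in for the LSD module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, dim = 10, 256
classifier = nn.Linear(dim, num_classes)      # assumption: shared across modalities

img_emb = torch.randn(8, dim)                 # image embeddings (batch of 8)
txt_emb = torch.randn(8, dim)                 # paired text embeddings
labels = torch.randint(0, num_classes, (8,))  # category labels as prior knowledge

# Supervising both branches with the same labels encourages each modality
# to probe the significant semantic-related part of its representation.
lsd_loss = (F.cross_entropy(classifier(img_emb), labels)
            + F.cross_entropy(classifier(txt_emb), labels))
```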
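For the loss, one plausible reading of the adaptively generated constant is a category-aware margin in a hard-negative triplet objective: demand a larger similarity gap when the hardest negative comes from a different semantic category. This formulation is a hedged sketch, not the paper's exact STL:

```python
import torch
import torch.nn.functional as F

def semantic_triplet_loss(img, txt, labels, base_margin=0.2, boost=0.2):
    """Triplet loss whose margin adapts to the semantics of the hardest negative."""
    img = F.normalize(img, dim=1)
    txt = F.normalize(txt, dim=1)
    sim = img @ txt.t()                        # cosine similarity matrix
    pos = sim.diag()                           # matched image-text pairs
    mask = torch.eye(len(img), dtype=torch.bool, device=img.device)
    hard_neg, idx = sim.masked_fill(mask, -1.0).max(dim=1)  # hardest negative text
    # Assumed "semantic-guided constant": a larger margin when the hardest
    # negative belongs to a different category than the anchor image.
    same_category = (labels[idx] == labels).float()
    margin = base_margin + boost * (1.0 - same_category)
    return F.relu(margin + hard_neg - pos).mean()

loss = semantic_triplet_loss(torch.randn(8, 256), torch.randn(8, 256),
                             torch.randint(0, 10, (8,)))
```

Keeping the margin smaller when the negative already shares the anchor's category avoids over-penalizing semantically plausible matches, which is one way such a constant could raise the probability of matching same-semantic pairs and shorten convergence.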
Related papers
- Semantic-aware Representation Learning for Homography Estimation [28.70450397793246]
We propose SRMatcher, a detector-free feature matching method, which encourages the network to learn integrated semantic feature representation.
By reducing errors stemming from semantic inconsistencies in matching pairs, our proposed SRMatcher is able to deliver more accurate and realistic outcomes.
arXiv Detail & Related papers (2024-07-18T08:36:28Z)
- Dual-stream contrastive predictive network with joint handcrafted feature view for SAR ship classification [9.251342335645765]
We propose a novel dual-stream contrastive predictive network (DCPNet).
The first task constructs positive sample pairs, guiding the core encoder to learn more general representations.
The second task encourages adaptive capture of the correspondence between deep features and handcrafted features, achieving knowledge transfer within the model and effectively reducing the redundancy caused by feature fusion.
arXiv Detail & Related papers (2023-11-26T05:47:01Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and insufficient annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- ContraFeat: Contrasting Deep Features for Semantic Discovery [102.4163768995288]
StyleGAN has shown strong potential for disentangled semantic control.
Existing semantic discovery methods on StyleGAN rely on manual selection of modified latent layers to obtain satisfactory manipulation results.
We propose a model that automates this process and achieves state-of-the-art semantic discovery performance.
arXiv Detail & Related papers (2022-12-14T15:22:13Z)
- Similarity-Aware Fusion Network for 3D Semantic Segmentation [87.51314162700315]
We propose a similarity-aware fusion network (SAFNet) to adaptively fuse 2D images and 3D point clouds for 3D semantic segmentation.
We employ a late fusion strategy where we first learn the geometric and contextual similarities between the input and back-projected (from 2D pixels) point clouds.
We show that SAFNet significantly outperforms existing state-of-the-art fusion-based approaches across various levels of data integrity.
arXiv Detail & Related papers (2021-07-04T09:28:18Z)
- Graph Pattern Loss based Diversified Attention Network for Cross-Modal Retrieval [10.420129873840578]
Cross-modal retrieval aims to enable flexible retrieval experience by combining multimedia data such as image, video, text, and audio.
A core aim of unsupervised approaches is to mine the correlations among different object representations, achieving satisfactory retrieval performance without requiring expensive labels.
We propose a Graph Pattern Loss based Diversified Attention Network (GPLDAN) for unsupervised cross-modal retrieval.
arXiv Detail & Related papers (2021-06-25T10:53:07Z)
- Dual Attention GANs for Semantic Image Synthesis [101.36015877815537]
We propose a novel Dual Attention GAN (DAGAN) to synthesize photo-realistic and semantically-consistent images.
We also propose two novel modules, i.e., the position-wise Spatial Attention Module (SAM) and the scale-wise Channel Attention Module (CAM).
DAGAN achieves remarkably better results than state-of-the-art methods, while using fewer model parameters.
arXiv Detail & Related papers (2020-08-29T17:49:01Z)
- Deep Semantic Matching with Foreground Detection and Cycle-Consistency [103.22976097225457]
We address weakly supervised semantic matching based on a deep network.
We explicitly estimate the foreground regions to suppress the effect of background clutter.
We develop cycle-consistent losses to enforce the predicted transformations across multiple images to be geometrically plausible and consistent.
arXiv Detail & Related papers (2020-03-31T22:38:09Z)
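The cycle-consistency loss in the last entry above follows a standard construction that is easy to pin down: composing the predicted transformation from image A to B with the one from B back to A should approximate the identity map. The 2D affine parameterization below is an illustrative assumption, not that paper's exact setup:

```python
import torch

def to_matrix(theta: torch.Tensor) -> torch.Tensor:
    """Lift (B, 6) affine parameters to (B, 3, 3) homogeneous matrices."""
    b = theta.shape[0]
    m = torch.zeros(b, 3, 3)
    m[:, :2, :] = theta.view(b, 2, 3)
    m[:, 2, 2] = 1.0
    return m

def cycle_loss(theta_ab: torch.Tensor, theta_ba: torch.Tensor) -> torch.Tensor:
    # A -> B followed by B -> A should compose to (approximately) the identity.
    composed = to_matrix(theta_ba) @ to_matrix(theta_ab)
    eye = torch.eye(3).expand_as(composed)
    return ((composed - eye) ** 2).mean()

loss = cycle_loss(torch.randn(4, 6), torch.randn(4, 6))
```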
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.