Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote
Sensing Image Retrieval
- URL: http://arxiv.org/abs/2204.09868v1
- Date: Thu, 21 Apr 2022 03:53:19 GMT
- Title: Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote
Sensing Image Retrieval
- Authors: Zhiqiang Yuan, Wenkai Zhang, Kun Fu, Xuan Li, Chubo Deng, Hongqi Wang,
and Xian Sun
- Abstract summary: Cross-modal text-image retrieval has attracted extensive attention for its advantages of flexible input and efficient query.
To cope with the problems of multi-scale scarcity and target redundancy in the RS multimodal retrieval task, we propose a novel asymmetric multimodal feature matching network (AMFMN).
Our model adapts to multi-scale feature inputs, favors multi-source retrieval methods, and can dynamically filter redundant features.
- Score: 21.05804942940532
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Remote sensing (RS) cross-modal text-image retrieval has attracted extensive
attention for its advantages of flexible input and efficient query. However,
traditional methods ignore the characteristics of multi-scale and redundant
targets in RS images, leading to degraded retrieval accuracy. To cope with
the problems of multi-scale scarcity and target redundancy in the RS
multimodal retrieval task, we propose a novel asymmetric multimodal feature
matching network (AMFMN). Our model adapts to multi-scale feature inputs,
favors multi-source retrieval methods, and can dynamically filter redundant
features. AMFMN employs a multi-scale visual self-attention (MVSA) module to
extract the salient features of RS images and uses those visual features to
guide the text representation. Furthermore, to alleviate the ambiguity of
positive samples caused by strong intraclass similarity among RS images, we
propose a triplet loss function with a dynamic variable margin based on the
prior similarity of sample pairs. Finally, unlike traditional RS image-text
datasets with coarse text and high intraclass similarity, we construct a
fine-grained and more challenging Remote sensing Image-Text Match dataset
(RSITMD), which supports RS image retrieval through keywords and sentences,
both separately and jointly. Experiments on four RS text-image datasets
demonstrate that the proposed model achieves state-of-the-art performance on
the cross-modal RS text-image retrieval task.
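As a rough illustration of the multi-scale machinery above, the sketch below shows one way a module in the spirit of MVSA could project feature maps from several scales into a common token sequence and let self-attention down-weight redundant regions before pooling to a single visual vector. This is a minimal PyTorch sketch under assumed names, layer choices, and dimensions, not the authors' implementation.

    # Hypothetical sketch of a multi-scale visual self-attention module;
    # class name, layer choices, and dimensions are illustrative assumptions.
    import torch
    import torch.nn as nn

    class MultiScaleSelfAttention(nn.Module):
        def __init__(self, in_dims=(512, 1024, 2048), dim=512, heads=8):
            super().__init__()
            # Project each scale's feature map to a shared embedding size.
            self.proj = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in in_dims])
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, feats):  # feats: list of (B, C_i, H_i, W_i) maps
            tokens = [p(f).flatten(2).transpose(1, 2)  # (B, H_i*W_i, dim)
                      for f, p in zip(feats, self.proj)]
            x = torch.cat(tokens, dim=1)   # one token per location per scale
            y, _ = self.attn(x, x, x)      # salient regions reinforce each other
            y = self.norm(x + y)
            return y.mean(dim=1)           # (B, dim) pooled visual feature

    # Usage with dummy feature maps from three backbone stages:
    feats = [torch.randn(2, c, s, s) for c, s in [(512, 28), (1024, 14), (2048, 7)]]
    v = MultiScaleSelfAttention()(feats)   # -> torch.Size([2, 512])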
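The dynamic variable margin is defined in terms of the prior similarity of sample pairs, but its exact form is not given in this summary. The sketch below therefore assumes a simple linear schedule that shrinks the margin for negatives whose prior similarity to the anchor is high, illustrating how ambiguous negatives can be penalized more gently; the schedule is an assumption, not the paper's formulation.

    # Hypothetical dynamic-margin triplet loss; the linear margin schedule
    # is an assumption, not the paper's exact formulation.
    import torch
    import torch.nn.functional as F

    def dynamic_margin_triplet(anchor, positive, negative, prior_sim,
                               base_margin=0.2):
        """prior_sim: (B,) precomputed anchor/negative similarity in [0, 1]."""
        d_ap = 1 - F.cosine_similarity(anchor, positive)  # (B,) distances
        d_an = 1 - F.cosine_similarity(anchor, negative)
        margin = base_margin * (1 - prior_sim)  # similar negative -> smaller margin
        return F.relu(d_ap - d_an + margin).mean()

    # Usage on random embeddings:
    a, p, n = (torch.randn(4, 512) for _ in range(3))
    loss = dynamic_margin_triplet(a, p, n, prior_sim=torch.rand(4))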
Related papers
- MMM-RS: A Multi-modal, Multi-GSD, Multi-scene Remote Sensing Dataset and Benchmark for Text-to-Image Generation [25.252173311925027]
We propose a Multi-modal, Multi-GSD, Multi-scene Remote Sensing (MMM-RS) dataset and benchmark for text-to-image generation in diverse remote sensing scenarios.
We utilize a large-scale pretrained vision-language model to automatically generate text prompts, followed by hand-crafted rectification, resulting in information-rich text-image pairs.
After extensive manual screening and annotation refinement, we ultimately obtain the MMM-RS dataset, comprising approximately 2.1 million text-image pairs.
arXiv Detail & Related papers (2024-10-26T11:19:07Z)
- OpticalRS-4M: Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset [66.15872913664407]
We present a new pre-training pipeline for RS models, featuring the creation of a large-scale RS dataset and an efficient masked image modeling (MIM) approach, SelectiveMAE.
We curated a high-quality dataset named OpticalRS-4M by collecting publicly available RS datasets and processing them through exclusion, slicing, and deduplication.
Experiments demonstrate that OpticalRS-4M significantly improves classification, detection, and segmentation performance, while SelectiveMAE more than doubles training efficiency.
arXiv Detail & Related papers (2024-06-17T15:41:57Z)
- Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval [37.775529830620016]
Remote Sensing Image-Text Retrieval (RSITR) is pivotal for knowledge services and data mining in the remote sensing (RS) domain.
Current multi-scale RSITR approaches typically align fused multi-scale image features with text features, but overlook aligning image-text pairs at each distinct scale separately.
We introduce a novel Multi-Scale Alignment (MSA) method to overcome this limitation (a generic sketch of per-scale alignment appears after this list).
arXiv Detail & Related papers (2024-05-29T10:19:11Z)
- Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
- Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning [49.48946808024608]
We propose a novel two-stage vision-language pre-training-based approach to bootstrap interactive image-text alignment for remote sensing image captioning, called BITA.
Specifically, the first stage involves preliminary alignment through image-text contrastive learning.
In the second stage, the interactive Fourier Transformer connects the frozen image encoder with a large language model.
arXiv Detail & Related papers (2023-12-02T17:32:17Z)
- Learning Enriched Features for Fast Image Restoration and Enhancement [166.17296369600774]
This paper pursues the holistic goal of maintaining spatially precise, high-resolution representations throughout the entire network.
We learn an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
Our approach achieves state-of-the-art results for a variety of image processing tasks, including defocus deblurring, image denoising, super-resolution, and image enhancement.
arXiv Detail & Related papers (2022-04-19T17:59:45Z)
- Unsupervised Contrastive Hashing for Cross-Modal Retrieval in Remote Sensing [1.6758573326215689]
Cross-modal text-image retrieval has attracted great attention in remote sensing.
We introduce a novel deep unsupervised contrastive hashing (DUCH) method for text-image retrieval in RS (a generic sketch of contrastive hashing appears after this list).
Experimental results show that the proposed DUCH outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-04-19T07:25:25Z)
- Deep Unsupervised Contrastive Hashing for Large-Scale Cross-Modal Text-Image Retrieval in Remote Sensing [1.6758573326215689]
We introduce a novel deep unsupervised contrastive hashing (DUCH) method for large-scale cross-modal RS text-image retrieval.
Experimental results show that the proposed DUCH outperforms state-of-the-art unsupervised cross-modal hashing methods.
Our code is publicly available at https://git.tu-berlin.de/rsim/duch.
arXiv Detail & Related papers (2022-01-20T12:05:10Z)
- Cross-Modality Sub-Image Retrieval using Contrastive Multimodal Image Representations [3.3754780158324564]
Cross-modality image retrieval is challenging, since images of similar (or even the same) content captured by different modalities might share few common structures.
We propose a new application-independent content-based image retrieval system for reverse (sub-)image search across modalities.
arXiv Detail & Related papers (2022-01-10T19:04:28Z)
- Learning Enriched Features for Real Image Restoration and Enhancement [166.17296369600774]
Convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches to image restoration.
We present a novel architecture with the collective goals of maintaining spatially-precise high-resolution representations through the entire network.
Our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
arXiv Detail & Related papers (2020-03-15T11:04:30Z)
- DDet: Dual-path Dynamic Enhancement Network for Real-World Image Super-Resolution [69.2432352477966]
Real image super-resolution (Real-SR) focuses on the relationship between real-world high-resolution (HR) and low-resolution (LR) images.
In this article, we propose a Dual-path Dynamic Enhancement Network (DDet) for Real-SR.
Unlike conventional methods that stack up massive convolutional blocks for feature representation, we introduce a content-aware framework to study non-inherently aligned image pairs.
arXiv Detail & Related papers (2020-02-25T18:24:51Z)
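For the Multi-Scale Alignment (MSA) entry above, aligning image-text pairs at each scale separately (rather than only after fusion) can be written as one contrastive term per scale. The function below is a generic, hypothetical reading of that idea, not the MSA paper's actual loss.

    # Hypothetical per-scale image-text alignment; a generic reading of
    # multi-scale alignment, not the MSA paper's objective.
    import torch
    import torch.nn.functional as F

    def per_scale_alignment_loss(scale_feats, txt, temperature=0.07):
        """scale_feats: list of (B, D) image embeddings, one per scale.
        txt: (B, D) text embeddings; matching pairs share a batch index."""
        zt = F.normalize(txt, dim=1)
        labels = torch.arange(len(txt))
        loss = 0.0
        for f in scale_feats:  # one alignment term per scale, no fusion
            logits = F.normalize(f, dim=1) @ zt.t() / temperature
            loss = loss + F.cross_entropy(logits, labels)
        return loss / len(scale_feats)

    # Usage: three scales of dummy image embeddings against one text batch.
    loss = per_scale_alignment_loss([torch.randn(8, 512) for _ in range(3)],
                                    torch.randn(8, 512))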
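For the two DUCH entries, unsupervised cross-modal contrastive hashing is commonly built from an InfoNCE-style cross-modal loss over soft codes plus a quantization penalty that pushes codes toward {-1, +1}. The sketch below is that generic recipe under assumed names, not DUCH's exact objective; the authors' code is at https://git.tu-berlin.de/rsim/duch.

    # Generic contrastive hashing sketch (hypothetical names), not DUCH itself.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HashHead(nn.Module):
        """Maps an embedding to soft codes in (-1, 1); thresholding the
        tanh output at 0 yields binary hash codes at retrieval time."""
        def __init__(self, in_dim=512, bits=64):
            super().__init__()
            self.fc = nn.Linear(in_dim, bits)

        def forward(self, x):
            return torch.tanh(self.fc(x))

    def contrastive_hash_loss(img_codes, txt_codes, temperature=0.1, quant_w=0.1):
        zi, zt = F.normalize(img_codes, dim=1), F.normalize(txt_codes, dim=1)
        logits = zi @ zt.t() / temperature        # (B, B) cross-modal similarities
        labels = torch.arange(len(zi))            # matching pairs on the diagonal
        nce = 0.5 * (F.cross_entropy(logits, labels) +
                     F.cross_entropy(logits.t(), labels))
        # Push soft codes toward {-1, +1} so binarization loses little.
        quant = ((img_codes.abs() - 1) ** 2).mean() + ((txt_codes.abs() - 1) ** 2).mean()
        return nce + quant_w * quant

    # Usage with dummy image/text embeddings:
    head_i, head_t = HashHead(), HashHead()
    loss = contrastive_hash_loss(head_i(torch.randn(8, 512)),
                                 head_t(torch.randn(8, 512)))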