Deep Unsupervised Contrastive Hashing for Large-Scale Cross-Modal
Text-Image Retrieval in Remote Sensing
- URL: http://arxiv.org/abs/2201.08125v1
- Date: Thu, 20 Jan 2022 12:05:10 GMT
- Title: Deep Unsupervised Contrastive Hashing for Large-Scale Cross-Modal
Text-Image Retrieval in Remote Sensing
- Authors: Georgii Mikriukov, Mahdyar Ravanbakhsh, Begüm Demir
- Abstract summary: We introduce a novel deep unsupervised cross-modal contrastive hashing (DUCH) method for RS text-image retrieval.
Experimental results show that the proposed DUCH outperforms state-of-the-art unsupervised cross-modal hashing methods.
Our code is publicly available at https://git.tu-berlin.de/rsim/duch.
- Score: 1.6758573326215689
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the availability of large-scale multi-modal data archives
(e.g., satellite images acquired by different sensors, text sentences, etc.), the
development of cross-modal retrieval systems that can search and retrieve
semantically relevant data across different modalities based on a query in any
modality has attracted great attention in RS. In this paper, we focus our
attention on cross-modal text-image retrieval, where queries from one modality
(e.g., text) can be matched to archive entries from another (e.g., image). Most
of the existing cross-modal text-image retrieval systems require a high number
of labeled training samples and also do not allow fast and memory-efficient
retrieval due to their intrinsic characteristics. These issues limit the
applicability of the existing cross-modal retrieval systems for large-scale
applications in RS. To address these problems, we introduce a novel
deep unsupervised cross-modal contrastive hashing (DUCH) method for RS
text-image retrieval. The proposed DUCH is made up of two main modules: 1)
feature extraction module (which extracts deep representations of the
text-image modalities); and 2) hashing module (which learns to generate
cross-modal binary hash codes from the extracted representations). Within the
hashing module, we introduce a novel multi-objective loss function including:
i) contrastive objectives that preserve both intra- and inter-modal
similarities; ii) an adversarial objective enforced across the two modalities
for cross-modal representation consistency; and iii)
binarization objectives for generating representative hash codes. Experimental
results show that the proposed DUCH outperforms state-of-the-art unsupervised
cross-modal hashing methods on two multi-modal (image and text) benchmark
archives in RS. Our code is publicly available at
https://git.tu-berlin.de/rsim/duch.
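To make the loss design above concrete, the following PyTorch sketch assembles a multi-objective hashing loss with the three kinds of terms listed in the abstract: contrastive, adversarial, and binarization. It is a minimal illustration under stated assumptions (NT-Xent-style contrastive terms, a label-flipping adversarial term against a modality discriminator, and a sign-quantization term); all function names, arguments, and weights are hypothetical and do not reproduce the DUCH implementation in the linked repository.
```python
# A minimal PyTorch sketch of a multi-objective hashing loss of the kind the
# abstract describes (contrastive + adversarial + binarization terms). The
# NT-Xent formulation, the label-flipping adversarial term, the loss weights,
# and all names below are illustrative assumptions, not the authors' code.
import torch
import torch.nn.functional as F


def ntxent(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Contrastive (NT-Xent-style) loss: row i of `a` and row i of `b` are positives."""
    a = F.normalize(a, dim=1)
    b = F.normalize(b, dim=1)
    logits = (a @ b.t()) / temperature                   # (N, N) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)   # diagonal = matching pairs
    return F.cross_entropy(logits, targets)


def multi_objective_hash_loss(img_codes, img_codes_aug, txt_codes, txt_codes_aug,
                              modality_disc, weights=(1.0, 1.0, 0.1, 0.1)):
    """Inputs are continuous hash-like codes in (-1, 1) of shape (N, K); the *_aug
    codes come from augmented views, and `modality_disc` maps a code to one logit
    ("does this code come from the image branch?")."""
    w_inter, w_intra, w_adv, w_bin = weights

    # i) contrastive objectives: inter-modal (image <-> text) and
    #    intra-modal (each modality against an augmented view of itself)
    inter = ntxent(img_codes, txt_codes)
    intra = ntxent(img_codes, img_codes_aug) + ntxent(txt_codes, txt_codes_aug)

    # ii) adversarial objective (generator side): push codes from both modalities
    #     toward being indistinguishable by flipping the discriminator's labels
    img_logit = modality_disc(img_codes)
    txt_logit = modality_disc(txt_codes)
    adv = F.binary_cross_entropy_with_logits(img_logit, torch.zeros_like(img_logit)) \
        + F.binary_cross_entropy_with_logits(txt_logit, torch.ones_like(txt_logit))

    # iii) binarization objective: quantization loss pulling codes toward {-1, +1}
    #      so that sign() at retrieval time discards little information
    binar = (img_codes.sign() - img_codes).pow(2).mean() \
          + (txt_codes.sign() - txt_codes).pow(2).mean()

    return w_inter * inter + w_intra * intra + w_adv * adv + w_bin * binar
```
At retrieval time the continuous outputs would be binarized with torch.sign and compared by Hamming distance, which is what makes hashing-based retrieval fast and memory-efficient; a separate discriminator update (not shown) would train modality_disc to tell the two modalities apart.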
Related papers
- ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
We propose a pioneering generAtive Cross-modal rEtrieval framework (ACE) for end-to-end cross-modal retrieval.
ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average.
arXiv Detail & Related papers (2024-06-25T12:47:04Z)
- Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification [64.36210786350568]
We propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID.
Our framework can generate more discriminative features for multi-modal object ReID.
arXiv Detail & Related papers (2024-03-15T12:44:35Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- EDIS: Entity-Driven Image Search over Multimodal Web Content [95.40238328527931]
We introduce Entity-Driven Image Search (EDIS), a dataset for cross-modal image search in the news domain.
EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description.
arXiv Detail & Related papers (2023-05-23T02:59:19Z)
- Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval [21.05804942940532]
Cross-modal text-image retrieval has attracted extensive attention for its advantages of flexible input and efficient query.
To cope with the problem of multi-scale scarcity and target redundancy in the RS multimodal retrieval task, we propose a novel asymmetric multimodal feature matching network (AMFMN).
Our model adapts to multi-scale feature inputs, favors multi-source retrieval methods, and can dynamically filter redundant features.
arXiv Detail & Related papers (2022-04-21T03:53:19Z)
- Unsupervised Contrastive Hashing for Cross-Modal Retrieval in Remote Sensing [1.6758573326215689]
Cross-modal text-image retrieval has attracted great attention in remote sensing.
We introduce a novel unsupervised cross-modal contrastive hashing (DUCH) method for text-image retrieval in RS.
Experimental results show that the proposed DUCH outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-04-19T07:25:25Z)
- An Unsupervised Cross-Modal Hashing Method Robust to Noisy Training Image-Text Correspondences in Remote Sensing [1.6758573326215689]
Cross-modal image-text retrieval methods have attracted great attention in remote sensing.
Most of the existing methods assume that a reliable multi-modal training set with accurately matched text-image pairs exists.
We propose a novel unsupervised cross-modal hashing method that is robust to noisy image-text correspondences (CHNR).
Experimental results show that the proposed CHNR outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-02-26T11:22:24Z)
- Efficient Cross-Modal Retrieval via Deep Binary Hashing and Quantization [5.799838997511804]
Cross-modal retrieval aims to search for data with similar semantic meanings across different content modalities.
We propose a jointly learned deep hashing and quantization network (HQ) for cross-modal retrieval.
Experimental results on the NUS-WIDE, MIR-Flickr, and Amazon datasets demonstrate that HQ achieves boosts of more than 7% in precision.
arXiv Detail & Related papers (2022-02-15T22:00:04Z)
- Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval [80.35589927511667]
Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image.
We propose a novel fine-tuning framework which turns any pretrained text-image multi-modal model into an efficient retrieval model.
Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups, demonstrate improved accuracy and huge efficiency benefits over the state-of-the-art cross-encoders.
arXiv Detail & Related papers (2021-03-22T15:08:06Z)
- Unsupervised Deep Cross-modality Spectral Hashing [65.3842441716661]
The framework is a two-step hashing approach which decouples the optimization into binary optimization and hashing function learning.
We propose a novel spectral embedding-based algorithm to simultaneously learn single-modality and binary cross-modality representations.
We leverage powerful CNNs for images and propose a CNN-based deep architecture to learn the text modality.
arXiv Detail & Related papers (2020-08-01T09:20:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.