An Unsupervised Cross-Modal Hashing Method Robust to Noisy Training
Image-Text Correspondences in Remote Sensing
- URL: http://arxiv.org/abs/2202.13117v1
- Date: Sat, 26 Feb 2022 11:22:24 GMT
- Title: An Unsupervised Cross-Modal Hashing Method Robust to Noisy Training
Image-Text Correspondences in Remote Sensing
- Authors: Georgii Mikriukov, Mahdyar Ravanbakhsh, Begüm Demir
- Abstract summary: Cross-modal image-text retrieval methods have attracted great attention in remote sensing.
Most existing methods assume that a reliable multi-modal training set with accurately matched text-image pairs is available.
We propose a novel unsupervised cross-modal hashing method robust to noisy image-text correspondences (CHNR).
Experimental results show that the proposed CHNR outperforms state-of-the-art methods.
- Score: 1.6758573326215689
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of accurate and scalable cross-modal image-text retrieval
methods, where queries from one modality (e.g., text) can be matched to archive
entries from another (e.g., remote sensing image) has attracted great attention
in remote sensing (RS). Most of the existing methods assume that a reliable
multi-modal training set with accurately matched text-image pairs is available.
However, this assumption may not always hold since the multi-modal training
sets may include noisy pairs (i.e., textual descriptions/captions associated with
training images can be noisy), distorting the learning process of the retrieval
methods. To address this problem, we propose a novel unsupervised cross-modal
hashing method robust to the noisy image-text correspondences (CHNR). CHNR
consists of three modules: 1) a feature extraction module, which extracts feature
representations of image-text pairs; 2) a noise detection module, which detects
potential noisy correspondences; and 3) a hashing module, which generates
cross-modal binary hash codes. The proposed CHNR includes two training phases:
i) a meta-learning phase, which uses a small portion of clean (i.e., reliable) data
to train the noise detection module in an adversarial fashion; and ii) a main
training phase, in which the trained noise detection module is used to identify
noisy correspondences while the hashing module is trained on the noisy
multi-modal training set. Experimental results show that the proposed CHNR
outperforms state-of-the-art methods. Our code is publicly available at
https://git.tu-berlin.de/rsim/chnr
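To make the two-phase scheme above concrete, the following is a minimal PyTorch sketch of the main training phase (phase ii), in which the already-trained noise detection module scores each image-text pair and that score down-weights the hashing loss. All names (NoiseDetector, HashingModule), dimensions, and the cosine-based loss are illustrative assumptions, not the authors' implementation; the adversarial meta-learning phase (phase i) and the actual CHNR objectives are in the linked repository.

# Minimal sketch of a CHNR-style main training phase (illustrative only;
# module designs, sizes, and loss choices below are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseDetector(nn.Module):
    """Scores an image-text feature pair; ~1 = clean, ~0 = noisy (assumed design)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, img_feat, txt_feat):
        return torch.sigmoid(self.net(torch.cat([img_feat, txt_feat], dim=-1)))

class HashingModule(nn.Module):
    """Maps modality features to continuous codes; sign() yields binary hash codes."""
    def __init__(self, dim, hash_bits):
        super().__init__()
        self.img_head = nn.Linear(dim, hash_bits)
        self.txt_head = nn.Linear(dim, hash_bits)

    def forward(self, img_feat, txt_feat):
        return torch.tanh(self.img_head(img_feat)), torch.tanh(self.txt_head(txt_feat))

def main_training_step(detector, hasher, img_feat, txt_feat, optimizer):
    """Phase ii (sketch): weight a cross-modal alignment loss by detector scores."""
    with torch.no_grad():                      # detector was trained in phase i
        w = detector(img_feat, txt_feat).squeeze(-1)
    h_img, h_txt = hasher(img_feat, txt_feat)
    # Per-pair alignment loss, down-weighted for likely-noisy correspondences.
    per_pair = 1.0 - F.cosine_similarity(h_img, h_txt, dim=-1)
    loss = (w * per_pair).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    dim, bits = 512, 64
    detector, hasher = NoiseDetector(dim), HashingModule(dim, bits)
    opt = torch.optim.Adam(hasher.parameters(), lr=1e-4)
    img = torch.randn(32, dim)                 # stand-ins for extracted features
    txt = torch.randn(32, dim)
    print("loss:", main_training_step(detector, hasher, img, txt, opt))

The tanh outputs serve as a differentiable relaxation of binary codes; at retrieval time one would typically binarize with torch.sign to obtain the hash codes.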
Related papers
- Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
- Semi-supervised Text-based Person Search [47.14739994781334]
Existing methods rely on massive amounts of annotated image-text data to achieve satisfactory performance under fully-supervised learning.
We present a two-stage basic solution based on generation-then-retrieval for semi-supervised TBPS.
We propose a noise-robust retrieval framework that enhances the ability of the retrieval model to handle noisy data.
arXiv Detail & Related papers (2024-04-28T07:47:52Z)
- Noisy Pair Corrector for Dense Retrieval [59.312376423104055]
We propose a novel approach called Noisy Pair Corrector (NPC).
NPC consists of a detection module and a correction module.
We conduct experiments on the text-retrieval benchmarks Natural Questions and TriviaQA and the code-search benchmarks StaQC and SO-DS.
arXiv Detail & Related papers (2023-11-07T08:27:14Z)
- Noisy-Correspondence Learning for Text-to-Image Person Re-identification [50.07634676709067]
We propose a novel Robust Dual Embedding method (RDE) to learn robust visual-semantic associations even with noisy correspondences.
Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on three datasets.
arXiv Detail & Related papers (2023-08-19T05:34:13Z)
- Multi-scale Transformer Network with Edge-aware Pre-training for Cross-Modality MR Image Synthesis [52.41439725865149]
Cross-modality magnetic resonance (MR) image synthesis can be used to generate missing modalities from given ones.
Existing (supervised learning) methods often require a large amount of paired multi-modal data to train an effective synthesis model.
We propose a Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for cross-modality MR image synthesis.
arXiv Detail & Related papers (2022-12-02T11:40:40Z)
- Unsupervised Contrastive Hashing for Cross-Modal Retrieval in Remote Sensing [1.6758573326215689]
Cross-modal text-image retrieval has attracted great attention in remote sensing.
We introduce a novel unsupervised cross-modal contrastive hashing (DUCH) method for text-image retrieval in RS.
Experimental results show that the proposed DUCH outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-04-19T07:25:25Z)
- BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning [88.82371069668147]
BatchFormerV2 is a more general batch Transformer module, which enables exploring sample relationships for dense representation learning.
BatchFormerV2 consistently improves current DETR-based detection methods by over 1.3%.
arXiv Detail & Related papers (2022-04-04T05:53:42Z)
- Deep Unsupervised Contrastive Hashing for Large-Scale Cross-Modal Text-Image Retrieval in Remote Sensing [1.6758573326215689]
We introduce a novel deep unsupervised cross-modal contrastive hashing (DUCH) method for RS text-image retrieval.
Experimental results show that the proposed DUCH outperforms state-of-the-art unsupervised cross-modal hashing methods.
Our code is publicly available at https://git.tu-berlin.de/rsim/duch.
arXiv Detail & Related papers (2022-01-20T12:05:10Z)
- MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding [40.24656027709833]
We propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query.
We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model.
Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR.
arXiv Detail & Related papers (2021-04-26T17:55:33Z)
- Unsupervised Deep Cross-modality Spectral Hashing [65.3842441716661]
The framework is a two-step hashing approach which decouples the optimization into binary optimization and hashing function learning.
We propose a novel spectral embedding-based algorithm to simultaneously learn single-modality and binary cross-modality representations.
We leverage powerful CNNs for images and propose a CNN-based deep architecture to learn the text modality.
arXiv Detail & Related papers (2020-08-01T09:20:11Z)
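The two-step decoupling described above (binary optimization first, hashing-function learning second) can be illustrated with a small stand-in sketch; the PCA-style projection and ridge regression below are assumptions used only to show the structure, not the paper's spectral algorithm or CNN architectures.

# Toy illustration of the decoupled two-step hashing idea (all choices here
# are stand-in assumptions, not the paper's method).
import numpy as np

rng = np.random.default_rng(0)
feats = rng.standard_normal((1000, 256))       # stand-in single-modality features
bits = 32

# Step 1: binary optimization (here: sign of a PCA projection as a crude
# stand-in for a spectral embedding).
centered = feats - feats.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
codes = np.sign(centered @ vt[:bits].T)        # (1000, 32) codes in {-1, +1}

# Step 2: hashing function learning (here: ridge regression from features to
# the fixed codes; a deep network would play this role in practice).
lam = 1e-2
w = np.linalg.solve(centered.T @ centered + lam * np.eye(256), centered.T @ codes)
pred = np.sign(centered @ w)
print("bit agreement:", (pred == codes).mean())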