Modality-Aware Triplet Hard Mining for Zero-shot Sketch-Based Image
Retrieval
- URL: http://arxiv.org/abs/2112.07966v2
- Date: Thu, 16 Dec 2021 02:13:22 GMT
- Title: Modality-Aware Triplet Hard Mining for Zero-shot Sketch-Based Image
Retrieval
- Authors: Zongheng Huang, YiFan Sun, Chuchu Han, Changxin Gao, Nong Sang
- Abstract summary: This paper tackles the Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) problem from the viewpoint of cross-modality metric learning.
By combining two fundamental learning approaches in DML, i.e., classification training and pairwise training, we set up a strong baseline for ZS-SBIR.
We show that Modality-Aware Triplet Hard Mining (MATHM) enhances the baseline with three types of pairwise learning.
- Score: 51.42470171051007
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper tackles the Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR)
problem from the viewpoint of cross-modality metric learning. This task has two
characteristics: 1) the zero-shot setting requires a metric space with good
within-class compactness and between-class discrepancy for recognizing
novel classes, and 2) the sketch query and the photo gallery are in different
modalities. The metric learning viewpoint benefits ZS-SBIR from two aspects.
First, it facilitates improvement through recent good practices in deep metric
learning (DML). By combining two fundamental learning approaches in DML, i.e.,
classification training and pairwise training, we set up a strong baseline for
ZS-SBIR. Without bells and whistles, this baseline achieves competitive
retrieval accuracy. Second, it provides an insight that properly suppressing
the modality gap is critical. To this end, we design a novel method named
Modality-Aware Triplet Hard Mining (MATHM). MATHM enhances the baseline with
three types of pairwise learning, i.e., a cross-modality sample pair, a
within-modality sample pair, and their combination. We also design an adaptive
weighting method to dynamically balance these three components during training.
Experimental results confirm that MATHM brings another round of significant
improvement based on the strong baseline and sets up new state-of-the-art
performance. For example, on the TU-Berlin dataset, we achieve 47.88% (+2.94%)
mAP@all and 58.28% (+2.34%) Prec@100. Code will be publicly available at:
https://github.com/huangzongheng/MATHM.
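
To make the abstract's recipe concrete, the following is a minimal PyTorch sketch of batch-hard triplet mining applied to the three pair types named above (cross-modality, within-modality, and combined). It is an illustrative reading of the abstract, not the authors' released code: function names are invented, and the adaptive weighting is replaced by fixed equal weights.

    import torch
    import torch.nn.functional as F

    def hard_triplet_loss(anchors, a_labels, gallery, g_labels, margin=0.3):
        # Batch-hard mining: for each anchor, take the farthest positive and
        # the closest negative in the gallery.
        dist = torch.cdist(anchors, gallery)                   # (A, G) L2 distances
        same = a_labels.unsqueeze(1) == g_labels.unsqueeze(0)  # (A, G) same-class mask
        d_ap = (dist - 1e9 * (~same)).max(dim=1).values        # hardest positive
        d_an = (dist + 1e9 * same).min(dim=1).values           # hardest negative
        return F.relu(d_ap - d_an + margin).mean()

    def mathm_loss(sk, sk_y, ph, ph_y, margin=0.3):
        # Cross-modality pairs: sketches mined against photos, and vice versa.
        cross = (hard_triplet_loss(sk, sk_y, ph, ph_y, margin)
                 + hard_triplet_loss(ph, ph_y, sk, sk_y, margin))
        # Within-modality pairs: each modality mined against itself.
        within = (hard_triplet_loss(sk, sk_y, sk, sk_y, margin)
                  + hard_triplet_loss(ph, ph_y, ph, ph_y, margin))
        # Combined pairs: mining over the union of both modalities.
        feats = torch.cat([sk, ph])
        labels = torch.cat([sk_y, ph_y])
        combined = hard_triplet_loss(feats, labels, feats, labels, margin)
        # The paper balances the three terms adaptively; equal weights here.
        return cross + within + combined

    # Toy usage: 8 sketches and 8 photos over 4 classes, 64-D embeddings.
    sk = F.normalize(torch.randn(8, 64), dim=1)
    ph = F.normalize(torch.randn(8, 64), dim=1)
    y = torch.arange(4).repeat(2)
    print(mathm_loss(sk, y, ph, y))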
Related papers
- LIP-Loc: LiDAR Image Pretraining for Cross-Modal Localization [0.9562145896371785]
We apply Contrastive Language-Image Pre-Training to the domains of 2D images and 3D LiDAR points on the task of cross-modal localization.
Our method outperforms state-of-the-art recall@1 accuracy on the KITTI-360 dataset by 22.4%, using only perspective images.
We also demonstrate the zero-shot capabilities of our model and we beat SOTA by 8% without even training on it.
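
As a rough illustration of the CLIP-style objective this entry builds on, here is a generic symmetric InfoNCE loss over paired image and LiDAR embeddings; it is a textbook sketch under assumed inputs, not the LIP-Loc implementation, and the temperature value is an arbitrary choice.

    import torch
    import torch.nn.functional as F

    def symmetric_infonce(img_emb, lidar_emb, temperature=0.07):
        # Cosine-similarity logits between every image and every LiDAR scan.
        img_emb = F.normalize(img_emb, dim=1)
        lidar_emb = F.normalize(lidar_emb, dim=1)
        logits = img_emb @ lidar_emb.t() / temperature   # (B, B)
        targets = torch.arange(img_emb.size(0))          # i-th image matches i-th scan
        # Cross-entropy in both retrieval directions, as in CLIP.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))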
arXiv Detail & Related papers (2023-12-27T17:23:57Z)
- Symmetrical Bidirectional Knowledge Alignment for Zero-Shot Sketch-Based Image Retrieval [69.46139774646308]
This paper studies the problem of zero-shot sketch-based image retrieval (ZS-SBIR).
It aims to use sketches from unseen categories as queries to match the images of the same category.
We propose a novel Symmetrical Bidirectional Knowledge Alignment for zero-shot sketch-based image retrieval (SBKA).
arXiv Detail & Related papers (2023-12-16T04:50:34Z)
- Two-Stage Triplet Loss Training with Curriculum Augmentation for Audio-Visual Retrieval [3.164991885881342]
Cross-modal retrieval models learn robust embedding spaces.
We introduce a novel approach rooted in curriculum learning to address this problem.
We propose a two-stage training paradigm that guides the model's learning process from semi-hard to hard triplets.
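
A minimal sketch of what such a semi-hard-to-hard negative-mining schedule could look like; the function name, fallback rule, and margin are assumptions for illustration, not details taken from the paper.

    import torch
    import torch.nn.functional as F

    def staged_triplet_loss(anchor, positive, gallery, neg_mask, margin=0.2, hard=False):
        # anchor, positive: (B, D) paired embeddings; gallery: (G, D);
        # neg_mask: (B, G) True where a gallery item is a valid negative.
        d_ap = (anchor - positive).norm(dim=1, keepdim=True)   # (B, 1)
        d_an = torch.cdist(anchor, gallery)                    # (B, G)
        if hard:
            # Stage 2: hardest (closest) negative overall.
            d_neg = (d_an + 1e9 * (~neg_mask)).min(dim=1).values
        else:
            # Stage 1: semi-hard negatives, i.e. farther than the positive but
            # still violating the margin; fall back to hardest when none exist.
            semi = neg_mask & (d_an > d_ap) & (d_an < d_ap + margin)
            d_semi = (d_an + 1e9 * (~semi)).min(dim=1).values
            d_hard = (d_an + 1e9 * (~neg_mask)).min(dim=1).values
            d_neg = torch.where(semi.any(dim=1), d_semi, d_hard)
        return F.relu(d_ap.squeeze(1) - d_neg + margin).mean()

    # A curriculum then trains with hard=False first, switching to hard=True later.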
arXiv Detail & Related papers (2023-10-20T12:35:54Z)
- Noisy-Correspondence Learning for Text-to-Image Person Re-identification [50.07634676709067]
We propose a novel Robust Dual Embedding method (RDE) to learn robust visual-semantic associations even with noisy correspondences.
Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on three datasets.
arXiv Detail & Related papers (2023-08-19T05:34:13Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can reduce the need of Vision Transformer networks for very large fully-annotated datasets.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- Class Anchor Margin Loss for Content-Based Image Retrieval [97.81742911657497]
We propose a novel repeller-attractor loss that falls in the metric learning paradigm, yet directly optimizes the L2 metric without the need to generate pairs.
We evaluate the proposed objective in the context of few-shot and full-set training on the CBIR task, using both convolutional and transformer architectures.
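
Read literally, a repeller-attractor objective with learnable class anchors might look like the sketch below; this is an assumption-heavy paraphrase for illustration, not the paper's actual loss.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ClassAnchorMargin(nn.Module):
        def __init__(self, num_classes, dim, margin=1.0):
            super().__init__()
            self.anchors = nn.Parameter(torch.randn(num_classes, dim))
            self.margin = margin

        def forward(self, feats, labels):
            d = torch.cdist(feats, self.anchors)                   # (B, C) L2 distances
            # Attractor: pull each embedding toward its own class anchor.
            attract = d.gather(1, labels.unsqueeze(1)).squeeze(1)
            # Repeller: push embeddings beyond a margin from other anchors.
            mask = F.one_hot(labels, num_classes=d.size(1)).bool()
            repel = F.relu(self.margin - d[~mask]).mean()
            return attract.mean() + repel

    # Toy usage:
    cam = ClassAnchorMargin(num_classes=10, dim=64)
    print(cam(torch.randn(4, 64), torch.tensor([0, 3, 3, 9])))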
arXiv Detail & Related papers (2023-06-01T12:53:10Z)
- Un-Mix: Rethinking Image Mixtures for Unsupervised Visual Representation Learning [108.999497144296]
Recent unsupervised learning approaches use a Siamese-like framework to compare two "views" of the same image for learning representations.
This work brings the concept of distance in label space into unsupervised learning, making the model aware of the soft degree of similarity between positive or negative pairs.
Despite its conceptual simplicity, we show empirically that with the proposed solution, Unsupervised image mixtures (Un-Mix), we can learn subtler, more robust and generalized representations from the transformed input and corresponding new label space.
arXiv Detail & Related papers (2020-03-11T17:59:04Z)
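
To make the mixture idea concrete, here is a simplified sketch of mixing a batch with its reversed order in one Siamese branch and weighting the objective by the mixing coefficient; it simplifies Un-Mix considerably, and the Beta parameter is an assumption.

    import torch

    def unmix_batch(x, alpha=1.0):
        # Mix image i with image (B-1-i); lam controls the soft similarity degree.
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        return lam * x + (1.0 - lam) * x.flip(0), lam

    # In the Siamese loss, the mixed branch is then compared against both
    # orderings of the other branch, weighted by lam:
    #   loss = lam * sim_loss(f(mixed), f(view2)) \
    #        + (1 - lam) * sim_loss(f(mixed), f(view2.flip(0)))
    # so the target similarity in "label space" tracks the pixel-space mixture.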
This list is automatically generated from the titles and abstracts of the papers on this site.