A Recipe for Efficient SBIR Models: Combining Relative Triplet Loss with
Batch Normalization and Knowledge Distillation
- URL: http://arxiv.org/abs/2305.18988v1
- Date: Tue, 30 May 2023 12:41:04 GMT
- Title: A Recipe for Efficient SBIR Models: Combining Relative Triplet Loss with
Batch Normalization and Knowledge Distillation
- Authors: Omar Seddati, Nathan Hubens, Stéphane Dupont, Thierry Dutoit
- Abstract summary: Sketch-Based Image Retrieval (SBIR) is a crucial task in multimedia retrieval, where the goal is to retrieve a set of images that match a given sketch query.
We introduce a Relative Triplet Loss (RTL), an adapted triplet loss that overcomes limitations of the training data through loss weighting based on anchor similarity.
We propose a straightforward approach to train small models efficiently with a marginal loss of accuracy through knowledge distillation.
- Score: 3.364554138758565
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Sketch-Based Image Retrieval (SBIR) is a crucial task in multimedia
retrieval, where the goal is to retrieve a set of images that match a given
sketch query. Researchers have already proposed several well-performing
solutions for this task, but most focus on enhancing embedding through
different approaches such as triplet loss, quadruplet loss, adding data
augmentation, and using edge extraction. In this work, we tackle the problem
from various angles. We start by examining the training data quality and show
some of its limitations. Then, we introduce a Relative Triplet Loss (RTL), an
adapted triplet loss to overcome those limitations through loss weighting based
on anchor similarity. Through a series of experiments, we demonstrate that
replacing a triplet loss with RTL outperforms previous state-of-the-art without
the need for any data augmentation. In addition, we demonstrate why batch
normalization is better suited to SBIR embeddings than l2-normalization and show
that it significantly improves the performance of our models. We further
investigate the capacity of models required for the photo and sketch domains
and demonstrate that the photo encoder requires a higher capacity than the
sketch encoder, which validates the hypothesis formulated in [34]. Then, we
propose a straightforward approach to train small models, such as ShuffleNetv2
[22], efficiently with a marginal loss of accuracy through knowledge
distillation. The same approach used with larger models enabled us to
outperform previous state-of-the-art results and achieve a recall of 62.38% at
k = 1 on The Sketchy Database [30].
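The abstract reports retrieval quality as recall at k = 1. As a rough illustration of how such a metric is commonly computed, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function names, the choice of Euclidean distance, and the toy embeddings are all assumptions made for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recall_at_k(sketch_embs, photo_embs, true_photo_idx, k=1):
    """Fraction of sketch queries whose matching photo is among the k nearest photos.

    sketch_embs:    list of query embeddings (one per sketch)
    photo_embs:     list of gallery embeddings (one per photo)
    true_photo_idx: for each sketch, the index of its matching photo
    """
    hits = 0
    for query, target in zip(sketch_embs, true_photo_idx):
        # Rank all gallery photos by distance to the sketch embedding.
        ranked = sorted(
            range(len(photo_embs)),
            key=lambda i: euclidean(query, photo_embs[i]),
        )
        if target in ranked[:k]:
            hits += 1
    return hits / len(sketch_embs)


# Hypothetical toy embeddings, chosen only to exercise the function.
photos = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sketches = [[0.1, 0.0], [0.9, 0.1], [0.6, 0.55]]
targets = [0, 1, 2]

r1 = recall_at_k(sketches, photos, targets, k=1)
r2 = recall_at_k(sketches, photos, targets, k=2)
print(f"recall@1 = {r1:.4f}, recall@2 = {r2:.4f}")
```

In this toy setup the third sketch lies closer to the wrong photo, so it only counts as a hit once k is raised to 2. A real evaluation would use the embeddings produced by the trained sketch and photo encoders.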
Related papers
- Efficient One-Step Diffusion Refinement for Snapshot Compressive Imaging [8.819370643243012]
Coded Aperture Snapshot Spectral Imaging (CASSI) is a crucial technique for capturing three-dimensional multispectral images (MSIs).
Current state-of-the-art methods, predominantly end-to-end, face limitations in reconstructing high-frequency details.
This paper introduces a novel one-step Diffusion Probabilistic Model within a self-supervised adaptation framework for Snapshot Compressive Imaging.
arXiv Detail & Related papers (2024-09-11T17:02:10Z) - Pseudo-triplet Guided Few-shot Composed Image Retrieval [20.040511832864503]
Composed Image Retrieval (CIR) is a challenging task that aims to retrieve the target image with a multimodal query.
We propose a novel two-stage pseudo triplet guided few-shot CIR scheme, dubbed PTG-FSCIR.
In the first stage, we propose an attentive masking and captioning-based pseudo triplet generation method, to construct pseudo triplets from pure image data.
In the second stage, we propose a challenging triplet-based CIR fine-tuning method, where we design a pseudo modification text-based sample challenging score estimation strategy.
arXiv Detail & Related papers (2024-07-08T14:53:07Z) - LIP-Loc: LiDAR Image Pretraining for Cross-Modal Localization [0.9562145896371785]
We apply Contrastive Language-Image Pre-Training to the domains of 2D image and 3D LiDAR points on the task of cross-modal localization.
Our method outperforms state-of-the-art recall@1 accuracy on the KITTI-360 dataset by 22.4%, using only perspective images.
We also demonstrate the zero-shot capabilities of our model and we beat SOTA by 8% without even training on it.
arXiv Detail & Related papers (2023-12-27T17:23:57Z) - Sample Less, Learn More: Efficient Action Recognition via Frame Feature
Restoration [59.6021678234829]
We propose a novel method to restore the intermediate features for two sparsely sampled and adjacent video frames.
With the integration of our method, the efficiency of three commonly used baselines has been improved by over 50%, with a mere 0.5% reduction in recognition accuracy.
arXiv Detail & Related papers (2023-07-27T13:52:42Z) - Class Anchor Margin Loss for Content-Based Image Retrieval [97.81742911657497]
We propose a novel repeller-attractor loss that falls in the metric learning paradigm, yet directly optimizes for the L2 metric without the need to generate pairs.
We evaluate the proposed objective in the context of few-shot and full-set training on the CBIR task, by using both convolutional and transformer architectures.
arXiv Detail & Related papers (2023-06-01T12:53:10Z) - Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR [103.51937218213774]
This paper advances the fine-grained sketch-based image retrieval (FG-SBIR) literature by putting forward a strong baseline that surpasses the prior state of the art by 11%.
We propose a simple modification to the standard triplet loss, that explicitly enforces separation amongst photos/sketch instances.
We employ an intra-modal triplet loss amongst sketches to bring sketches of the same instance closer together than those of other instances, and another amongst photos to push away different photo instances.
arXiv Detail & Related papers (2023-03-24T03:34:33Z) - Transformers and CNNs both Beat Humans on SBIR [3.364554138758565]
Sketch-based image retrieval (SBIR) is the task of retrieving natural images (photos) that match the semantics of hand-drawn sketch queries.
In this paper, we study classic triplet-based solutions and show that a persistent invariance to horizontal flip (even after model fine-tuning) is harming performance.
Our best model achieves a recall of 62.25% (at k = 1) on the Sketchy benchmark, compared to 46.2% for previous state-of-the-art methods.
arXiv Detail & Related papers (2022-09-14T13:28:37Z) - Towards Lightweight Super-Resolution with Dual Regression Learning [58.98801753555746]
Deep neural networks have exhibited remarkable performance in image super-resolution (SR) tasks.
The SR problem is typically ill-posed, and existing methods come with several limitations.
We propose a dual regression learning scheme to reduce the space of possible SR mappings.
arXiv Detail & Related papers (2022-07-16T12:46:10Z) - Stable Optimization for Large Vision Model Based Deep Image Prior in Cone-Beam CT Reconstruction [6.558735319783205]
Large Vision Models (LVMs) have recently demonstrated great potential for medical imaging tasks.
Deep Image Prior (DIP) effectively guides an untrained neural network to generate high-quality CBCT images without any training data.
We propose a stable optimization method for the forward-model-free DIP model for sparse-view CBCT.
arXiv Detail & Related papers (2022-03-23T15:16:29Z) - When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
To achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the queries of the decoder from the inputs, enabling the model to achieve accuracy as good as that of models with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z) - More Photos are All You Need: Semi-Supervised Learning for Fine-Grained Sketch Based Image Retrieval [112.1756171062067]
We introduce a novel semi-supervised framework for cross-modal retrieval.
At the centre of our design is a sequential photo-to-sketch generation model.
We also introduce a discriminator guided mechanism to guide against unfaithful generation.
arXiv Detail & Related papers (2021-03-25T17:27:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.