Related papers: A Recipe for Efficient SBIR Models: Combining Relative Triplet Loss with Batch Normalization and Knowledge Distillation

A Recipe for Efficient SBIR Models: Combining Relative Triplet Loss with Batch Normalization and Knowledge Distillation

URL: http://arxiv.org/abs/2305.18988v1
Date: Tue, 30 May 2023 12:41:04 GMT
Title: A Recipe for Efficient SBIR Models: Combining Relative Triplet Loss with Batch Normalization and Knowledge Distillation
Authors: Omar Seddati, Nathan Hubens, St\'ephane Dupont, Thierry Dutoit
Abstract summary: Sketch-Based Image Retrieval (SBIR) is a crucial task in multimedia retrieval, where the goal is to retrieve a set of images that match a given sketch query. We introduce a Relative Triplet Loss (RTL), an adapted triplet loss to overcome limitations through loss weighting based on anchors similarity. We propose a straightforward approach to train small models efficiently with a marginal loss of accuracy through knowledge distillation.
Score: 3.364554138758565
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Sketch-Based Image Retrieval (SBIR) is a crucial task in multimedia retrieval, where the goal is to retrieve a set of images that match a given sketch query. Researchers have already proposed several well-performing solutions for this task, but most focus on enhancing embedding through different approaches such as triplet loss, quadruplet loss, adding data augmentation, and using edge extraction. In this work, we tackle the problem from various angles. We start by examining the training data quality and show some of its limitations. Then, we introduce a Relative Triplet Loss (RTL), an adapted triplet loss to overcome those limitations through loss weighting based on anchors similarity. Through a series of experiments, we demonstrate that replacing a triplet loss with RTL outperforms previous state-of-the-art without the need for any data augmentation. In addition, we demonstrate why batch normalization is more suited for SBIR embeddings than l2-normalization and show that it improves significantly the performance of our models. We further investigate the capacity of models required for the photo and sketch domains and demonstrate that the photo encoder requires a higher capacity than the sketch encoder, which validates the hypothesis formulated in [34]. Then, we propose a straightforward approach to train small models, such as ShuffleNetv2 [22] efficiently with a marginal loss of accuracy through knowledge distillation. The same approach used with larger models enabled us to outperform previous state-of-the-art results and achieve a recall of 62.38% at k = 1 on The Sketchy Database [30].

Related papers

Efficient One-Step Diffusion Refinement for Snapshot Compressive Imaging [8.819370643243012]
Coded Aperture Snapshot Spectral Imaging (CASSI) is a crucial technique for capturing three-dimensional multispectral images (MSIs) Current state-of-the-art methods, predominantly end-to-end, face limitations in reconstructing high-frequency details. This paper introduces a novel one-step Diffusion Probabilistic Model within a self-supervised adaptation framework for Snapshot Compressive Imaging.
arXiv Detail & Related papers (2024-09-11T17:02:10Z)
Pseudo-triplet Guided Few-shot Composed Image Retrieval [20.040511832864503]
Composed Image Retrieval (CIR) is a challenging task that aims to retrieve the target image with a multimodal query. We propose a novel two-stage pseudo triplet guided few-shot CIR scheme, dubbed PTG-FSCIR. In the first stage, we propose an attentive masking and captioning-based pseudo triplet generation method, to construct pseudo triplets from pure image data. In the second stage, we propose a challenging triplet-based CIR fine-tuning method, where we design a pseudo modification text-based sample challenging score estimation strategy.
arXiv Detail & Related papers (2024-07-08T14:53:07Z)
LIP-Loc: LiDAR Image Pretraining for Cross-Modal Localization [0.9562145896371785]
We apply Contrastive Language-Image Pre-Training to the domains of 2D image and 3D LiDAR points on the task of cross-modal localization. Our method outperforms state-of-the-art recall@1 accuracy on the KITTI-360 dataset by 22.4%, using only perspective images. We also demonstrate the zero-shot capabilities of our model and we beat SOTA by 8% without even training on it.
arXiv Detail & Related papers (2023-12-27T17:23:57Z)
Sample Less, Learn More: Efficient Action Recognition via Frame Feature Restoration [59.6021678234829]
We propose a novel method to restore the intermediate features for two sparsely sampled and adjacent video frames. With the integration of our method, the efficiency of three commonly used baselines has been improved by over 50%, with a mere 0.5% reduction in recognition accuracy.
arXiv Detail & Related papers (2023-07-27T13:52:42Z)
Class Anchor Margin Loss for Content-Based Image Retrieval [97.81742911657497]
We propose a novel repeller-attractor loss that falls in the metric learning paradigm, yet directly optimize for the L2 metric without the need of generating pairs. We evaluate the proposed objective in the context of few-shot and full-set training on the CBIR task, by using both convolutional and transformer architectures.
arXiv Detail & Related papers (2023-06-01T12:53:10Z)
Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR [103.51937218213774]
This paper advances the fine-grained sketch-based image retrieval (FG-SBIR) literature by putting forward a strong baseline that overshoots prior state-of-the-arts by 11%. We propose a simple modification to the standard triplet loss, that explicitly enforces separation amongst photos/sketch instances. For (i) we employ an intra-modal triplet loss amongst sketches to bring sketches of the same instance closer from others, and one more amongst photos to push away different photo instances.
arXiv Detail & Related papers (2023-03-24T03:34:33Z)
Transformers and CNNs both Beat Humans on SBIR [3.364554138758565]
Sketch-based image retrieval (SBIR) is the task of retrieving natural images (photos) that match the semantics of hand-drawn sketch queries. In this paper, we study classic triplet-based solutions and show that a persistent invariance to horizontal flip (even after model fine) is harming performance. Our best model achieves a recall of 62.25% (at k = 1) on the sketchy benchmark compared to previous state-of-the-art methods 46.2%.
arXiv Detail & Related papers (2022-09-14T13:28:37Z)
Towards Lightweight Super-Resolution with Dual Regression Learning [58.98801753555746]
Deep neural networks have exhibited remarkable performance in image super-resolution (SR) tasks. The SR problem is typically an ill-posed problem and existing methods would come with several limitations. We propose a dual regression learning scheme to reduce the space of possible SR mappings.
arXiv Detail & Related papers (2022-07-16T12:46:10Z)
Stable Optimization for Large Vision Model Based Deep Image Prior in Cone-Beam CT Reconstruction [6.558735319783205]
Large Vision Model (LVM) has recently demonstrated great potential for medical imaging tasks. Deep Image Prior (DIP) effectively guides an untrained neural network to generate high-quality CBCT images without any training data. We propose a stable optimization method for the forward-model-free DIP model for sparse-view CBCT.
arXiv Detail & Related papers (2022-03-23T15:16:29Z)
When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable. In order to achieve a better accuracy, we propose two lightweight modules. DQInit dynamically initializes the queries of decoder from the inputs, enabling the model to achieve as good accuracy as the ones with multiple decoder layers. QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
More Photos are All You Need: Semi-Supervised Learning for Fine-Grained Sketch Based Image Retrieval [112.1756171062067]
We introduce a novel semi-supervised framework for cross-modal retrieval. At the centre of our design is a sequential photo-to-sketch generation model. We also introduce a discriminator guided mechanism to guide against unfaithful generation.
arXiv Detail & Related papers (2021-03-25T17:27:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.