Transformers and CNNs both Beat Humans on SBIR
- URL: http://arxiv.org/abs/2209.06629v1
- Date: Wed, 14 Sep 2022 13:28:37 GMT
- Title: Transformers and CNNs both Beat Humans on SBIR
- Authors: Omar Seddati, Stéphane Dupont, Saïd Mahmoudi, Thierry Dutoit
- Abstract summary: Sketch-based image retrieval (SBIR) is the task of retrieving natural images (photos) that match the semantics of hand-drawn sketch queries.
In this paper, we study classic triplet-based solutions and show that a persistent invariance to horizontal flip (even after model finetuning) is harming performance.
Our best model achieves a recall of 62.25% (at k = 1) on the Sketchy benchmark, compared to 46.2% for the previous state-of-the-art method.
- Score: 3.364554138758565
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sketch-based image retrieval (SBIR) is the task of retrieving natural images
(photos) that match the semantics and the spatial configuration of hand-drawn
sketch queries. The universality of sketches extends the scope of possible
applications and increases the demand for efficient SBIR solutions. In this
paper, we study classic triplet-based SBIR solutions and show that a persistent
invariance to horizontal flip (even after model finetuning) is harming
performance. To overcome this limitation, we propose several approaches and
evaluate in depth each of them to check their effectiveness. Our main
contributions are twofold: We propose and evaluate several intuitive
modifications to build SBIR solutions with better flip equivariance. We show
that vision transformers are more suited for the SBIR task, and that they
outperform CNNs by a large margin. We carried out numerous experiments and
introduce the first models to outperform human performance on a large-scale
SBIR benchmark (Sketchy). Our best model achieves a recall of 62.25% (at k = 1)
on the Sketchy benchmark, compared to 46.2% for the previous state-of-the-art method.
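The classic triplet objective and the recall-at-k metric the abstract refers to can be sketched as follows. This is a generic, minimal illustration of triplet-based SBIR training and evaluation in a shared embedding space, not the paper's actual code; all function names are hypothetical and the embeddings are plain NumPy vectors standing in for model outputs.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Classic triplet loss: pull the sketch embedding (anchor) toward its
    matching photo (positive) and push it away from a non-matching photo
    (negative), enforcing a separation of at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def recall_at_k(sketch_embs, photo_embs, k=1):
    """Recall@k: fraction of sketches whose matching photo (same row index)
    appears among the k nearest photos in embedding space."""
    hits = 0
    for i, sketch in enumerate(sketch_embs):
        dists = np.linalg.norm(photo_embs - sketch, axis=1)
        top_k = np.argsort(dists)[:k]
        hits += int(i in top_k)
    return hits / len(sketch_embs)
```

Under this setup, the 62.25% figure above corresponds to `recall_at_k(..., k=1)`: for 62.25% of sketch queries, the single nearest photo in the embedding space is the correct match.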
Related papers
- Exploiting Distribution Constraints for Scalable and Efficient Image Retrieval [1.6874375111244329]
State-of-the-art image retrieval systems train specific neural networks for each dataset.
Off-the-shelf foundation models fall short in achieving performance comparable to dataset-specific models.
We introduce Autoencoders with Strong Variance Constraints (AE-SVC), which significantly improves the performance of foundation models.
arXiv Detail & Related papers (2024-10-09T16:05:16Z)
- A Simple and Generalist Approach for Panoptic Segmentation [57.94892855772925]
Generalist vision models aim for one and the same architecture for a variety of vision tasks.
While such a shared architecture may seem attractive, generalist models tend to be outperformed by their bespoke counterparts.
We address this problem by introducing two key contributions, without compromising the desirable properties of generalist models.
arXiv Detail & Related papers (2024-08-29T13:02:12Z) - Towards Self-Supervised FG-SBIR with Unified Sample Feature Alignment and Multi-Scale Token Recycling [11.129453244307369]
FG-SBIR aims to minimize the distance between sketches and corresponding images in the embedding space.
We propose an effective approach to narrow the gap between the two domains.
It mainly facilitates unified mutual information sharing both intra- and inter-samples.
arXiv Detail & Related papers (2024-06-17T13:49:12Z) - Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers [7.89533262149443]
Self-attention in Transformers comes with a high computational cost because of its quadratic complexity.
Our benchmark shows that using a larger model in general is more efficient than using higher resolution images.
arXiv Detail & Related papers (2023-08-18T08:06:49Z) - Sample Less, Learn More: Efficient Action Recognition via Frame Feature
Restoration [59.6021678234829]
We propose a novel method to restore the intermediate features for two sparsely sampled and adjacent video frames.
With the integration of our method, the efficiency of three commonly used baselines has been improved by over 50%, with a mere 0.5% reduction in recognition accuracy.
arXiv Detail & Related papers (2023-07-27T13:52:42Z) - A Recipe for Efficient SBIR Models: Combining Relative Triplet Loss with
Batch Normalization and Knowledge Distillation [3.364554138758565]
Sketch-Based Image Retrieval (SBIR) is a crucial task in multimedia retrieval, where the goal is to retrieve a set of images that match a given sketch query.
We introduce a Relative Triplet Loss (RTL), an adapted triplet loss that overcomes these limitations through loss weighting based on anchor similarity.
We propose a straightforward approach to train small models efficiently with a marginal loss of accuracy through knowledge distillation.
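The knowledge-distillation recipe mentioned above can be illustrated with a generic sketch. The paper's exact formulation is not given here; this shows one common distillation recipe for retrieval models, in which a small student network is trained to reproduce the embeddings of a frozen, larger teacher. All names are hypothetical.

```python
import numpy as np

def embedding_distillation_loss(student_embs, teacher_embs):
    """Generic embedding distillation: mean squared error between the
    student's embeddings and the frozen teacher's embeddings for the
    same batch of inputs. Minimizing this trains the small student to
    mimic the teacher's embedding space."""
    student_embs = np.asarray(student_embs, dtype=float)
    teacher_embs = np.asarray(teacher_embs, dtype=float)
    return float(np.mean((student_embs - teacher_embs) ** 2))
```

In practice the teacher is run once per batch with gradients disabled, and only the student's parameters are updated against this loss (possibly combined with the retrieval loss itself).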
arXiv Detail & Related papers (2023-05-30T12:41:04Z)
- Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
arXiv Detail & Related papers (2022-08-08T09:08:40Z)
- Blind Face Restoration: Benchmark Datasets and a Baseline Model [63.053331687284064]
Blind Face Restoration (BFR) aims to construct a high-quality (HQ) face image from its corresponding low-quality (LQ) input.
We first synthesize two blind face restoration benchmark datasets called EDFace-Celeb-1M (BFR128) and EDFace-Celeb-150K (BFR512).
State-of-the-art methods are benchmarked on them under five settings including blur, noise, low resolution, JPEG compression artifacts, and the combination of them (full degradation).
arXiv Detail & Related papers (2022-06-08T06:34:24Z)
- Efficient Person Search: An Anchor-Free Approach [86.45858994806471]
Person search aims to simultaneously localize and identify a query person from realistic, uncropped images.
To achieve this goal, state-of-the-art models typically add a re-id branch upon two-stage detectors like Faster R-CNN.
In this work, we present an anchor-free approach to efficiently tackle this challenging task, by introducing the following dedicated designs.
arXiv Detail & Related papers (2021-09-01T07:01:33Z)
- Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture.
We aim at learning pedestrian representations based on object center and scale rather than direct bounding box predictions.
Results show our method's effectiveness in detecting small-scale pedestrians.
arXiv Detail & Related papers (2020-08-19T13:13:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.