Training Vision Transformers for Image Retrieval
- URL: http://arxiv.org/abs/2102.05644v1
- Date: Wed, 10 Feb 2021 18:56:41 GMT
- Title: Training Vision Transformers for Image Retrieval
- Authors: Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, Hervé Jégou
- Abstract summary: We adopt vision transformers for generating image descriptors and train the resulting model with a metric learning objective.
Our results show consistent and significant improvements of transformers over convolution-based approaches.
- Score: 32.09708181236154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have shown outstanding results for natural language
understanding and, more recently, for image classification. We here extend this
work and propose a transformer-based approach for image retrieval: we adopt
vision transformers for generating image descriptors and train the resulting
model with a metric learning objective, which combines a contrastive loss with
a differential entropy regularizer. Our results show consistent and significant
improvements of transformers over convolution-based approaches. In particular,
our method outperforms the state of the art on several public benchmarks for
category-level retrieval, namely Stanford Online Products, In-Shop and CUB-200.
Furthermore, our experiments on ROxford and RParis also show that, in
comparable settings, transformers are competitive for particular object
retrieval, especially in the regime of short vector representations and
low-resolution images.
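
The abstract's training objective combines a contrastive loss with a differential entropy regularizer. Below is a minimal, illustrative PyTorch sketch of such an objective, assuming L2-normalized image descriptors, a standard margin-based contrastive loss, and a Kozachenko-Leonenko-style nearest-neighbor entropy estimator; the margin and the weighting factor `lam` are placeholder hyperparameters, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z, labels, margin=0.5):
    """Pairwise contrastive loss on L2-normalized embeddings z of shape (N, D)."""
    z = F.normalize(z, dim=1)
    dist = torch.cdist(z, z)                                  # (N, N) pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = same & ~eye                                         # positive pairs: same label, excluding self
    neg = ~same                                               # negative pairs: different labels
    loss_pos = dist[pos].pow(2).mean() if pos.any() else z.new_zeros(())
    loss_neg = F.relu(margin - dist[neg]).pow(2).mean() if neg.any() else z.new_zeros(())
    return loss_pos + loss_neg

def differential_entropy_regularizer(z, eps=1e-8):
    """Nearest-neighbor (Kozachenko-Leonenko style) entropy estimate:
    minimizing this term pushes embeddings to spread out on the sphere."""
    z = F.normalize(z, dim=1)
    dist = torch.cdist(z, z)
    dist.fill_diagonal_(float("inf"))                         # ignore distance to self
    nn_dist, _ = dist.min(dim=1)                              # distance to each point's nearest neighbor
    return -torch.log(nn_dist + eps).mean()

def metric_learning_objective(z, labels, lam=0.7):
    # Total objective: contrastive loss plus weighted differential entropy regularizer.
    return contrastive_loss(z, labels) + lam * differential_entropy_regularizer(z)
```

In this sketch the regularizer discourages embeddings from collapsing onto a few directions, which is the stated motivation for pairing it with the contrastive term.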
Related papers
- Evidential Transformers for Improved Image Retrieval [7.397099215417549]
We introduce the Evidential Transformer, an uncertainty-driven transformer model for improved and robust image retrieval.
We incorporate probabilistic methods into image retrieval, achieving robust and reliable results.
arXiv Detail & Related papers (2024-09-02T09:10:47Z)
- Image Deblurring by Exploring In-depth Properties of Transformer [86.7039249037193]
We leverage deep features extracted from a pretrained vision transformer (ViT) to encourage recovered images to be sharp without sacrificing the performance measured by the quantitative metrics.
By comparing transformer features between the recovered image and the target image, the pretrained transformer provides high-resolution, blur-sensitive semantic information.
One approach regards the features as vectors and computes the discrepancy between representations extracted from the recovered and target images in Euclidean space.
arXiv Detail & Related papers (2023-03-24T14:14:25Z)
- Learning Explicit Object-Centric Representations with Vision Transformers [81.38804205212425]
We build on the self-supervision task of masked autoencoding and explore its effectiveness for learning object-centric representations with transformers.
We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
arXiv Detail & Related papers (2022-10-25T16:39:49Z)
- Transformers For Recognition In Overhead Imagery: A Reality Check [0.0]
We compare the impact of adding transformer structures into state-of-the-art segmentation models for overhead imagery.
Our results suggest that transformers provide consistent, but modest, performance improvements.
arXiv Detail & Related papers (2022-10-23T02:17:31Z)
- Towards End-to-End Image Compression and Analysis with Transformers [99.50111380056043]
We propose an end-to-end image compression and analysis model with Transformers, targeting the cloud-based image classification application.
We aim to redesign the Vision Transformer (ViT) model to perform image classification from the compressed features and facilitate image compression with the long-term information from the Transformer.
Experimental results demonstrate the effectiveness of the proposed model in both the image compression and the classification tasks.
arXiv Detail & Related papers (2021-12-17T03:28:14Z)
- Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although the network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
- End-to-End Trainable Multi-Instance Pose Estimation with Transformers [68.93512627479197]
We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer.
Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image.
Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
arXiv Detail & Related papers (2021-03-22T18:19:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.