Investigating the Vision Transformer Model for Image Retrieval Tasks
- URL: http://arxiv.org/abs/2101.03771v1
- Date: Mon, 11 Jan 2021 08:59:54 GMT
- Title: Investigating the Vision Transformer Model for Image Retrieval Tasks
- Authors: Socratis Gkelios, Yiannis Boutalis, Savvas A. Chatzichristofis
- Abstract summary: This paper introduces a plug-and-play descriptor that can be effectively adopted for image retrieval tasks without prior preparation.
The proposed description method utilizes the recently proposed Vision Transformer network and requires no training data to adjust its parameters.
In image retrieval tasks, the use of global and local descriptors has, over recent years, been successfully replaced by Convolutional Neural Network (CNN)-based methods.
- Score: 1.375062426766416
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a plug-and-play descriptor that can be effectively
adopted for image retrieval tasks without prior initialization or preparation.
The description method utilizes the recently proposed Vision Transformer
network and requires no training data to adjust its parameters. In
image retrieval tasks, the use of handcrafted global and local descriptors has,
over recent years, been successfully replaced by Convolutional
Neural Network (CNN)-based methods. However, the experimental evaluation
conducted in this paper on several benchmarking datasets against 36
state-of-the-art descriptors from the literature demonstrates that a neural
network that contains no convolutional layer, such as Vision Transformer, can
shape a global descriptor and achieve competitive results. As fine-tuning is
not required, the presented methodology's low complexity encourages adoption of
the architecture as an image retrieval baseline model, replacing the
traditional and well-adopted CNN-based approaches and inaugurating a new era in
image retrieval.
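As a concrete illustration of the plug-and-play idea, the following minimal sketch builds a training-free ViT global descriptor; the pretrained timm ViT-B/16 checkpoint and plain L2-normalized pooled features are assumptions, not necessarily the paper's exact descriptor construction.

import torch
import timm
from timm.data import resolve_data_config, create_transform
from PIL import Image

# Pretrained ViT with the classifier removed: model(x) returns pooled features.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
model.eval()
transform = create_transform(**resolve_data_config({}, model=model))

@torch.no_grad()
def describe(path):
    # One L2-normalized global descriptor per image, no fine-tuning involved.
    x = transform(Image.open(path).convert("RGB")).unsqueeze(0)
    feat = model(x)                                  # (1, 768) pooled features
    return torch.nn.functional.normalize(feat, dim=-1).squeeze(0)

# Retrieval reduces to cosine similarity between query and database descriptors:
# db = torch.stack([describe(p) for p in database_paths])   # database index
# scores = db @ describe(query_path)                        # cosine similarity
# ranking = scores.argsort(descending=True)                 # best matches first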
Related papers
- T-TAME: Trainable Attention Mechanism for Explaining Convolutional
Networks and Vision Transformers [9.284740716447342]
"Black box" nature of neural networks is a barrier to adoption in applications where explainability is essential.
This paper presents T-TAME, a Transformer-compatible Trainable Attention Mechanism for Explanations.
The proposed architecture and training technique can be easily applied to any convolutional or Vision Transformer-like neural network.
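A minimal sketch of the general idea, not T-TAME's actual architecture: a small trainable head maps frozen backbone feature maps to a saliency map (the layer sizes and the training signal below are assumptions).

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionExplainer(nn.Module):
    # Trainable head: backbone feature maps -> [0, 1] saliency map.
    def __init__(self, in_channels):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=1), nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=1), nn.Sigmoid(),
        )

    def forward(self, feats, image_hw):
        mask = self.head(feats)                      # (B, 1, h, w)
        return F.interpolate(mask, size=image_hw, mode="bilinear",
                             align_corners=False)    # upsample to image size

# Assumed training signal: a frozen classifier should keep its prediction on
# the masked image while the mask stays sparse, e.g.
#   loss = cross_entropy(classifier(image * mask), label) + lam * mask.mean()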
arXiv Detail & Related papers (2024-03-07T14:25:03Z)
- Transformer-based Clipped Contrastive Quantization Learning for
Unsupervised Image Retrieval [15.982022297570108]
Unsupervised image retrieval aims to learn important visual characteristics without any labels in order to retrieve images similar to a given query image.
In this paper, we propose the TransClippedCLR model, which encodes the global context of an image with a Transformer while capturing local context through patch-based processing.
Results using the proposed clipped contrastive learning are greatly improved on all datasets compared to the same backbone network trained with vanilla contrastive learning.
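A minimal sketch of contrastive (InfoNCE) learning with clipped logits; the exact clipping rule used by TransClippedCLR is an assumption here.

import torch
import torch.nn.functional as F

def clipped_info_nce(z1, z2, temperature=0.2, clip=5.0):
    # z1, z2: (B, D) embeddings of two augmented views of the same images.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature        # (B, B) similarity logits
    logits = logits.clamp(-clip, clip)        # clip extreme logits
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)   # positives lie on the diagonal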
arXiv Detail & Related papers (2024-01-27T09:39:11Z)
- Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
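A minimal sketch of the distance-weighting idea: attention logits are penalized by the spatial distance between patch centers (the exact bias form used by DWT is an assumption).

import torch

def distance_weighted_attention(q, k, v, coords, alpha=0.1):
    # q, k, v: (B, N, D) patch tokens; coords: (N, 2) float patch centers.
    d = q.size(-1)
    dist = torch.cdist(coords, coords)            # (N, N) pairwise distances
    logits = q @ k.transpose(-2, -1) / d ** 0.5   # scaled dot-product scores
    logits = logits - alpha * dist                # favor nearby patches
    return torch.softmax(logits, dim=-1) @ v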
arXiv Detail & Related papers (2023-10-11T12:46:11Z)
- Zero-shot Composed Text-Image Retrieval [72.43790281036584]
We consider the problem of composed image retrieval (CIR).
It aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's ability to express their intent.
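A minimal zero-shot sketch using OpenAI's clip package: the reference image and the modification text are fused by averaging their embeddings. The model choice and the averaging rule are assumptions, not the paper's method.

import torch
import clip                     # OpenAI CLIP, installed from its GitHub repo
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def composed_query(image_path, text):
    img = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    img_f = model.encode_image(img).float()
    txt_f = model.encode_text(clip.tokenize([text]).to(device)).float()
    fused = (img_f + txt_f) / 2               # assumed fusion: plain averaging
    return torch.nn.functional.normalize(fused, dim=-1)

# Retrieval: rank database image embeddings by cosine similarity to the query.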
arXiv Detail & Related papers (2023-06-12T17:56:01Z)
- RobustCaps: a transformation-robust capsule network for image
classification [6.445605125467574]
We present a deep neural network model that exhibits the desirable property of transformation-robustness.
Our model, termed RobustCaps, uses group-equivariant convolutions in an improved capsule network model.
It achieves state-of-the-art accuracies on CIFAR-10, FashionMNIST, and CIFAR-100 datasets.
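A minimal sketch of the group-equivariance idea in plain PyTorch: the same filter is applied at four 90-degree orientations and pooled, so responses are stable under such rotations. RobustCaps uses richer group-equivariant convolutions inside a capsule network.

import torch
import torch.nn as nn
import torch.nn.functional as F

class C4InvariantConv(nn.Module):
    # One filter applied at 0/90/180/270 degrees, pooled over orientations.
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.1)

    def forward(self, x):
        outs = [F.conv2d(x, torch.rot90(self.weight, r, dims=(2, 3)),
                         padding=self.weight.size(-1) // 2)
                for r in range(4)]
        return torch.stack(outs, 0).max(0).values   # orientation max-pool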
arXiv Detail & Related papers (2022-10-20T08:42:33Z)
- DELAD: Deep Landweber-guided deconvolution with Hessian and sparse prior [0.22940141855172028]
We present a model for non-blind image deconvolution that incorporates the classic iterative method into a deep learning application.
We build our network based on the iterative Landweber deconvolution algorithm, which is integrated with trainable convolutional layers to enhance the recovered image structures and details.
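The underlying Landweber iteration is x_{t+1} = x_t + lam * K^T (y - K x_t), where K is the known blur operator. Below is a minimal sketch of the classic, non-trainable iteration; DELAD unrolls such steps and interleaves trainable convolutional layers.

import torch
import torch.nn.functional as F

def landweber_deconv(y, kernel, steps=50, lam=1.0):
    # y: (B, 1, H, W) blurred image; kernel: (1, 1, k, k) known blur kernel.
    pad = kernel.size(-1) // 2
    adjoint = torch.flip(kernel, dims=(2, 3))   # adjoint of the blur operator
    x = y.clone()
    for _ in range(steps):
        residual = y - F.conv2d(x, kernel, padding=pad)        # data mismatch
        x = x + lam * F.conv2d(residual, adjoint, padding=pad) # gradient step
    return x.clamp(0, 1)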
arXiv Detail & Related papers (2022-09-30T11:15:03Z)
- Simple Open-Vocabulary Object Detection with Vision Transformers [51.57562920090721]
We propose a strong recipe for transferring image-text models to open-vocabulary object detection.
We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning.
We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection.
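A minimal sketch of the open-vocabulary classification step in this recipe: embedded class-name prompts act as the detection head's classifier weights, so new categories require no retraining (the temperature and normalization details are assumptions).

import torch
import torch.nn.functional as F

def open_vocab_logits(box_features, text_embeddings, temperature=0.01):
    # box_features: (N, D) per-proposal image features.
    # text_embeddings: (C, D) text-encoder embeddings of class-name prompts.
    b = F.normalize(box_features, dim=-1)
    t = F.normalize(text_embeddings, dim=-1)
    return b @ t.t() / temperature              # (N, C) per-class scores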
arXiv Detail & Related papers (2022-05-12T17:20:36Z)
- Learning Transformer Features for Image Quality Assessment [53.51379676690971]
We propose a unified IQA framework that utilizes a CNN backbone and a transformer encoder to extract features.
The proposed framework is compatible with both full-reference (FR) and no-reference (NR) modes and allows for a joint training scheme.
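A minimal sketch of the CNN-backbone-plus-transformer-encoder pattern; the backbone choice, layer sizes, and pooling are illustrative assumptions rather than the paper's configuration.

import torch
import torch.nn as nn
import torchvision

class IQAModel(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        resnet = torchvision.models.resnet18(weights="DEFAULT")
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Conv2d(512, d_model, kernel_size=1)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score = nn.Linear(d_model, 1)

    def forward(self, x):                       # x: (B, 3, H, W)
        f = self.proj(self.backbone(x))         # (B, d_model, h, w)
        tokens = f.flatten(2).transpose(1, 2)   # (B, h*w, d_model) tokens
        return self.score(self.encoder(tokens).mean(dim=1))   # (B, 1) score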
arXiv Detail & Related papers (2021-12-01T13:23:00Z)
- GLiT: Neural Architecture Search for Global and Local Image Transformer [114.8051035856023]
We introduce the first Neural Architecture Search (NAS) method to find a better transformer architecture for image recognition.
Our method can find more discriminative and efficient transformer variants than the ResNet family and the baseline ViT for image classification.
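A minimal sketch of searching over global/local block layouts; plain random search stands in here for GLiT's actual hierarchical search strategy.

import random

def sample_architecture(num_blocks=12):
    # Each block is a global self-attention or a local (convolution-style)
    # module; GLiT additionally searches per-block hyperparameters.
    return [random.choice(["global", "local"]) for _ in range(num_blocks)]

def random_search(evaluate, trials=100):
    # evaluate: callable mapping an architecture to validation accuracy.
    best, best_score = None, float("-inf")
    for _ in range(trials):
        arch = sample_architecture()
        score = evaluate(arch)
        if score > best_score:
            best, best_score = arch, score
    return best, best_score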
arXiv Detail & Related papers (2021-07-07T00:48:09Z)
- Unsupervised Metric Relocalization Using Transform Consistency Loss [66.19479868638925]
Training networks to perform metric relocalization traditionally requires accurate image correspondences.
We propose a self-supervised solution, which exploits a key insight: localizing a query image within a map should yield the same absolute pose, regardless of the reference image used for registration.
We evaluate our framework on synthetic and real-world data, showing our approach outperforms other supervised methods when a limited amount of ground-truth information is available.
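A minimal sketch of the consistency idea: compose each relative estimate with its reference pose and penalize disagreement between the resulting absolute poses (the 2D additive composition is a deliberate simplification of proper SE(2)/SE(3) composition).

import torch
import torch.nn.functional as F

def transform_consistency_loss(rel_a, ref_a, rel_b, ref_b):
    # rel_*: (B, 3) query pose relative to each reference (x, y, yaw).
    # ref_*: (B, 3) absolute reference poses.
    abs_a = ref_a + rel_a             # simplified additive composition
    abs_b = ref_b + rel_b
    return F.mse_loss(abs_a, abs_b)   # same query => same absolute pose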
arXiv Detail & Related papers (2020-11-01T19:24:27Z)
- Image Retrieval using Multi-scale CNN Features Pooling [26.811290793232313]
We present an end-to-end trainable network architecture that exploits a novel multi-scale local pooling based on NetVLAD and a triplet mining procedure based on sample difficulty to obtain an effective image representation.
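A minimal sketch of the two ingredients named above: global average pooling at several scales stands in for the paper's NetVLAD-based local pooling, and batch-hard mining for its sample-difficulty-based triplet procedure.

import torch
import torch.nn.functional as F

def multiscale_descriptor(feature_map):
    # feature_map: (B, C, H, W) CNN activations. Pool at several scales and
    # concatenate into a single L2-normalized descriptor.
    pooled = [F.adaptive_avg_pool2d(feature_map, s).flatten(1)
              for s in (1, 2, 4)]                      # (B, C*s*s) each
    return F.normalize(torch.cat(pooled, dim=1), dim=1)

def batch_hard_triplet(desc, labels, margin=0.3):
    # Pick the hardest positive and hardest negative per anchor in the batch.
    dist = torch.cdist(desc, desc)                     # (B, B) distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = (dist * same.float()).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()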
arXiv Detail & Related papers (2020-04-21T00:57:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.