Instance-level Image Retrieval using Reranking Transformers
- URL: http://arxiv.org/abs/2103.12236v1
- Date: Mon, 22 Mar 2021 23:58:38 GMT
- Title: Instance-level Image Retrieval using Reranking Transformers
- Authors: Fuwen Tan, Jiangbo Yuan, Vicente Ordonez
- Abstract summary: Instance-level image retrieval is the task of searching in a large database for images that match an object in a query image.
We propose Reranking Transformers (RRTs) as a general model to incorporate both local and global features to rerank the matching images.
RRTs are lightweight and can be easily parallelized so that reranking a set of top matching results can be performed in a single forward-pass.
- Score: 18.304597755595697
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instance-level image retrieval is the task of searching in a large database
for images that match an object in a query image. To address this task, systems
usually rely on a retrieval step that uses global image descriptors, and a
subsequent step that performs domain-specific refinements or reranking by
leveraging operations such as geometric verification based on local features.
In this work, we propose Reranking Transformers (RRTs) as a general model to
incorporate both local and global features to rerank the matching images in a
supervised fashion and thus replace the relatively expensive process of
geometric verification. RRTs are lightweight and can be easily parallelized so
that reranking a set of top matching results can be performed in a single
forward-pass. We perform extensive experiments on the Revisited Oxford and
Paris datasets, and the Google Landmark v2 dataset, showing that RRTs
outperform previous reranking approaches while using far fewer local
descriptors. Moreover, we demonstrate that, unlike existing approaches, RRTs
can be optimized jointly with the feature extractor, which can lead to feature
representations tailored to downstream tasks and further accuracy improvements.
Training code and pretrained models will be made public.
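For intuition, a minimal PyTorch sketch of this kind of transformer-based reranker is given below. The token layout, dimensions, and scoring head are illustrative assumptions rather than the exact RRT architecture; the sketch only shows how a set of top matches can be rescored in one batched forward pass.

```python
# Minimal sketch of a transformer-based reranker in the spirit of RRTs.
# Dimensions, token layout, and the scoring head are illustrative
# assumptions, not the exact architecture from the paper.
import torch
import torch.nn as nn

class RerankingTransformer(nn.Module):
    def __init__(self, dim=128, num_layers=4, num_heads=4):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))   # learned [CLS] token
        self.seg = nn.Embedding(2, dim)                    # query vs. candidate segment
        layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(dim, 1)                      # matching score from [CLS]

    def forward(self, q_global, q_locals, c_global, c_locals):
        # q_global/c_global: (B, D); q_locals/c_locals: (B, L, D)
        q = torch.cat([q_global.unsqueeze(1), q_locals], dim=1) + self.seg.weight[0]
        c = torch.cat([c_global.unsqueeze(1), c_locals], dim=1) + self.seg.weight[1]
        tokens = torch.cat([self.cls.expand(q.size(0), -1, -1), q, c], dim=1)
        out = self.encoder(tokens)
        return self.head(out[:, 0]).squeeze(-1)           # one score per pair

# Reranking the top-k candidates of one query in a single batched forward pass:
model = RerankingTransformer()
k, D, L = 10, 128, 100
q_g, q_l = torch.randn(1, D), torch.randn(1, L, D)
c_g, c_l = torch.randn(k, D), torch.randn(k, L, D)
scores = model(q_g.expand(k, -1), q_l.expand(k, -1, -1), c_g, c_l)
order = scores.argsort(descending=True)                   # new ranking of the top-k
```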
Related papers
- Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
The task of composed image retrieval (CIR) aims to retrieve images based on a query image and text describing the user's intent.
Existing methods have made great progress by leveraging advanced large vision-language (VL) models for CIR; however, they generally suffer from two main issues: a lack of labeled triplets for model training and the difficulty of deployment in resource-restricted environments.
We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning.
arXiv Detail & Related papers (2024-03-03T07:58:03Z)
- Graph Convolution Based Efficient Re-Ranking for Visual Retrieval [29.804582207550478]
We present an efficient re-ranking method which refines initial retrieval results by updating features.
Specifically, we reformulate re-ranking based on Graph Convolution Networks (GCN) and propose a novel Graph Convolution based Re-ranking (GCR) for visual retrieval tasks via feature propagation.
In particular, the plain GCR is extended for cross-camera retrieval and an improved feature propagation formulation is presented to leverage affinity relationships across different cameras.
arXiv Detail & Related papers (2023-06-15T00:28:08Z)
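A minimal sketch of the feature-propagation idea summarized in the entry above might look as follows; the cosine-similarity affinities, the softmax temperature, and the single propagation step with mixing weight alpha are assumptions for illustration, not the exact GCR formulation.

```python
# Sketch of re-ranking by propagating features over a similarity graph.
# The affinity construction and the single propagation step are
# illustrative assumptions, not the exact GCR formulation.
import torch
import torch.nn.functional as F

def graph_rerank(query_feat, gallery_feats, alpha=0.5):
    # query_feat: (D,); gallery_feats: (N, D) initial top-N retrieval results
    x = F.normalize(torch.cat([query_feat.unsqueeze(0), gallery_feats]), dim=1)
    sim = x @ x.t()                          # pairwise cosine affinities
    adj = F.softmax(sim / 0.1, dim=1)        # row-normalized graph weights
    x = alpha * x + (1 - alpha) * adj @ x    # one step of feature propagation
    x = F.normalize(x, dim=1)
    scores = x[1:] @ x[0]                    # updated query-gallery similarities
    return scores.argsort(descending=True)   # refined ranking

new_order = graph_rerank(torch.randn(128), torch.randn(50, 128))
```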
- Zero-shot Composed Text-Image Retrieval [72.43790281036584]
We consider the problem of composed image retrieval (CIR).
It aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, thereby extending the user's ability to express search intent.
arXiv Detail & Related papers (2023-06-12T17:56:01Z)
- $R^{2}$Former: Unified $R$etrieval and $R$eranking Transformer for Place Recognition [92.56937383283397]
We propose a unified place recognition framework that handles both retrieval and reranking.
The proposed reranking module takes feature correlation, attention values, and xy coordinates into account.
$R^{2}$Former significantly outperforms state-of-the-art methods on major VPR datasets.
arXiv Detail & Related papers (2023-04-06T23:19:32Z)
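A hedged sketch of a reranking module of the kind summarized above, whose input tokens carry the correlation, attention values, and xy coordinates of tentative local-feature matches, is shown below; the token layout and pooling are assumptions, not the exact $R^{2}$Former design.

```python
# Sketch of a reranking module whose input tokens carry the correlation,
# attention values, and xy coordinates of tentative local-feature matches.
# Token layout and pooling are assumptions, not the exact R2Former design.
import torch
import torch.nn as nn

class PairReranker(nn.Module):
    def __init__(self, dim=64, num_layers=2, num_heads=4):
        super().__init__()
        # token = [correlation, query attention, ref attention, query xy, ref xy]
        self.embed = nn.Linear(1 + 2 + 4, dim)
        layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(dim, 1)

    def forward(self, corr, attn, xy):
        # corr: (B, M, 1), attn: (B, M, 2), xy: (B, M, 4) for M matched pairs
        tokens = self.embed(torch.cat([corr, attn, xy], dim=-1))
        out = self.encoder(tokens).mean(dim=1)   # pool over matches
        return self.head(out).squeeze(-1)        # one score per image pair

model = PairReranker()
score = model(torch.rand(8, 50, 1), torch.rand(8, 50, 2), torch.rand(8, 50, 4))
```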
- Recursive Generalization Transformer for Image Super-Resolution [108.67898547357127]
We propose the Recursive Generalization Transformer (RGT) for image SR, which can capture global spatial information and is suitable for high-resolution images.
We combine the proposed recursive-generalization self-attention (RG-SA) with local self-attention to better exploit the global context.
Our RGT outperforms recent state-of-the-art methods quantitatively and qualitatively.
arXiv Detail & Related papers (2023-03-11T10:44:44Z)
- SImProv: Scalable Image Provenance Framework for Robust Content Attribution [80.25476792081403]
We present SImProv, a framework to match a query image back to a trusted database of originals.
SImProv consists of three stages: a scalable search stage for retrieving the top-k most similar images; a re-ranking and near-duplicate detection stage for identifying the original among the candidates; and a manipulation detection stage.
We demonstrate effective retrieval and manipulation detection over a dataset of 100 million images.
arXiv Detail & Related papers (2022-06-28T18:42:36Z)
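The retrieve-then-verify pattern behind the pipeline above can be sketched as follows; the global descriptors, the hypothetical fine_pair_score placeholder, and the candidate count are illustrative assumptions, not the paper's actual models.

```python
# Sketch of a retrieve-then-verify provenance pipeline: a cheap global
# search narrows the database to top-k candidates, and a finer (here
# hypothetical) pairwise scorer picks the likely original among them.
import torch
import torch.nn.functional as F

def fine_pair_score(query_img_feat, cand_img_feat):
    # Placeholder for a learned re-ranking / near-duplicate model.
    return F.cosine_similarity(query_img_feat, cand_img_feat, dim=0)

def provenance_search(query_feat, db_feats, k=5):
    # Stage 1: scalable search with global descriptors (inner product).
    coarse = F.normalize(db_feats, dim=1) @ F.normalize(query_feat, dim=0)
    topk = coarse.topk(k).indices
    # Stage 2: re-rank the candidates with the finer pairwise score.
    fine = torch.stack([fine_pair_score(query_feat, db_feats[i]) for i in topk])
    return topk[fine.argmax()]                # index of the likely original

db = torch.randn(100_000, 256)
best = provenance_search(torch.randn(256), db)
```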
- Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information [15.32353270625554]
Cross-modal remote sensing text-image retrieval (RSCTIR) has recently attracted intense research interest because it enables fast and flexible information extraction from remote sensing (RS) images.
We first propose a novel RSCTIR framework based on global and local information (GaLR), and design a multi-level information dynamic fusion (MIDF) module to effectively integrate features of different levels.
Experiments on public datasets demonstrate the state-of-the-art performance of GaLR on the RSCTIR task.
arXiv Detail & Related papers (2022-04-21T03:18:09Z)
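One plausible reading of dynamically fusing global and local features, as in the MIDF module above, is a learned gate over the two streams; the sketch below uses mean-pooled local features and a sigmoid gate purely as assumptions, not the exact GaLR design.

```python
# Sketch of dynamically fusing a global feature with an aggregated local
# feature via a learned gate; the gating design is an illustrative
# assumption, not the exact MIDF module.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, global_feat, local_feats):
        # global_feat: (B, D); local_feats: (B, N, D) region/object features
        local_agg = local_feats.mean(dim=1)               # simple aggregation
        g = self.gate(torch.cat([global_feat, local_agg], dim=-1))
        return g * global_feat + (1 - g) * local_agg      # dynamic mixture

fused = GatedFusion()(torch.randn(4, 512), torch.randn(4, 36, 512))
```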
- Reuse your features: unifying retrieval and feature-metric alignment [3.845387441054033]
DRAN is the first network able to produce features for all three steps of visual localization.
It achieves competitive performance in terms of robustness and accuracy under challenging conditions on public benchmarks.
arXiv Detail & Related papers (2022-04-13T10:42:00Z)
- Fusing Local Similarities for Retrieval-based 3D Orientation Estimation of Unseen Objects [70.49392581592089]
We tackle the task of estimating the 3D orientation of previously-unseen objects from monocular images.
We follow a retrieval-based strategy and prevent the network from learning object-specific features.
Our experiments on the LineMOD, LineMOD-Occluded, and T-LESS datasets show that our method yields a significantly better generalization to unseen objects than previous works.
arXiv Detail & Related papers (2022-03-16T08:53:00Z)
- LoFTR: Detector-Free Local Feature Matching with Transformers [40.754990768677295]
Instead of performing image feature detection, description, and matching sequentially, we propose to first establish pixel-wise dense matches at a coarse level.
In contrast to dense methods that use a cost volume to search correspondences, we use self and cross attention layers in Transformer to obtain feature descriptors that are conditioned on both images.
The experiments on indoor and outdoor datasets show that LoFTR outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-04-01T17:59:42Z)
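The coarse matching stage described above can be sketched as interleaved self- and cross-attention followed by a dual-softmax readout; the layer counts, weight sharing, and matching rule below are assumptions, not the exact LoFTR implementation.

```python
# Sketch of conditioning two images' coarse features on each other with
# interleaved self- and cross-attention, then reading off dense matches
# with a dual-softmax; layer counts and the matching rule are assumptions.
import torch
import torch.nn as nn

class CoarseMatcher(nn.Module):
    def __init__(self, dim=256, num_heads=8, num_layers=2):
        super().__init__()
        self.self_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers))
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers))

    def forward(self, feats_a, feats_b):
        # feats_a: (B, Na, D), feats_b: (B, Nb, D) flattened coarse grids
        for sa, ca in zip(self.self_attn, self.cross_attn):
            feats_a = feats_a + sa(feats_a, feats_a, feats_a)[0]
            feats_b = feats_b + sa(feats_b, feats_b, feats_b)[0]
            feats_a, feats_b = (feats_a + ca(feats_a, feats_b, feats_b)[0],
                                feats_b + ca(feats_b, feats_a, feats_a)[0])
        sim = feats_a @ feats_b.transpose(1, 2) / feats_a.shape[-1] ** 0.5
        # Dual-softmax: a pair matches only if it wins in both directions.
        return sim.softmax(dim=2) * sim.softmax(dim=1)    # (B, Na, Nb) confidences

conf = CoarseMatcher()(torch.randn(1, 300, 256), torch.randn(1, 300, 256))
```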
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.