QuARI: Query Adaptive Retrieval Improvement
- URL: http://arxiv.org/abs/2505.21647v1
- Date: Tue, 27 May 2025 18:21:48 GMT
- Title: QuARI: Query Adaptive Retrieval Improvement
- Authors: Eric Xing, Abby Stylianou, Robert Pless, Nathan Jacobs
- Abstract summary: We show that learning to map a query to a query-specific linear transformation of VLM features can improve instance retrieval by emphasizing subspaces that relate to the domain of interest. Because this transformation is linear, it can be applied with minimal computational cost to millions of image embeddings. Results show that this method consistently outperforms state-of-the-art alternatives, including those that require many orders of magnitude more computation at query time.
- Score: 10.896025071832055
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Massive-scale pretraining has made vision-language models increasingly popular for image-to-image and text-to-image retrieval across a broad collection of domains. However, these models do not perform well when used for challenging retrieval tasks, such as instance retrieval in very large-scale image collections. Recent work has shown that linear transformations of VLM features trained for instance retrieval can improve performance by emphasizing subspaces that relate to the domain of interest. In this paper, we explore a more extreme version of this specialization by learning to map a given query to a query-specific feature space transformation. Because this transformation is linear, it can be applied with minimal computational cost to millions of image embeddings, making it effective for large-scale retrieval or re-ranking. Results show that this method consistently outperforms state-of-the-art alternatives, including those that require many orders of magnitude more computation at query time.
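As a rough illustration of the idea in the abstract, the sketch below uses a small hypernetwork to map a query embedding to a query-specific linear transform, which is then applied to every database embedding before cosine ranking. The low-rank parameterization, network shape, and names (QueryToTransform, rerank) are assumptions for illustration only, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a small network predicts a
# query-specific linear transform; one matmul adapts the whole database.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 512          # embedding dimension of the underlying VLM (assumed)
rank = 32        # low-rank factor to keep the predicted transform cheap (assumed)

class QueryToTransform(nn.Module):
    """Hypothetical hypernetwork: query embedding -> linear map W = I + A @ B."""
    def __init__(self, d, rank):
        super().__init__()
        self.a_head = nn.Linear(d, d * rank)
        self.b_head = nn.Linear(d, rank * d)
        self.d, self.rank = d, rank

    def forward(self, q):                           # q: (d,)
        A = self.a_head(q).view(self.d, self.rank)
        B = self.b_head(q).view(self.rank, self.d)
        return torch.eye(self.d) + A @ B            # (d, d) query-specific transform

def rerank(query_emb, db_embs, model, k=100):
    """Apply the query-specific transform to all database embeddings, then rank."""
    with torch.no_grad():
        W = model(query_emb)                        # (d, d)
        q_t = F.normalize(query_emb @ W.T, dim=-1)
        db_t = F.normalize(db_embs @ W.T, dim=-1)   # one matmul over the whole database
        scores = db_t @ q_t                         # cosine similarity in the adapted space
    return scores.topk(k).indices

# Toy usage with random embeddings standing in for VLM features.
model = QueryToTransform(d, rank)
query = F.normalize(torch.randn(d), dim=-1)
database = F.normalize(torch.randn(10_000, d), dim=-1)
top_idx = rerank(query, database, model)
```

The point the abstract stresses is cost: once the transform is predicted, adapting millions of embeddings reduces to a single matrix multiplication, which is why the method stays practical for large-scale retrieval or re-ranking.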
Related papers
- Composed Object Retrieval: Object-level Retrieval via Composed Expressions [71.47650333199628]
Composed Object Retrieval (COR) is a brand-new task that goes beyond image-level retrieval to achieve object-level precision.
We construct COR127K, the first large-scale COR benchmark, containing 127,166 retrieval triplets with various semantic transformations in 408 categories.
We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive visual-textual interaction, and region-level contrastive learning.
arXiv Detail & Related papers (2025-08-06T13:11:40Z) - Enhancing Multi-Image Question Answering via Submodular Subset Selection [16.66633426354087]
Large multimodal models (LMMs) achieve high performance on vision-language tasks involving a single image but struggle when presented with a collection of multiple images.
We propose an enhancement to the retriever framework introduced in the MIRAGE model using submodular subset selection techniques.
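As a loose illustration of submodular subset selection over candidate images, the sketch below greedily maximizes a facility-location function; the particular submodular objective, budget, and names are assumptions and may not match the paper.

```python
# Hedged sketch: greedy facility-location selection of a small, representative
# subset of candidate images. The paper's actual objective may differ.
import numpy as np

def facility_location_greedy(sims, budget):
    """sims: (n, n) pairwise image-image similarities; returns selected indices."""
    n = sims.shape[0]
    selected, covered = [], np.zeros(n)
    for _ in range(budget):
        # Marginal gain of adding candidate j: improvement in best coverage per item.
        gains = np.maximum(sims, covered[None, :]).sum(axis=1) - covered.sum()
        gains[selected] = -np.inf                 # never re-select an image
        j = int(np.argmax(gains))
        selected.append(j)
        covered = np.maximum(covered, sims[j])
    return selected

# Toy usage: cosine similarities between 20 candidate image embeddings.
rng = np.random.default_rng(0)
embs = rng.normal(size=(20, 256))
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
print(facility_location_greedy(embs @ embs.T, budget=4))
```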
arXiv Detail & Related papers (2025-05-15T17:41:52Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, with a dedicated pipeline designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
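The sketch below only illustrates the idea of compressing visual tokens to nested granularities by pooling; the actual MME architecture is not specified here, and the token counts and budgets are assumptions.

```python
# Hedged sketch: pool a sequence of visual tokens down to several nested
# granularities, in the spirit of a Matryoshka-style embedder.
import torch
import torch.nn.functional as F

def compress_tokens(tokens, budgets=(1, 4, 16)):
    """tokens: (n_tokens, d). Returns {budget: (budget, d)} of pooled tokens."""
    out = {}
    x = tokens.T.unsqueeze(0)                    # (1, d, n_tokens) for 1D pooling
    for b in budgets:
        pooled = F.adaptive_avg_pool1d(x, b)     # (1, d, b)
        out[b] = pooled.squeeze(0).T             # (b, d)
    return out

visual_tokens = torch.randn(196, 768)            # e.g. 14x14 patch tokens (assumed)
multi_scale = compress_tokens(visual_tokens)
print({k: v.shape for k, v in multi_scale.items()})
```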
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Efficient Visual State Space Model for Image Deblurring [83.57239834238035]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
arXiv Detail & Related papers (2024-05-23T09:13:36Z) - Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification [9.843214426749764]
We propose retrieval-enhanced visual prompt learning (RePrompt) to cache and reuse knowledge of downstream tasks.
During inference, our enhanced model can reference similar samples brought by retrieval to make more accurate predictions.
RePrompt attains state-of-the-art performance on a wide range of vision datasets.
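A rough sketch of the "reference similar samples at inference" idea: blend the model's prediction with a similarity-weighted vote over retrieved cached training features. The blending scheme, k, and names are assumptions, not RePrompt's actual retrieval-guided prompting.

```python
# Hedged sketch: kNN-augmented prediction over a cache of training features.
import numpy as np

def knn_augmented_scores(query_feat, model_logits, cache_feats, cache_labels,
                         n_classes, k=8, alpha=0.5):
    sims = cache_feats @ query_feat                      # cosine sims (features pre-normalized)
    nn_idx = np.argsort(-sims)[:k]
    knn_scores = np.zeros(n_classes)
    for i in nn_idx:                                     # similarity-weighted label vote
        knn_scores[cache_labels[i]] += max(sims[i], 0.0) # ignore negative similarities
    knn_scores /= max(knn_scores.sum(), 1e-8)
    probs = np.exp(model_logits - model_logits.max())
    probs /= probs.sum()
    return alpha * probs + (1 - alpha) * knn_scores      # blended class scores

# Toy usage: 100 cached samples, 5 classes, 128-d features.
rng = np.random.default_rng(1)
cache = rng.normal(size=(100, 128)); cache /= np.linalg.norm(cache, axis=1, keepdims=True)
labels = rng.integers(0, 5, size=100)
q = rng.normal(size=128); q /= np.linalg.norm(q)
print(knn_augmented_scores(q, rng.normal(size=5), cache, labels, n_classes=5).argmax())
```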
arXiv Detail & Related papers (2023-06-04T03:06:37Z) - Granularity-aware Adaptation for Image Retrieval over Multiple Tasks [30.505620321478688]
Grappa is an approach that starts from a strong pretrained model, and adapts it to tackle multiple retrieval tasks concurrently.
We reconcile all adaptor sets into a single unified model suited for all retrieval tasks by learning fusion layers.
Results on a benchmark composed of six heterogeneous retrieval tasks show that the unsupervised Grappa model improves the zero-shot performance of a state-of-the-art self-supervised learning model.
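A minimal sketch of reconciling several task-specific adaptors into one model with a learned fusion layer; the adaptor form (linear), the softmax weighting, and all names are assumptions rather than Grappa's actual design.

```python
# Hedged sketch: fuse the outputs of several adaptors with learned fusion weights.
import torch
import torch.nn as nn

class FusedAdaptors(nn.Module):
    def __init__(self, d, n_adaptors):
        super().__init__()
        self.adaptors = nn.ModuleList([nn.Linear(d, d) for _ in range(n_adaptors)])
        self.fusion_logits = nn.Parameter(torch.zeros(n_adaptors))   # learned fusion weights

    def forward(self, x):                                  # x: (batch, d) backbone features
        weights = torch.softmax(self.fusion_logits, dim=0)
        outs = torch.stack([a(x) for a in self.adaptors])  # (n_adaptors, batch, d)
        return (weights[:, None, None] * outs).sum(dim=0)  # single unified embedding

features = torch.randn(4, 512)                 # pretrained backbone features (512-d assumed)
model = FusedAdaptors(d=512, n_adaptors=6)     # one adaptor per retrieval task
print(model(features).shape)                   # torch.Size([4, 512])
```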
arXiv Detail & Related papers (2022-10-05T13:31:52Z) - Progressive Learning for Image Retrieval with Hybrid-Modality Queries [48.79599320198615]
Image retrieval with hybrid-modality queries is also known as composing text and image for image retrieval (CTI-IR).
We decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge for image retrieval with hybrid-modality queries.
Our proposed model significantly outperforms state-of-the-art methods in mean Recall@K by 24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets, respectively.
arXiv Detail & Related papers (2022-04-24T08:10:06Z) - Cross-Modal Retrieval Augmentation for Multi-Modal Classification [61.5253261560224]
We explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering.
First, we train a novel alignment model for embedding images and captions in the same space, which achieves substantial improvement on image-caption retrieval.
Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines.
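A minimal sketch of the standard recipe for such an alignment model: a symmetric contrastive (InfoNCE) loss over matched image-caption pairs. The temperature, batch handling, and encoder details are assumptions; the paper's exact objective may differ.

```python
# Hedged sketch: symmetric contrastive loss aligning image and caption embeddings.
import torch
import torch.nn.functional as F

def alignment_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (batch, d) from the two encoders, matched row-wise."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature                  # (batch, batch) similarity matrix
    targets = torch.arange(img.size(0))                 # i-th image matches i-th caption
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage with stand-in encoder outputs.
img_feats = torch.randn(32, 256, requires_grad=True)
txt_feats = torch.randn(32, 256, requires_grad=True)
loss = alignment_loss(img_feats, txt_feats)
loss.backward()   # would update both encoders in a real training loop
```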
arXiv Detail & Related papers (2021-04-16T13:27:45Z) - Graph Sampling Based Deep Metric Learning for Generalizable Person Re-Identification [114.56752624945142]
We argue that the most popular random sampling method, the well-known PK sampler, is neither informative nor efficient for deep metric learning.
We propose an efficient mini-batch sampling method called Graph Sampling (GS) for large-scale metric learning.
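A rough sketch of graph-based batch sampling: build a nearest-neighbour graph over class prototypes and form each mini-batch from an anchor class plus its most similar classes, yielding harder batches than random PK sampling. The neighbourhood size, samples per class, and prototype construction here are assumptions.

```python
# Hedged sketch: graph sampling of mini-batches for deep metric learning.
import numpy as np

def build_class_graph(class_protos, top_k):
    protos = class_protos / np.linalg.norm(class_protos, axis=1, keepdims=True)
    sims = protos @ protos.T
    np.fill_diagonal(sims, -np.inf)                  # exclude self-similarity
    return np.argsort(-sims, axis=1)[:, :top_k]      # (n_classes, top_k) neighbour classes

def sample_batch(neighbour_graph, samples_per_class, class_to_indices, rng):
    anchor = rng.integers(len(neighbour_graph))
    batch_classes = [anchor, *neighbour_graph[anchor]]
    batch = []
    for c in batch_classes:
        idx = class_to_indices[c]
        batch.extend(rng.choice(idx, size=samples_per_class,
                                replace=len(idx) < samples_per_class))
    return batch

# Toy usage: 50 classes, 4 images per class, 64-d class prototypes.
rng = np.random.default_rng(0)
protos = rng.normal(size=(50, 64))
class_to_indices = {c: np.arange(c * 4, (c + 1) * 4) for c in range(50)}
graph = build_class_graph(protos, top_k=7)
print(sample_batch(graph, samples_per_class=4, class_to_indices=class_to_indices, rng=rng))
```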
arXiv Detail & Related papers (2021-04-04T06:44:15Z) - Instance-level Image Retrieval using Reranking Transformers [18.304597755595697]
Instance-level image retrieval is the task of searching in a large database for images that match an object in a query image.
We propose Reranking Transformers (RRTs) as a general model to incorporate both local and global features to rerank the matching images.
RRTs are lightweight and can be easily parallelized so that reranking a set of top matching results can be performed in a single forward-pass.
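A toy sketch of a reranking transformer that scores a query-candidate pair from their global and local descriptors in a single forward pass; the token layout, model sizes, and training objective are assumptions and not the RRT specifics.

```python
# Hedged sketch: a tiny transformer scores a (query, candidate) image pair.
import torch
import torch.nn as nn

class TinyReranker(nn.Module):
    def __init__(self, d=128, n_layers=2, n_heads=4):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, d))
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.score = nn.Linear(d, 1)

    def forward(self, q_global, q_local, c_global, c_local):
        # Concatenate [CLS], both global descriptors, and both sets of local descriptors.
        b = q_global.size(0)
        tokens = torch.cat([self.cls.expand(b, -1, -1),
                            q_global[:, None], c_global[:, None],
                            q_local, c_local], dim=1)
        out = self.encoder(tokens)
        return self.score(out[:, 0]).squeeze(-1)     # relevance score from the [CLS] token

model = TinyReranker()
score = model(torch.randn(8, 128), torch.randn(8, 32, 128),
              torch.randn(8, 128), torch.randn(8, 32, 128))
print(score.shape)   # torch.Size([8]) -- one score per query/candidate pair
```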
arXiv Detail & Related papers (2021-03-22T23:58:38Z) - FeatMatch: Feature-Based Augmentation for Semi-Supervised Learning [64.32306537419498]
We propose a novel learned feature-based refinement and augmentation method that produces a varied set of complex transformations.
These transformations also use information from both within-class and across-class representations that we extract through clustering.
We demonstrate that our method is comparable to the current state of the art on smaller datasets while being able to scale up to larger datasets.
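A simplified sketch of feature-space augmentation with cluster prototypes: a sample's feature is perturbed toward within-class or across-class prototypes obtained by k-means. FeatMatch learns this refinement with attention; the interpolation below is only a stand-in for the idea.

```python
# Hedged sketch: prototype-based feature augmentation via k-means clustering.
import numpy as np
from sklearn.cluster import KMeans

def prototype_augment(feat, prototypes, strength=0.2, rng=None):
    """Pull `feat` slightly toward a randomly chosen cluster prototype."""
    rng = rng or np.random.default_rng()
    proto = prototypes[rng.integers(len(prototypes))]
    lam = rng.uniform(0, strength)
    return (1 - lam) * feat + lam * proto

# Toy usage: cluster features into prototypes, then augment one feature.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 64))
prototypes = KMeans(n_clusters=10, n_init=10, random_state=0).fit(features).cluster_centers_
augmented = prototype_augment(features[0], prototypes, rng=rng)
```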
arXiv Detail & Related papers (2020-07-16T17:55:31Z) - Selecting Relevant Features from a Multi-domain Representation for Few-shot Classification [91.67977602992657]
We propose a new strategy based on feature selection, which is both simpler and more effective than previous feature adaptation approaches.
We show that a simple non-parametric classifier built on top of such features produces high accuracy and generalizes to domains never seen during training.
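A minimal sketch of the "feature selection plus simple non-parametric classifier" recipe: keep a subset of feature dimensions chosen on the support set (here by a variance heuristic, which is an assumption) and classify queries with nearest centroids.

```python
# Hedged sketch: dimension selection followed by a nearest-centroid classifier.
import numpy as np

def select_dims(support_feats, n_keep):
    """Keep the dimensions with the highest variance across the support set."""
    variances = support_feats.var(axis=0)
    return np.argsort(-variances)[:n_keep]

def nearest_centroid_predict(support_feats, support_labels, query_feats, dims):
    classes = np.unique(support_labels)
    centroids = np.stack([support_feats[support_labels == c][:, dims].mean(axis=0)
                          for c in classes])
    dists = np.linalg.norm(query_feats[:, dims][:, None] - centroids[None], axis=-1)
    return classes[dists.argmin(axis=1)]

# Toy 5-way 5-shot episode with 512-d multi-domain features.
rng = np.random.default_rng(0)
support = rng.normal(size=(25, 512)); labels = np.repeat(np.arange(5), 5)
queries = rng.normal(size=(10, 512))
dims = select_dims(support, n_keep=128)
print(nearest_centroid_predict(support, labels, queries, dims))
```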
arXiv Detail & Related papers (2020-03-20T15:44:17Z) - CBIR using features derived by Deep Learning [0.0]
In a Content-Based Image Retrieval (CBIR) system, the task is to retrieve images similar to a given query image from a large database.
We propose to use features derived from a pre-trained deep convolutional network trained on a large image classification problem.
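A minimal sketch of the classic deep-feature CBIR pipeline the entry describes: embed images with a pretrained classification CNN and rank the database by cosine similarity. The torchvision ResNet-50 backbone and preprocessing are assumptions, not the paper's exact setup.

```python
# Hedged sketch: pretrained-CNN features + cosine-similarity retrieval.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V2
backbone = resnet50(weights=weights)
backbone.fc = torch.nn.Identity()          # keep the 2048-d pooled features
backbone.eval()
preprocess = weights.transforms()

@torch.no_grad()
def embed(images):                          # images: list of PIL.Image
    batch = torch.stack([preprocess(im) for im in images])
    return F.normalize(backbone(batch), dim=-1)

def retrieve(query_feat, db_feats, k=10):
    scores = db_feats @ query_feat          # cosine similarity (unit-norm features)
    return scores.topk(k).indices
```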
arXiv Detail & Related papers (2020-02-13T21:26:32Z)