Prompt-Guided Attention Head Selection for Focus-Oriented Image Retrieval
- URL: http://arxiv.org/abs/2504.01348v1
- Date: Wed, 02 Apr 2025 04:33:27 GMT
- Title: Prompt-Guided Attention Head Selection for Focus-Oriented Image Retrieval
- Authors: Yuji Nozawa, Yu-Chieh Lin, Kazumoto Nakamura, Youyang Ng
- Abstract summary: We propose Prompt-guided attention Head Selection (PHS) to leverage the head-wise potential of the multi-head attention mechanism in Vision Transformers (ViT). PHS selects specific attention heads by matching their attention maps with a user's visual prompt, such as a point, box, or segmentation. PHS substantially improves performance on multiple datasets, offering a practical and training-free solution to enhance model performance in the Focus-Oriented Image Retrieval (FOIR) task.
- Score: 1.3905735045377272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of this paper is to enhance pretrained Vision Transformer (ViT) models for focus-oriented image retrieval with visual prompting. In real-world image retrieval scenarios, both query and database images are often complex, containing multiple objects and intricate backgrounds. Users often want to retrieve images containing a specific object, a setting we define as the Focus-Oriented Image Retrieval (FOIR) task. While a standard image encoder can be employed to extract image features for similarity matching, it may not perform optimally in the multi-object FOIR task, because each image is represented by a single global feature vector. To overcome this, a prompt-based image retrieval solution is required. We propose Prompt-guided attention Head Selection (PHS), which leverages the head-wise potential of the multi-head attention mechanism in ViT in a promptable manner. PHS selects specific attention heads by matching their attention maps with a user's visual prompt, such as a point, box, or segmentation. This empowers the model to focus on the specific object of interest while preserving the surrounding visual context. Notably, PHS requires no model re-training and avoids any image alteration. Experimental results show that PHS substantially improves performance on multiple datasets, offering a practical and training-free solution to enhance model performance in the FOIR task.
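To make the head-selection idea concrete, below is a minimal sketch of prompt-guided head scoring. This is not the authors' implementation: the scoring rule (the fraction of each head's CLS-to-patch attention mass that falls inside the prompt region) and all names such as `head_prompt_scores` and `select_heads` are illustrative assumptions.

```python
# Hypothetical sketch of prompt-guided attention head selection (PHS-style).
# Not the paper's code; the scoring rule and shapes are assumptions.
import torch

def head_prompt_scores(attn_maps: torch.Tensor, prompt_mask: torch.Tensor) -> torch.Tensor:
    """Score each head by the attention mass it places on the prompt region.

    attn_maps:   (num_heads, num_patches) CLS-to-patch attention per head.
    prompt_mask: (num_patches,) binary mask of patches covered by the user's
                 point, box, or segmentation prompt.
    """
    inside = (attn_maps * prompt_mask).sum(dim=-1)  # attention mass on prompt
    total = attn_maps.sum(dim=-1).clamp_min(1e-8)   # total attention mass
    return inside / total

def select_heads(attn_maps: torch.Tensor, prompt_mask: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Return indices of the k heads whose attention best matches the prompt."""
    return torch.topk(head_prompt_scores(attn_maps, prompt_mask), k=k).indices

# Toy example: a 12-head ViT layer with a 14x14 patch grid and a box prompt.
attn = torch.rand(12, 196).softmax(dim=-1)  # stand-in for real attention maps
mask = torch.zeros(14, 14)
mask[:4, :4] = 1.0                          # box prompt rasterized onto patches
print("selected heads:", select_heads(attn, mask.flatten()).tolist())
```

In a full pipeline, the features attended by the selected heads would form the image descriptor used for similarity matching, consistent with the training-free, no-image-alteration properties described above.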
Related papers
- Find your Needle: Small Object Image Retrieval via Multi-Object Attention Optimization [5.2337753974570616]
We address the challenge of Small Object Image Retrieval (SoIR), where the goal is to retrieve images containing a specific small object in a cluttered scene. A key challenge is constructing a single image descriptor, for scalable and efficient search, that effectively represents all objects in the image. We introduce Multi-object Attention Optimization (MaO), a novel retrieval framework which incorporates a dedicated multi-object pre-training phase.
arXiv Detail & Related papers (2025-03-10T08:27:02Z) - Ranking-aware adapter for text-driven image ordering with CLIP [76.80965830448781]
We propose an effective yet efficient approach that reframes the CLIP model into a learning-to-rank task.
Our approach incorporates learnable prompts to adapt to new instructions for ranking purposes.
Our ranking-aware adapter consistently outperforms fine-tuned CLIP models on various tasks (a hedged sketch of the adapter idea appears after this list).
arXiv Detail & Related papers (2024-12-09T18:51:05Z) - Unifying Image Processing as Visual Prompting Question Answering [62.84955983910612]
Image processing is a fundamental task in computer vision, which aims at enhancing image quality and extracting essential features for subsequent vision applications.
Traditionally, task-specific models are developed for individual tasks, and designing such models requires distinct expertise.
We propose a universal model for general image processing that covers image restoration, image enhancement, and image feature extraction tasks.
arXiv Detail & Related papers (2023-10-16T15:32:57Z) - Top-Down Visual Attention from Analysis by Synthesis [87.47527557366593]
We consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision.
We propose the Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention.
arXiv Detail & Related papers (2023-03-23T05:17:05Z) - Toward an ImageNet Library of Functions for Global Optimization Benchmarking [0.0]
This study proposes to transform the identification problem into an image recognition problem, with the potential to detect conception-free, machine-driven landscape features.
We address it as a supervised multi-class image recognition problem and apply basic artificial neural network models to solve it.
This evident successful learning is another step toward automated feature extraction and local structure deduction of black-box optimization (BBO) problems.
arXiv Detail & Related papers (2022-06-27T21:05:00Z) - Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z) - Tasks Integrated Networks: Joint Detection and Retrieval for Image Search [99.49021025124405]
In many real-world searching scenarios (e.g., video surveillance), the objects are seldom accurately detected or annotated.
We first introduce an end-to-end Integrated Net (I-Net), which has three merits.
We further propose an improved I-Net, called DC-I-Net, which makes two new contributions.
arXiv Detail & Related papers (2020-09-03T03:57:50Z) - Keypoint-Aligned Embeddings for Image Retrieval and Re-identification [15.356786390476591]
We propose to align the image embedding with a predefined order of the keypoints.
The proposed keypoint aligned embeddings model (KAE-Net) learns part-level features via multi-task learning.
It achieves state-of-the-art performance on the benchmark datasets CUB-200-2011, Cars196, and VeRi-776.
arXiv Detail & Related papers (2020-08-26T03:56:37Z) - Semantically Tied Paired Cycle Consistency for Any-Shot Sketch-based Image Retrieval [55.29233996427243]
Low-shot sketch-based image retrieval is an emerging task in computer vision.
In this paper, we address any-shot, i.e. zero-shot and few-shot, sketch-based image retrieval (SBIR) tasks.
For solving these tasks, we propose a semantically aligned cycle-consistent generative adversarial network (SEM-PCYC).
Our results demonstrate a significant boost in any-shot performance over the state of the art on the extended versions of the Sketchy, TU-Berlin, and QuickDraw datasets.
arXiv Detail & Related papers (2020-06-20T22:43:53Z)
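As referenced in the ranking-aware adapter entry above, here is a hedged sketch of that general idea: a small residual adapter over frozen CLIP image features trained with a pairwise margin ranking loss. It is not the paper's code; the adapter shape, the loss, and the random stand-ins for CLIP features are assumptions, and the paper's learnable prompts are omitted.

```python
# Hedged sketch of a ranking-aware adapter on frozen CLIP-style features.
# Not the paper's implementation; all shapes and names are illustrative.
import torch
import torch.nn as nn

class RankingAdapter(nn.Module):
    """Residual MLP that re-projects frozen image features so that their
    similarity to a text instruction orders images as desired."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return feats + self.proj(feats)  # residual adaptation

def pairwise_rank_loss(sim: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """sim is sorted by ground-truth rank (best first); penalize inversions."""
    diffs = sim[:-1] - sim[1:]                    # should all be >= margin
    return torch.clamp(margin - diffs, min=0).mean()

# Toy training step with random stand-ins for frozen image/text features.
adapter = RankingAdapter(dim=512)
img = torch.randn(8, 512)                         # 8 images in true rank order
txt = torch.randn(512)                            # one ranking instruction
sim = adapter(img) @ txt / txt.norm()             # similarity scores
loss = pairwise_rank_loss(sim)
loss.backward()                                   # only adapter weights train
print(float(loss))
```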