FOR: Finetuning for Object Level Open Vocabulary Image Retrieval
- URL: http://arxiv.org/abs/2412.18806v1
- Date: Wed, 25 Dec 2024 07:08:51 GMT
- Title: FOR: Finetuning for Object Level Open Vocabulary Image Retrieval
- Authors: Hila Levi, Guy Heller, Dan Levi,
- Abstract summary: We propose FOR: Finetuning for Object-centric Open-vocabulary Image Retrieval, which allows finetuning on a target dataset using closed-set labels.
FOR is based on two design elements: a specialized decoder variant of the CLIP head customized for the intended task, and its coupling within a multi-language training framework.
- Score: 1.0650780147044159
- License:
- Abstract: As working with large datasets becomes standard, the task of accurately retrieving images containing objects of interest by an open set textual query gains practical importance. The current leading approach utilizes a pre-trained CLIP model without any adaptation to the target domain, balancing accuracy and efficiency through additional post-processing. In this work, we propose FOR: Finetuning for Object-centric Open-vocabulary Image Retrieval, which allows finetuning on a target dataset using closed-set labels while keeping the visual-language association crucial for open vocabulary retrieval. FOR is based on two design elements: a specialized decoder variant of the CLIP head customized for the intended task, and its coupling within a multi-objective training framework. Together, these design choices result in a significant increase in accuracy, showcasing improvements of up to 8 mAP@50 points over SoTA across three datasets. Additionally, we demonstrate that FOR is also effective in a semi-supervised setting, achieving impressive results even when only a small portion of the dataset is labeled.
Related papers
- Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation [47.047267066525265]
We introduce a novel approach that incorporates object-level contextual knowledge within images.
Our proposed approach achieves state-of-the-art performance with strong generalizability across diverse datasets.
arXiv Detail & Related papers (2024-11-26T06:34:48Z) - Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations [1.1650821883155187]
We propose Contrastive $lambda$-Repformer, which predicts task success for table-top manipulation tasks by aligning images with instruction sentences.
Our method integrates the following three key types of features into a multi-level aligned representation.
We evaluate Contrastive $lambda$-Repformer on a dataset based on a large-scale standard dataset, the RT-1 dataset, and on a physical robot platform.
arXiv Detail & Related papers (2024-10-01T06:35:34Z) - DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection [111.68263493302499]
We introduce DetCLIPv3, a high-performing detector that excels at both open-vocabulary object detection and hierarchical labels.
DetCLIPv3 is characterized by three core designs: 1) Versatile model architecture; 2) High information density data; and 3) Efficient training strategy.
DetCLIPv3 demonstrates superior open-vocabulary detection performance, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively.
arXiv Detail & Related papers (2024-04-14T11:01:44Z) - TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection [23.73648235283315]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
arXiv Detail & Related papers (2024-03-12T22:33:02Z) - VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for PVM finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x less training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z) - Weakly Supervised Open-Vocabulary Object Detection [31.605276665964787]
We propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD.
To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment.
arXiv Detail & Related papers (2023-12-19T18:59:53Z) - Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS)
We construct a large-scale complex scene dataset (textbfOVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z) - Learning Customized Visual Models with Retrieval-Augmented Knowledge [104.05456849611895]
We propose REACT, a framework to acquire the relevant web knowledge to build customized visual models for target domains.
We retrieve the most relevant image-text pairs from the web-scale database as external knowledge, and propose to customize the model by only training new modualized blocks while freezing all the original weights.
The effectiveness of REACT is demonstrated via extensive experiments on classification, retrieval, detection and segmentation tasks, including zero, few, and full-shot settings.
arXiv Detail & Related papers (2023-01-17T18:59:06Z) - Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z) - FAIRS -- Soft Focus Generator and Attention for Robust Object
Segmentation from Extreme Points [70.65563691392987]
We present a new approach to generate object segmentation from user inputs in the form of extreme points and corrective clicks.
We demonstrate our method's ability to generate high-quality training data as well as its scalability in incorporating extreme points, guiding clicks, and corrective clicks in a principled manner.
arXiv Detail & Related papers (2020-04-04T22:25:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.