Test-time Vocabulary Adaptation for Language-driven Object Detection
- URL: http://arxiv.org/abs/2506.00333v1
- Date: Sat, 31 May 2025 01:15:29 GMT
- Title: Test-time Vocabulary Adaptation for Language-driven Object Detection
- Authors: Mingxuan Liu, Tyler L. Hayes, Massimiliano Mancini, Elisa Ricci, Riccardo Volpi, Gabriela Csurka
- Abstract summary: We propose a plug-and-play Vocabulary Adapter (VocAda) to refine the user-defined vocabulary. VocAda does not require any training; it operates at inference time in three steps. Experiments on COCO and Objects365 with three state-of-the-art detectors show that VocAda consistently improves performance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary object detection models allow users to freely specify a class vocabulary in natural language at test time, guiding the detection of desired objects. However, vocabularies can be overly broad or even mis-specified, hampering the overall performance of the detector. In this work, we propose a plug-and-play Vocabulary Adapter (VocAda) to refine the user-defined vocabulary, automatically tailoring it to categories that are relevant for a given image. VocAda does not require any training; it operates at inference time in three steps: i) it uses an image captioner to describe visible objects, ii) it parses nouns from those captions, and iii) it selects relevant classes from the user-defined vocabulary, discarding irrelevant ones. Experiments on COCO and Objects365 with three state-of-the-art detectors show that VocAda consistently improves performance, proving its versatility. The code is open source.
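The three inference-time steps described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `caption_image` stands in for whatever off-the-shelf captioner is used, and the noun parsing here is a naive stopword filter where a real pipeline would use a POS tagger; all function names are hypothetical.

```python
import re

def caption_image(image):
    # Step i) Describe visible objects. Stub for an off-the-shelf image
    # captioner (the paper does not tie VocAda to one specific model).
    return "a dog and a person sit on a bench near a parked car"

def parse_nouns(caption):
    # Step ii) Extract candidate nouns from the caption. A real system
    # would run a POS tagger; here we just drop common non-noun tokens.
    non_nouns = {"a", "an", "the", "and", "on", "near", "sit", "parked"}
    tokens = re.findall(r"[a-z]+", caption.lower())
    return {t for t in tokens if t not in non_nouns}

def adapt_vocabulary(user_vocab, image):
    # Step iii) Keep only user-defined classes supported by the caption,
    # discarding classes irrelevant to this particular image.
    visible_nouns = parse_nouns(caption_image(image))
    return [c for c in user_vocab if c.lower() in visible_nouns]

vocab = ["dog", "person", "bench", "car", "giraffe", "airplane"]
print(adapt_vocabulary(vocab, image=None))  # → ['dog', 'person', 'bench', 'car']
```

The key property, as stated in the abstract, is that this refinement is training-free and plug-and-play: the reduced vocabulary is simply handed to any open-vocabulary detector at inference time.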
Related papers
- From Open-Vocabulary to Vocabulary-Free Semantic Segmentation [78.62232202171919]
Open-vocabulary semantic segmentation enables models to identify novel object categories beyond their training data. Current approaches still rely on manually specified class names as input, creating an inherent bottleneck in real-world applications. This work proposes a Vocabulary-Free Semantic pipeline, eliminating the need for predefined class vocabularies.
arXiv Detail & Related papers (2025-02-17T15:17:08Z) - 3D-AVS: LiDAR-based 3D Auto-Vocabulary Segmentation [20.7179907935644]
3D-AVS is a method for auto-vocabulary segmentation of 3D point clouds, where the vocabulary is unknown and auto-generated for each input at runtime. 3D-AVS first recognizes semantic entities from image or point cloud data and then segments all points with the automatically generated vocabulary. The method incorporates both image-based and point-based recognition, enhancing robustness under challenging lighting conditions.
arXiv Detail & Related papers (2024-06-13T13:59:47Z) - Generative Region-Language Pretraining for Open-Ended Object Detection [55.42484781608621]
We propose a framework named GenerateU, which can detect dense objects and generate their names in a free-form way.
Our framework achieves comparable results to the open-vocabulary object detection method GLIP.
arXiv Detail & Related papers (2024-03-15T10:52:39Z) - Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD).
We adopt a standard two-stage object detector architecture.
We explore three ways via: language descriptions, image exemplars, or a combination of the two.
arXiv Detail & Related papers (2023-06-08T18:31:56Z) - V3Det: Vast Vocabulary Visual Detection Dataset [69.50942928928052]
V3Det is a vast vocabulary visual detection dataset with precisely annotated bounding boxes on massive images.
By offering a vast exploration space, V3Det enables extensive benchmarks on both vast and open vocabulary object detection.
arXiv Detail & Related papers (2023-04-07T17:45:35Z) - Open-Vocabulary Object Detection using Pseudo Caption Labels [3.260777306556596]
We argue that more fine-grained labels are necessary to extract richer knowledge about novel objects.
Our best model trained on the de-duplicated VisualGenome dataset achieves an AP of 34.5 and an APr of 30.6, comparable to the state-of-the-art performance.
arXiv Detail & Related papers (2023-03-23T05:10:22Z) - Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.