Related papers: WeDetect: Fast Open-Vocabulary Object Detection as Retrieval

URL: http://arxiv.org/abs/2512.12309v1
Date: Sat, 13 Dec 2025 12:40:28 GMT
Title: WeDetect: Fast Open-Vocabulary Object Detection as Retrieval
Authors: Shenghao Fu, Yukun Su, Fengyun Rao, Jing Lyu, Xiaohua Xie, Wei-Shi Zheng
Abstract summary: Open-vocabulary object detection aims to detect arbitrary classes via text prompts. Methods without cross-modal fusion layers (non-fusion) offer faster inference by treating recognition as a retrieval problem. We develop a model family named WeDetect to achieve state-of-the-art performance across 15 benchmarks with high inference efficiency.
Score: 74.39703419628829
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Open-vocabulary object detection aims to detect arbitrary classes via text prompts. Methods without cross-modal fusion layers (non-fusion) offer faster inference by treating recognition as a retrieval problem, i.e., matching regions to text queries in a shared embedding space. In this work, we fully explore this retrieval philosophy and demonstrate its unique advantages in efficiency and versatility through a model family named WeDetect: (1) State-of-the-art performance. WeDetect is a real-time detector with a dual-tower architecture. We show that, with well-curated data and full training, the non-fusion WeDetect surpasses other fusion models and establishes a strong open-vocabulary foundation. (2) Fast backtrack of historical data. WeDetect-Uni is a universal proposal generator based on WeDetect. We freeze the entire detector and only finetune an objectness prompt to retrieve generic object proposals across categories. Importantly, the proposal embeddings are class-specific and enable a new application, object retrieval, which supports retrieving objects from historical data. (3) Integration with LMMs for referring expression comprehension (REC). We further propose WeDetect-Ref, an LMM-based object classifier to handle complex referring expressions, which retrieves target objects from the proposal list extracted by WeDetect-Uni. It discards next-token prediction and classifies objects in a single forward pass. Together, the WeDetect family unifies detection, proposal generation, object retrieval, and REC under a coherent retrieval framework, achieving state-of-the-art performance across 15 benchmarks with high inference efficiency.
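
The "recognition as retrieval" idea in the abstract (matching regions to text queries in a shared embedding space) can be illustrated with a minimal sketch. This is not the WeDetect implementation: the function names, the toy 3-dimensional embeddings, and the similarity threshold are all assumptions made for illustration. A real non-fusion detector would obtain the region embeddings from its image tower and the class embeddings from its text tower.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def retrieve_labels(region_embs, text_embs, class_names, threshold=0.5):
    """Assign each region the class whose text embedding is most similar.

    Hypothetical helper: regions whose best cosine similarity falls below
    `threshold` are discarded as background.
    """
    texts = [l2_normalize(t) for t in text_embs]
    results = []
    for idx, region in enumerate(region_embs):
        r = l2_normalize(region)
        # Cosine similarity between this region and every text query.
        scores = [sum(a * b for a, b in zip(r, t)) for t in texts]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            results.append((idx, class_names[best], scores[best]))
    return results


# Toy embeddings (assumed values, not model outputs).
regions = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.3, 0.3, 0.3]]
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
names = ["cat", "dog"]

detections = retrieve_labels(regions, texts, names, threshold=0.8)
for idx, name, score in detections:
    print(f"region {idx}: {name} ({score:.3f})")
```

Because the text embeddings do not depend on the image, they can be computed once and reused, and region embeddings can be stored and queried later with new text prompts. That decoupling is what makes retrieval over previously processed data possible in this kind of design.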

Related papers

Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition [71.5328300638085]
Zero-shot human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. Existing methods, including two-stage methods, tightly couple interaction recognition (IR) with a specific detector. We propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR.
arXiv Detail & Related papers (2026-02-16T19:01:31Z)
QueryCraft: Transformer-Guided Query Initialization for Enhanced Human-Object Interaction Detection [7.030364980618468]
We propose a novel plug-and-play HOI detection framework that incorporates semantic priors and guided feature learning. Our method achieves state-of-the-art performance and strong generalization on the HICO-Det and V-COCO benchmarks.
arXiv Detail & Related papers (2025-08-12T03:11:16Z)
Generative Region-Language Pretraining for Open-Ended Object Detection [55.42484781608621]
We propose a framework named GenerateU, which can detect dense objects and generate their names in a free-form way. Our framework achieves comparable results to the open-vocabulary object detection method GLIP.
arXiv Detail & Related papers (2024-03-15T10:52:39Z)
Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g., document, passage, or sentence. We introduce a novel retrieval unit, the proposition, for dense retrieval. Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding [8.448399308205266]
We introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects. We further enhance our investigation by evaluating several state-of-the-art open-vocabulary object detectors using the proposed protocol.
arXiv Detail & Related papers (2023-11-29T10:40:52Z)
UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query. Recent retrieval methods based on pre-trained language models (PLMs) can be coarsely categorized into dense-vector or lexicon-based paradigms. We propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
FindIt: Generalized Localization with Natural Language Queries [43.07139534653485]
FindIt is a simple and versatile framework that unifies a variety of visual grounding and localization tasks. Key to our architecture is an efficient multi-scale fusion module that unifies the disparate localization requirements. Our end-to-end trainable framework responds flexibly and accurately to a wide range of referring expression, localization or detection queries.
arXiv Detail & Related papers (2022-03-31T17:59:30Z)
Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation. We introduce a novel approach for more accurate and efficient unseen-temporal segmentation. We evaluate the proposed approach on DAVIS$_{17}$ and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods both in segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
