LLM-Guided Agentic Object Detection for Open-World Understanding
- URL: http://arxiv.org/abs/2507.10844v1
- Date: Mon, 14 Jul 2025 22:30:48 GMT
- Title: LLM-Guided Agentic Object Detection for Open-World Understanding
- Authors: Furkan Mumcu, Michael J. Jones, Anoop Cherian, Yasin Yilmaz
- Abstract summary: Object detection traditionally relies on fixed category sets, requiring costly re-training to handle novel objects. We propose an LLM-guided agentic object detection framework that enables fully label-free, zero-shot detection. Our method offers enhanced autonomy and adaptability for open-world understanding.
- Score: 45.08126325125808
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Object detection traditionally relies on fixed category sets, requiring costly re-training to handle novel objects. While Open-World and Open-Vocabulary Object Detection (OWOD and OVOD) improve flexibility, OWOD lacks semantic labels for unknowns, and OVOD depends on user prompts, limiting autonomy. We propose an LLM-guided agentic object detection (LAOD) framework that enables fully label-free, zero-shot detection by prompting a Large Language Model (LLM) to generate scene-specific object names. These are passed to an open-vocabulary detector for localization, allowing the system to adapt its goals dynamically. We introduce two new metrics, Class-Agnostic Average Precision (CAAP) and Semantic Naming Average Precision (SNAP), to separately evaluate localization and naming. Experiments on LVIS, COCO, and COCO-OOD validate our approach, showing strong performance in detecting and naming novel objects. Our method offers enhanced autonomy and adaptability for open-world understanding.
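The abstract describes a two-stage loop: an LLM proposes scene-specific object names, which are then handed to an open-vocabulary detector for localization. A minimal sketch of that control flow is below; the function names, the stub LLM, and the stub detector are all hypothetical stand-ins, not the paper's actual implementation.

```python
from typing import Callable, Any

def laod_pipeline(image: Any,
                  propose_names: Callable[[Any], list[str]],
                  detect: Callable[[Any, list[str]], list[dict]]) -> list[dict]:
    """Label-free detection sketch: an LLM proposes scene-specific class
    names, then an open-vocabulary detector localizes them."""
    # Step 1: prompt the LLM for object names likely present in this scene.
    candidate_names = propose_names(image)
    # Step 2: pass the generated vocabulary to an open-vocabulary detector.
    return detect(image, candidate_names)

# Stand-in stubs so the sketch runs without real models (hypothetical).
def fake_llm(image):
    return ["person", "bicycle", "traffic light"]

def fake_ovd(image, names):
    # A real detector would return boxes for any of the proposed names.
    return [{"label": names[0], "box": (10, 20, 50, 80), "score": 0.9}]

result = laod_pipeline(None, fake_llm, fake_ovd)
```

The point of the sketch is that the detector's vocabulary is decided per image by the LLM, so no user prompt or fixed label set is needed.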
Related papers
- Beyond General Prompts: Automated Prompt Refinement using Contrastive Class Alignment Scores for Disambiguating Objects in Vision-Language Models [0.0]
We introduce a method for automated prompt refinement using a novel metric called the Contrastive Class Alignment Score (CCAS). Our method generates diverse prompt candidates via a large language model and filters them through CCAS, computed using prompt embeddings from a sentence transformer. We evaluate our approach on challenging object categories, demonstrating that our automatic selection of high-precision prompts improves object detection accuracy without the need for model training or labeled data.
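The summary above describes scoring prompt embeddings contrastively against class embeddings. The abstract does not give the CCAS formula, so the sketch below uses a plausible stand-in (similarity to the target class minus the best similarity to distractor classes) over toy random vectors in place of sentence-transformer embeddings; both the scoring function and the data are assumptions for illustration.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ccas_like_score(prompt_emb, target_emb, distractor_embs) -> float:
    # Contrastive alignment (hypothetical form): reward closeness to the
    # target class, penalize closeness to confusable classes.
    target_sim = cosine(prompt_emb, target_emb)
    distractor_sim = max(cosine(prompt_emb, d) for d in distractor_embs)
    return target_sim - distractor_sim

rng = np.random.default_rng(0)
target = rng.normal(size=8)                         # toy class embedding
distractors = [rng.normal(size=8) for _ in range(3)]
candidates = [
    target + 0.1 * rng.normal(size=8),              # well-aligned prompt
    distractors[0] + 0.1 * rng.normal(size=8),      # confusable prompt
]
scores = [ccas_like_score(c, target, distractors) for c in candidates]
best = candidates[int(np.argmax(scores))]           # keep the high-scoring prompt
```

Filtering then simply keeps the top-scoring candidates, which is the selection step the summary refers to.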
arXiv Detail & Related papers (2025-05-14T04:43:36Z) - From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects [0.6262268096839562]
Recent works on open vocabulary object detection (OVD) enable the detection of objects defined by an in-principle unbounded vocabulary. OVD relies on accurate prompts provided by an "oracle", which limits its use in critical applications such as driving scene perception. We propose a framework that enables OVD models to operate in open-world settings by identifying and incrementally learning previously unseen objects.
arXiv Detail & Related papers (2024-11-27T10:33:51Z) - Fine-Grained Open-Vocabulary Object Recognition via User-Guided Segmentation [1.590984668118904]
We propose a novel foundation model-based detection method called FOCUS: Finegrained Open-Vocabulary Object ReCognition via User-Guided Segmentation.
arXiv Detail & Related papers (2024-11-23T18:13:27Z) - Semi-supervised Open-World Object Detection [74.95267079505145]
We introduce a more realistic formulation, named semi-supervised open-world detection (SS-OWOD).
We demonstrate that the performance of the state-of-the-art OWOD detector dramatically deteriorates in the proposed SS-OWOD setting.
Our experiments on 4 datasets including MS COCO, PASCAL, Objects365 and DOTA demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-02-25T07:12:51Z) - Weakly Supervised Open-Vocabulary Object Detection [31.605276665964787]
We propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD.
To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment.
arXiv Detail & Related papers (2023-12-19T18:59:53Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
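PCA-based localization as summarized above can be sketched as projecting per-pixel features onto their first principal component and thresholding the result. The sketch below uses a toy feature map and a simple sign-based threshold; the exact thresholding and feature source in the paper may differ, so treat this as an assumed, minimal illustration of the idea.

```python
import numpy as np

def pca_localize(feature_map: np.ndarray) -> np.ndarray:
    """Project per-pixel features onto their first principal component and
    threshold to obtain a coarse foreground mask."""
    h, w, c = feature_map.shape
    feats = feature_map.reshape(-1, c)
    feats = feats - feats.mean(axis=0)          # center the features
    # First principal component via SVD of the centered feature matrix.
    _, _, vt = np.linalg.svd(feats, full_matrices=False)
    projection = feats @ vt[0]                  # per-pixel 1-D projection
    # Orient the component so the (smaller) foreground side is positive.
    if (projection > 0).sum() > projection.size / 2:
        projection = -projection
    return (projection > 0).reshape(h, w)

# Toy feature map: a distinctive "object" patch on a flat background.
fmap = np.zeros((8, 8, 4))
fmap[2:5, 2:5] = 1.0
mask = pca_localize(fmap)
```

On this toy input the first component separates the patch from the background, so the mask recovers exactly the 3x3 object region.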
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection [17.766859354014663]
Open-world object detection requires a model trained from data on known objects to detect both known and unknown objects.
We propose a novel solution called CAT: LoCalization and IdentificAtion Cascade Detection Transformer.
We show that our model outperforms the state-of-the-art in terms of all metrics in the task of OWOD, incremental object detection (IOD) and open-set detection.
arXiv Detail & Related papers (2023-01-05T09:11:16Z) - Exploiting Unlabeled Data with Vision and Language Models for Object Detection [64.94365501586118]
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection and semi-supervised object detection.
arXiv Detail & Related papers (2022-07-18T21:47:15Z) - UDA-COPE: Unsupervised Domain Adaptation for Category-level Object Pose Estimation [84.16372642822495]
We propose an unsupervised domain adaptation (UDA) method for category-level object pose estimation, called UDA-COPE.
Inspired by the recent multi-modal UDA techniques, the proposed method exploits a teacher-student self-supervised learning scheme to train a pose estimation network without using target domain labels.
arXiv Detail & Related papers (2021-11-24T16:00:48Z) - Scope Head for Accurate Localization in Object Detection [135.9979405835606]
We propose a novel detector coined as ScopeNet, which models anchors of each location as a mutually dependent relationship.
With our concise and effective design, the proposed ScopeNet achieves state-of-the-art results on COCO.
arXiv Detail & Related papers (2020-05-11T04:00:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.