MQADet: A Plug-and-Play Paradigm for Enhancing Open-Vocabulary Object Detection via Multimodal Question Answering
- URL: http://arxiv.org/abs/2502.16486v1
- Date: Sun, 23 Feb 2025 07:59:39 GMT
- Title: MQADet: A Plug-and-Play Paradigm for Enhancing Open-Vocabulary Object Detection via Multimodal Question Answering
- Authors: Caixiong Li, Xiongwei Zhao, Jinhang Zhang, Xing Zhang, Zhou Wu
- Abstract summary: Existing open-vocabulary detectors are limited by complex visual-textual misalignment and long-tailed category imbalances. We introduce MQADet, a universal paradigm for enhancing existing open-vocabulary detectors. MQADet functions as a plug-and-play solution that integrates seamlessly with pre-trained object detectors.
- Score: 5.332289905373051
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-vocabulary detection (OVD) is the challenging task of detecting and classifying objects from an unrestricted set of categories, including those unseen during training. Existing open-vocabulary detectors are limited by complex visual-textual misalignment and long-tailed category imbalances, leading to suboptimal performance in challenging scenarios. To address these limitations, we introduce MQADet, a universal paradigm for enhancing existing open-vocabulary detectors by leveraging the cross-modal reasoning capabilities of multimodal large language models (MLLMs). MQADet functions as a plug-and-play solution that integrates seamlessly with pre-trained object detectors without substantial additional training costs. Specifically, we design a novel three-stage Multimodal Question Answering (MQA) pipeline to guide the MLLMs to precisely localize complex textual and visual targets while effectively enhancing the focus of existing object detectors on relevant objects. To validate our approach, we present a new benchmark for evaluating our paradigm on four challenging open-vocabulary datasets, employing three state-of-the-art object detectors as baselines. Experimental results demonstrate that our proposed paradigm significantly improves the performance of existing detectors, particularly in unseen complex categories, across diverse and challenging scenarios. To facilitate future research, we will publicly release our code.
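The authors' code is not yet released, so the following is only a minimal sketch of how such a plug-and-play MQA stage might wrap a frozen detector. The `mqa_detect` function, the `Detection` type, the prompts, and both callables are hypothetical; the three stages merely approximate the decompose-detect-verify flow suggested by the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Detection:
    box: tuple    # (x1, y1, x2, y2) in pixels
    label: str
    score: float

def mqa_detect(image,
               text_query: str,
               detector: Callable[[object, List[str]], List[Detection]],
               mllm: Callable[[object, str], str]) -> List[Detection]:
    """Hypothetical three-stage MQA wrapper around a frozen detector."""
    # Stage 1: ask the MLLM to decompose the free-form query into
    # concrete candidate category phrases.
    reply = mllm(image, f"List the object categories implied by '{text_query}' "
                        "as a comma-separated list.")
    categories = [c.strip() for c in reply.split(",") if c.strip()]

    # Stage 2: run the pre-trained open-vocabulary detector on the
    # MLLM-refined vocabulary -- no additional training involved.
    candidates = detector(image, categories)

    # Stage 3: ask the MLLM to verify each candidate box against the
    # original query and keep only confirmed targets.
    verified = []
    for det in candidates:
        answer = mllm(image, f"Does the region at {det.box} show "
                             f"'{text_query}'? Answer yes or no.")
        if answer.strip().lower().startswith("yes"):
            verified.append(det)
    return verified
```

Because the detector and MLLM are passed in as callables, any pre-trained pairing can be dropped in without retraining, which is the point of the plug-and-play design.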
Related papers
- Cross-domain Few-shot Object Detection with Multi-modal Textual Enrichment [21.36633828492347]
We introduce a meta-learning-based framework for Cross-Domain Multi-Modal Few-Shot Object Detection (CDMM-FSOD), designed to leverage rich textual semantics as an auxiliary modality for effective domain adaptation.
We evaluate the proposed method on common cross-domain object detection benchmarks and demonstrate that it significantly surpasses existing few-shot object detection approaches.
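The summary does not specify how the textual semantics are fused; as a generic, assumed illustration of text as an auxiliary modality, the sketch below late-fuses a region feature with a text embedding. The `enrich` helper and the 512-d feature size are invented for the example, not the paper's module.

```python
import numpy as np

def enrich(visual_feat: np.ndarray, text_feat: np.ndarray,
           alpha: float = 0.5) -> np.ndarray:
    """Late fusion of L2-normalized visual and textual features."""
    v = visual_feat / (np.linalg.norm(visual_feat) + 1e-8)
    t = text_feat / (np.linalg.norm(text_feat) + 1e-8)
    return alpha * v + (1.0 - alpha) * t

# Example: a 512-d region feature enriched with a sentence embedding
# of a (hypothetical) textual description of the category.
fused = enrich(np.random.randn(512), np.random.randn(512))
```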
arXiv Detail & Related papers (2025-02-23T06:59:22Z)
- From Objects to Events: Unlocking Complex Visual Understanding in Object Detectors via LLM-guided Symbolic Reasoning [71.41062111470414]
The proposed plug-and-play framework interfaces with any open-vocabulary detector.
At its core, our approach employs a symbolic regression mechanism that explores relationship patterns among detected entities (illustrated in the sketch after this summary).
We compared our training-free framework against specialized event recognition systems across diverse application domains.
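As a toy illustration of symbolic reasoning over detector outputs (the rule, types, and thresholds below are invented, not the paper's mechanism):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Det:
    box: tuple    # (x1, y1, x2, y2)
    label: str

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def near(a: Det, b: Det, thresh: float = 50.0) -> bool:
    (ax, ay), (bx, by) = center(a.box), center(b.box)
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 < thresh

def infer_event(dets: List[Det]) -> str:
    # One hand-written rule standing in for an LLM-proposed pattern:
    # a person close to a bicycle suggests a "riding" event.
    people = [d for d in dets if d.label == "person"]
    bikes = [d for d in dets if d.label == "bicycle"]
    if any(near(p, b) for p in people for b in bikes):
        return "riding"
    return "unknown"
```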
arXiv Detail & Related papers (2025-02-09T10:30:54Z)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
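The summary leaves the retrieval step abstract; the snippet below shows the standard cosine-similarity retrieval such a framework could place in front of an MLLM. The function name and shapes are assumptions, not RA-BLIP's API.

```python
import numpy as np
from typing import List

def retrieve(query_emb: np.ndarray, doc_embs: np.ndarray,
             docs: List[str], k: int = 3) -> List[str]:
    """Return the k documents most cosine-similar to the query embedding."""
    sims = doc_embs @ query_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    return [docs[i] for i in np.argsort(-sims)[:k]]

# The retrieved snippets would then be prepended to the MLLM's prompt
# before it answers the visual question.
```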
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- Multi-modal Queried Object Detection in the Wild [72.16067634379226]
MQ-Det is an efficient architecture and pre-training strategy for real-world object detection.
It incorporates vision queries into existing language-queried-only detectors.
MQ-Det's simple yet effective architecture and training strategy are compatible with most language-queried object detectors (a toy version of the query fusion is sketched below).
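A minimal sketch of the vision-query idea, assuming a simple gated blend: each class's text embedding is modulated by a few visual exemplars before matching. MQ-Det's actual fusion module is more involved; `fuse_queries` and the gate value are illustrative only.

```python
import numpy as np

def fuse_queries(text_emb: np.ndarray, vision_embs: np.ndarray,
                 gate: float = 0.2) -> np.ndarray:
    """Blend a class's text embedding with the mean of its visual exemplars."""
    v = vision_embs.mean(axis=0)
    v = v / (np.linalg.norm(v) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    return (1.0 - gate) * t + gate * v

# Example: a text class embedding refined by three exemplar crops.
class_query = fuse_queries(np.random.randn(256), np.random.randn(3, 256))
```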
arXiv Detail & Related papers (2023-05-30T12:24:38Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
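As a toy illustration of the language cloze test scenario (the helper below is invented, not ContextDET's interface): a noun in the caption is masked, and the model must both fill it in and localize the corresponding object.

```python
def make_cloze(caption: str, target: str) -> str:
    """Mask the target noun; the model must fill it in and also
    localize the corresponding object in the image."""
    return caption.replace(target, "[MASK]")

prompt = make_cloze("a dog chasing a frisbee in the park", "dog")
# -> "a [MASK] chasing a frisbee in the park"
```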
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
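As background only: a minimal, nearest-neighbor version of the deformable-attention sampling idea the summary mentions. Real deformable self-attention uses bilinear sampling, multiple heads, and learned offset/weight projections; everything here is a simplification.

```python
import numpy as np

def deformable_attend(feat: np.ndarray, ref: tuple,
                      offsets: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """feat: (H, W, C) feature map; ref: (y, x) reference point;
    offsets: (K, 2) sampling offsets; weights: (K,) attention weights.
    Gathers K sampled locations around the reference point and returns
    their weighted sum (nearest-neighbor sampling for brevity)."""
    H, W, _ = feat.shape
    out = np.zeros(feat.shape[-1])
    for (dy, dx), w in zip(offsets, weights):
        y = int(np.clip(round(ref[0] + dy), 0, H - 1))
        x = int(np.clip(round(ref[1] + dx), 0, W - 1))
        out += w * feat[y, x]
    return out
```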
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
- Discovery-and-Selection: Towards Optimal Multiple Instance Learning for Weakly Supervised Object Detection [86.86602297364826]
We propose a discovery-and-selection approach fused with multiple instance learning (DS-MIL).
Our proposed DS-MIL approach consistently improves the baselines, achieving state-of-the-art performance.
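DS-MIL's specific discovery and selection modules are not described in this summary; as background, the sketch below shows the classic two-branch multiple-instance scoring (in the style of WSDDN) that such weakly supervised detectors build on, where only image-level labels supervise per-proposal scores.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mil_image_scores(logits: np.ndarray) -> np.ndarray:
    """logits: (num_proposals, num_classes) raw scores.
    A detection branch (softmax over proposals) weights a classification
    branch (softmax over classes); summing over proposals gives
    image-level class scores that weak image labels can supervise."""
    det = softmax(logits, axis=0)
    cls = softmax(logits, axis=1)
    return (det * cls).sum(axis=0)   # (num_classes,)
```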
arXiv Detail & Related papers (2021-10-18T07:06:57Z)