Multi-modal Queried Object Detection in the Wild
- URL: http://arxiv.org/abs/2305.18980v2
- Date: Sun, 8 Oct 2023 11:08:31 GMT
- Title: Multi-modal Queried Object Detection in the Wild
- Authors: Yifan Xu, Mengdan Zhang, Chaoyou Fu, Peixian Chen, Xiaoshan Yang, Ke Li, Changsheng Xu
- Abstract summary: MQ-Det is an efficient architecture and pre-training strategy for real-world object detection.
It incorporates vision queries into existing language-queried-only detectors.
Its simple yet effective architecture and training strategy are compatible with most language-queried object detectors.
- Score: 72.16067634379226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce MQ-Det, an efficient architecture and pre-training strategy
designed to utilize both textual descriptions, which offer open-set generalization,
and visual exemplars, which offer rich description granularity, as category queries,
namely, Multi-modal Queried object Detection, for real-world detection across both
open-vocabulary categories and various granularities. MQ-Det incorporates vision
queries into existing well-established language-queried-only detectors. A
plug-and-play gated class-scalable perceiver module on top of the frozen detector is
proposed to augment category text with class-wise visual information. To
address the learning inertia problem brought by the frozen detector, a
vision-conditioned masked language prediction strategy is proposed. MQ-Det's simple
yet effective architecture and training strategy are compatible with most
language-queried object detectors, yielding versatile applications.
Experimental results demonstrate that multi-modal queries largely boost
open-world detection. For instance, MQ-Det improves the
state-of-the-art open-set detector GLIP by +7.8% AP on the LVIS benchmark via
multi-modal queries without any downstream finetuning, and by an average of +6.3% AP
across 13 few-shot downstream tasks, while requiring only 3% additional modulating
time on top of GLIP. Code is available at https://github.com/YifanXu74/MQ-Det.
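The two mechanisms named in the abstract, the gated class-scalable perceiver and vision-conditioned masked language prediction, can be pictured with a short sketch. This is an illustrative reconstruction from the abstract alone, not the authors' released code: all shapes, the tanh gate, the masking ratio, and the function names are assumptions.

```python
# Minimal PyTorch sketch of MQ-Det's two ingredients as described in the
# abstract. Everything here (shapes, the tanh gate, the 40% mask ratio,
# all names) is an assumption for illustration, not the authors' code.
import torch
import torch.nn as nn


class GatedClassScalablePerceiver(nn.Module):
    """Augments each category's text query with class-wise visual cues.

    Per category, k visual exemplar features are attended from the
    category's text embedding; a scalar gate initialized at zero keeps
    the frozen language-queried detector's behavior intact at the start
    of modulation.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0: pure-text start

    def forward(self, text_query: torch.Tensor, vision_queries: torch.Tensor):
        # text_query:     (num_classes, 1, dim)  one text token per category
        # vision_queries: (num_classes, k, dim)  k exemplar features per category
        attended, _ = self.cross_attn(text_query, vision_queries, vision_queries)
        return text_query + torch.tanh(self.gate) * attended


def mask_category_tokens(token_ids: torch.Tensor, mask_id: int, p: float = 0.4):
    """Vision-conditioned masked language prediction (assumed form): randomly
    mask category text tokens so that reconstruction must rely on the gated
    visual cues, countering the frozen detector's learning inertia."""
    mask = torch.rand(token_ids.shape, device=token_ids.device) < p
    return token_ids.masked_fill(mask, mask_id), mask
```

Because the module operates per category and adds no class-specific parameters, it is class-scalable in the sense that adding categories only means supplying their text embeddings and exemplar features.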
Related papers
- A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning [9.786907179872815]
The potential of vision and language remains underexplored in face forgery detection.
There is a need for a methodology that converts face forgery detection to a Visual Question Answering (VQA) task.
To address this gap, we propose a multi-staged approach that diverges from the traditional binary decision paradigm.
arXiv Detail & Related papers (2024-10-01T08:16:40Z)
- LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction [63.668635390907575]
Existing methods enhance open-vocabulary object detection by leveraging the robust open-vocabulary recognition capabilities of Vision-Language Models (VLMs).
We propose the Language Model Instruction (LaMI) strategy, which leverages the relationships between visual concepts and applies them within a simple yet effective DETR-like detector.
arXiv Detail & Related papers (2024-07-16T02:58:33Z)
- The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z)
- DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection [111.68263493302499]
We introduce DetCLIPv3, a high-performing detector that excels at open-vocabulary object detection and at generating hierarchical labels for detected objects.
DetCLIPv3 is characterized by three core designs: 1) Versatile model architecture; 2) High information density data; and 3) Efficient training strategy.
DetCLIPv3 demonstrates superior open-vocabulary detection performance, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively.
arXiv Detail & Related papers (2024-04-14T11:01:44Z)
- Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD).
We adopt a standard two-stage object detector architecture.
We explore three ways of specifying novel categories: via language descriptions, via image exemplars, or via a combination of the two (see the sketch after this list).
arXiv Detail & Related papers (2023-06-08T18:31:56Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
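To make the multi-modal-classifier idea concrete (the entry above marked "see the sketch after this list"), here is a hedged sketch of building per-category classifiers from language descriptions, image exemplars, or both. It uses OpenAI's public CLIP package; the plain convex-combination fusion and all function names are assumptions for illustration, not that paper's method.

```python
# Hedged sketch: open-vocabulary classifiers from text descriptions and
# image exemplars. The averaging-based fusion is assumed, not the paper's.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def text_classifier(descriptions: list[str]) -> torch.Tensor:
    """One L2-normalized classifier vector per category from text."""
    tokens = clip.tokenize(descriptions).to(device)
    with torch.no_grad():
        w = model.encode_text(tokens)
    return w / w.norm(dim=-1, keepdim=True)

def exemplar_classifier(exemplar_images: torch.Tensor) -> torch.Tensor:
    """One classifier vector per category from preprocessed exemplar
    crops, one image per category, shape (num_classes, 3, H, W)."""
    with torch.no_grad():
        w = model.encode_image(exemplar_images.to(device))
    return w / w.norm(dim=-1, keepdim=True)

def fused_classifier(w_text: torch.Tensor, w_img: torch.Tensor,
                     alpha: float = 0.5) -> torch.Tensor:
    """Combine the two modalities; a convex combination is assumed."""
    w = alpha * w_text + (1 - alpha) * w_img
    return w / w.norm(dim=-1, keepdim=True)

# Region features from a two-stage detector's RoI head (shape (N, D)) can
# then be scored against any of the three classifiers by cosine similarity:
# scores = region_features @ fused_classifier(w_t, w_i).T
```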