DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
- URL: http://arxiv.org/abs/2411.14347v2
- Date: Fri, 06 Dec 2024 01:45:46 GMT
- Title: DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
- Authors: Tianhe Ren, Yihao Chen, Qing Jiang, Zhaoyang Zeng, Yuda Xiong, Wenlong Liu, Zhengyu Ma, Junyi Shen, Yuan Gao, Xiaoke Jiang, Xingyu Chen, Zhuheng Song, Yuhong Zhang, Hongjie Huang, Han Gao, Shilong Liu, Hao Zhang, Feng Li, Kent Yu, Lei Zhang
- Abstract summary: We introduce DINO-X, a unified object-centric vision model developed by IDEA Research. DINO-X employs the same Transformer-based encoder-decoder architecture as Grounding DINO 1.5 to pursue an object-level representation for open-world object understanding. We develop a universal object prompt to support prompt-free open-world detection, making it possible to detect anything in an image without requiring users to provide any prompt.
- Score: 35.522774800394664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce DINO-X, which is a unified object-centric vision model developed by IDEA Research with the best open-world object detection performance to date. DINO-X employs the same Transformer-based encoder-decoder architecture as Grounding DINO 1.5 to pursue an object-level representation for open-world object understanding. To make long-tailed object detection easy, DINO-X extends its input options to support text prompt, visual prompt, and customized prompt. With such flexible prompt options, we develop a universal object prompt to support prompt-free open-world detection, making it possible to detect anything in an image without requiring users to provide any prompt. To enhance the model's core grounding capability, we have constructed a large-scale dataset with over 100 million high-quality grounding samples, referred to as Grounding-100M, for advancing the model's open-vocabulary detection performance. Pre-training on such a large-scale grounding dataset leads to a foundational object-level representation, which enables DINO-X to integrate multiple perception heads to simultaneously support multiple object perception and understanding tasks, including detection, segmentation, pose estimation, object captioning, object-based QA, etc. Experimental results demonstrate the superior performance of DINO-X. Specifically, the DINO-X Pro model achieves 56.0 AP, 59.8 AP, and 52.4 AP on the COCO, LVIS-minival, and LVIS-val zero-shot object detection benchmarks, respectively. Notably, it scores 63.3 AP and 56.5 AP on the rare classes of LVIS-minival and LVIS-val benchmarks, improving the previous SOTA performance by 5.8 AP and 5.0 AP. Such a result underscores its significantly improved capacity for recognizing long-tailed objects.
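As a rough illustration of the flexible prompt options described in the abstract, the sketch below shows how text-prompted, visual-prompted, and prompt-free (universal object prompt) detection could sit behind a single dispatch function. The `OpenWorldDetector` class, its method names, and the `Detection` structure are illustrative assumptions, not the official DINO-X interface.

```python
# Hypothetical sketch of a DINO-X-style prompt interface (not the official API).
# It mirrors the three documented input options -- text prompt, visual prompt,
# and a universal object prompt for prompt-free detection -- behind one dispatch.
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    box: Box
    label: str    # class name, or a generic "object" in prompt-free mode
    score: float  # confidence in [0, 1]


class OpenWorldDetector:
    """Illustrative wrapper; method names and signatures are assumptions."""

    def detect_with_text(self, image, text: str) -> List[Detection]:
        # Ground a free-form text prompt (e.g. "forklift . safety helmet") in the image.
        raise NotImplementedError

    def detect_with_visual_prompt(self, image, example_boxes: Sequence[Box]) -> List[Detection]:
        # Use user-drawn example boxes as visual prompts for the target category.
        raise NotImplementedError

    def detect_anything(self, image) -> List[Detection]:
        # Prompt-free mode: a learned universal object prompt stands in for user input,
        # so every salient object is returned without any text or boxes.
        raise NotImplementedError


def run(detector: OpenWorldDetector, image, text: Optional[str] = None,
        example_boxes: Optional[Sequence[Box]] = None) -> List[Detection]:
    """Dispatch to the prompt mode implied by the arguments the caller provides."""
    if text is not None:
        return detector.detect_with_text(image, text)
    if example_boxes is not None:
        return detector.detect_with_visual_prompt(image, example_boxes)
    return detector.detect_anything(image)  # universal object prompt, no user input
```

In this reading, the prompt-free path simply swaps a learned universal prompt in for user input, which is what makes "detect anything in an image" possible without text or boxes; the same object-level representation would also feed the additional perception heads (segmentation, pose estimation, captioning, QA) mentioned in the abstract.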
Related papers
- MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism [67.56918651825056]
We propose a new decoder architecture with the parallel Multi-time Inquiries (MI) mechanism.
Our MI-based model, MI-DETR, outperforms all existing DETR-like models on the COCO benchmark.
A series of diagnostic and visualization experiments demonstrate the effectiveness, rationality, and interpretability of MI.
arXiv Detail & Related papers (2025-03-03T12:19:06Z)
- LoXR: Performance Evaluation of Locally Executing LLMs on XR Devices [55.33807002543901]
We deploy 17 large language models (LLMs) across four XR devices.
We evaluate performance on four key metrics: performance consistency, processing speed, memory usage, and battery consumption.
arXiv Detail & Related papers (2025-02-13T20:55:48Z)
- Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models [47.18069715855738]
Recent vision foundation models can extract universal representations and show impressive abilities in various tasks.
We show that frozen foundation models can be a versatile feature enhancer, even though they are not pre-trained for object detection.
arXiv Detail & Related papers (2024-10-25T15:38:24Z)
- Locate Anything on Earth: Advancing Open-Vocabulary Object Detection for Remote Sensing Community [50.16478515591924]
We propose and train the novel LAE-DINO Model, the first open-vocabulary foundation object detector for the LAE task.
We conduct experiments on the established remote sensing benchmarks DIOR and DOTAv2.0, as well as on our newly introduced 80-class LAE-80C benchmark.
Results demonstrate the advantages of the LAE-1M dataset and the effectiveness of the LAE-DINO method.
arXiv Detail & Related papers (2024-08-17T06:24:43Z)
- Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection [23.464027681439706]
Grounding DINO 1.5 is a suite of advanced open-set object detection models developed by IDEA Research.
Grounding DINO 1.5 Pro is a high-performance model designed for stronger generalization capability across a wide range of scenarios.
Grounding DINO 1.5 Edge is an efficient optimized model for faster speed demanded in many applications requiring edge deployment.
arXiv Detail & Related papers (2024-05-16T17:54:15Z)
- DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection [111.68263493302499]
We introduce DetCLIPv3, a high-performing detector that excels at both open-vocabulary object detection and generating hierarchical labels for detected objects.
DetCLIPv3 is characterized by three core designs: 1) Versatile model architecture; 2) High information density data; and 3) Efficient training strategy.
DetCLIPv3 demonstrates superior open-vocabulary detection performance, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively.
arXiv Detail & Related papers (2024-04-14T11:01:44Z)
- YOLO-World: Real-Time Open-Vocabulary Object Detection [87.08732047660058]
We introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities.
Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency.
YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed.
arXiv Detail & Related papers (2024-01-30T18:59:38Z)
- From Blurry to Brilliant Detection: YOLOv5-Based Aerial Object Detection with Super Resolution [4.107182710549721]
We present an innovative approach that combines super-resolution and an adapted lightweight YOLOv5 architecture.
Our experimental results demonstrate the model's superior performance in detecting small and densely clustered objects.
arXiv Detail & Related papers (2024-01-26T05:50:58Z)
- Open World Object Detection in the Era of Foundation Models [53.683963161370585]
We introduce a new benchmark that includes five real-world application-driven datasets.
We introduce a novel method, Foundation Object detection Model for the Open world, or FOMO, which identifies unknown objects based on their shared attributes with the base known objects (see the illustrative sketch after this entry).
arXiv Detail & Related papers (2023-12-10T03:56:06Z)
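The sketch below gives one plausible reading of the attribute-based idea summarized above for FOMO: score candidate regions against embeddings of attribute phrases shared by the base (known) classes, and flag object-like regions that match the attributes but none of the known class names as unknown objects. This is an illustration under assumed inputs, not the paper's actual implementation; all shapes, names, and thresholds are hypothetical.

```python
# Hedged sketch of attribute-based unknown-object identification.
# Region features and text embeddings are assumed to live in a shared,
# L2-normalised embedding space (e.g. produced by a vision-language model).
import numpy as np


def score_regions(region_feats, known_class_embs, attribute_embs,
                  known_thresh=0.5, attr_thresh=0.4):
    """Label each region as a known-class index, 'unknown object', or 'background'.

    region_feats:      (R, D) L2-normalised features of candidate regions
    known_class_embs:  (K, D) L2-normalised embeddings of the base class names
    attribute_embs:    (A, D) L2-normalised embeddings of attribute phrases the
                       base classes share (e.g. "has wheels", "metallic")
    """
    labels = []
    for feat in region_feats:
        known_sim = known_class_embs @ feat   # cosine similarity to each known class
        attr_sim = attribute_embs @ feat      # cosine similarity to each attribute
        if known_sim.max() >= known_thresh:
            labels.append(int(known_sim.argmax()))   # confidently a known class
        elif attr_sim.mean() >= attr_thresh:
            labels.append("unknown object")          # object-like via shared attributes
        else:
            labels.append("background")
    return labels


# Toy usage with random, L2-normalised embeddings (dimensions are arbitrary).
rng = np.random.default_rng(0)
l2 = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
print(score_regions(l2(rng.normal(size=(3, 16))),   # 3 candidate regions
                    l2(rng.normal(size=(5, 16))),   # 5 known classes
                    l2(rng.normal(size=(4, 16)))))  # 4 shared attributes
```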
- Long Range Object-Level Monocular Depth Estimation for UAVs [0.0]
We propose several novel extensions to state-of-the-art methods for monocular object detection from images at long range.
Firstly, we propose Sigmoid and ReLU-like encodings when modeling depth estimation as a regression task.
Secondly, we frame the depth estimation as a classification problem and introduce a Soft-Argmax function in the calculation of the training loss (a minimal sketch follows this entry).
arXiv Detail & Related papers (2023-02-17T15:26:04Z)
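To make the classification-with-Soft-Argmax idea above concrete, here is a minimal sketch of a soft-argmax depth readout over discrete depth bins: the softmax of the per-bin logits is treated as a probability distribution, and the probability-weighted average of the bin centres gives a continuous, differentiable depth estimate that can feed the training loss. The bin layout and values are illustrative, not the paper's configuration.

```python
# Minimal soft-argmax sketch for depth estimation framed as classification
# over discrete depth bins (illustrative bin layout, not the paper's settings).
import numpy as np


def soft_argmax_depth(logits: np.ndarray, bin_centers: np.ndarray) -> np.ndarray:
    """Convert per-bin logits into a continuous depth estimate.

    logits:       (..., num_bins) raw classification scores per depth bin
    bin_centers:  (num_bins,) depth value (e.g. in metres) that each bin represents
    """
    # Numerically stable softmax over the bin dimension.
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    # Soft-argmax: expected bin centre under the predicted distribution.
    return (probs * bin_centers).sum(axis=-1)


# Example: 8 bins spanning 10..150 m for one detected object.
bins = np.linspace(10.0, 150.0, 8)
logits = np.array([0.1, 0.3, 2.5, 4.0, 1.2, 0.2, 0.0, -0.5])
print(soft_argmax_depth(logits, bins))  # continuous depth estimate in metres
```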
- RTMDet: An Empirical Study of Designing Real-Time Object Detectors [13.09100888887757]
We develop an efficient real-time object detector that exceeds the YOLO series and is easily extensible to many object recognition tasks.
Together with better training techniques, the resulting object detector, named RTMDet, achieves 52.8% AP on COCO with 300+ FPS on an NVIDIA 3090 GPU.
We hope the experimental results can provide new insights into designing versatile real-time object detectors for many object recognition tasks.
arXiv Detail & Related papers (2022-12-14T18:50:20Z)
- Anchor Retouching via Model Interaction for Robust Object Detection in Aerial Images [15.404024559652534]
We present an effective Dynamic Enhancement Anchor (DEA) network to construct a novel training sample generator.
Our method achieves state-of-the-art performance in accuracy with moderate inference speed and computational overhead for training.
arXiv Detail & Related papers (2021-12-13T14:37:20Z)
- One-Shot Object Affordance Detection in the Wild [76.46484684007706]
Affordance detection refers to identifying the potential action possibilities of objects in an image.
We devise a One-Shot Affordance Detection Network (OSAD-Net) that estimates the human action purpose and then transfers it to help detect the common affordance from all candidate images.
With complex scenes and rich annotations, our PADv2 dataset can be used as a test bed to benchmark affordance detection methods.
arXiv Detail & Related papers (2021-08-08T14:53:10Z)