YOLO-World: Real-Time Open-Vocabulary Object Detection
- URL: http://arxiv.org/abs/2401.17270v3
- Date: Thu, 22 Feb 2024 13:05:52 GMT
- Title: YOLO-World: Real-Time Open-Vocabulary Object Detection
- Authors: Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan
- Abstract summary: We introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities.
Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency.
YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed.
- Score: 87.08732047660058
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The You Only Look Once (YOLO) series of detectors have established themselves
as efficient and practical tools. However, their reliance on predefined and
trained object categories limits their applicability in open scenarios.
Addressing this limitation, we introduce YOLO-World, an innovative approach
that enhances YOLO with open-vocabulary detection capabilities through
vision-language modeling and pre-training on large-scale datasets.
Specifically, we propose a new Re-parameterizable Vision-Language Path
Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate
the interaction between visual and linguistic information. Our method excels in
detecting a wide range of objects in a zero-shot manner with high efficiency.
On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on
V100, which outperforms many state-of-the-art methods in terms of both accuracy
and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable
performance on several downstream tasks, including object detection and
open-vocabulary instance segmentation.
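To make the region-text idea concrete, the following is a minimal sketch of a region-text contrastive objective in the spirit of the paper: region embeddings are matched against text (vocabulary) embeddings via cosine similarity and trained with a cross-entropy objective. The tensor shapes, temperature value, and function name are illustrative assumptions, not the authors' exact formulation.

```python
# Hypothetical sketch of a region-text contrastive objective in the spirit of
# YOLO-World's region-text matching; shapes, names, and the temperature value
# are illustrative assumptions, not the paper's exact loss.
import torch
import torch.nn.functional as F


def region_text_contrastive_loss(region_feats: torch.Tensor,
                                 text_embeds: torch.Tensor,
                                 target_ids: torch.Tensor,
                                 temperature: float = 0.05) -> torch.Tensor:
    """region_feats: (N, D) embeddings of N candidate regions.
    text_embeds:  (C, D) embeddings of C vocabulary phrases (e.g. from a text encoder).
    target_ids:   (N,) index of the matched phrase for each region.
    """
    # L2-normalize both modalities so the dot product is a cosine similarity.
    region_feats = F.normalize(region_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (N, C) region-to-text similarity logits, scaled by a temperature.
    logits = region_feats @ text_embeds.t() / temperature

    # Cross-entropy pulls each region toward its matched phrase and pushes it
    # away from the rest of the vocabulary.
    return F.cross_entropy(logits, target_ids)


if __name__ == "__main__":
    regions = torch.randn(8, 512)          # 8 region embeddings
    vocab = torch.randn(20, 512)           # 20-phrase vocabulary
    targets = torch.randint(0, 20, (8,))   # assigned phrase per region
    print(region_text_contrastive_loss(regions, vocab, targets))
```

In the actual pre-training setup, the target indices would come from the detector's label assignment over large-scale region-text pairs rather than the random placeholders used here.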
Related papers
- Evaluating the Evolution of YOLO (You Only Look Once) Models: A Comprehensive Benchmark Study of YOLO11 and Its Predecessors [0.0]
This study presents a benchmark analysis of various YOLO (You Only Look Once) algorithms, from YOLOv3 to the newest addition, YOLO11.
It evaluates their performance on three diverse datasets: Traffic Signs (with varying object sizes), African Wildlife (with diverse aspect ratios and at least one instance of the object per image), and Ships and Vessels (with small-sized objects of a single class).
arXiv Detail & Related papers (2024-10-31T20:45:00Z)
- Locate Anything on Earth: Advancing Open-Vocabulary Object Detection for Remote Sensing Community [50.16478515591924]
We propose and train the novel LAE-DINO Model, the first open-vocabulary foundation object detector for the LAE task.
We conduct experiments on the established remote sensing benchmarks DIOR and DOTAv2.0, as well as our newly introduced 80-class LAE-80C benchmark.
Results demonstrate the advantages of the LAE-1M dataset and the effectiveness of the LAE-DINO method.
arXiv Detail & Related papers (2024-08-17T06:24:43Z)
- YOLOv5, YOLOv8 and YOLOv10: The Go-To Detectors for Real-time Vision [0.6662800021628277]
This paper traces the evolution of the YOLO (You Only Look Once) object detection algorithm, focusing on YOLOv5, YOLOv8, and YOLOv10.
We analyze the architectural advancements, performance improvements, and suitability for edge deployment across these versions.
arXiv Detail & Related papers (2024-07-03T10:40:20Z)
- YOLOv10: Real-Time End-to-End Object Detection [68.28699631793967]
YOLOs have emerged as the predominant paradigm in the field of real-time object detection.
The reliance on the non-maximum suppression (NMS) for post-processing hampers the end-to-end deployment of YOLOs.
We introduce a holistic efficiency-accuracy-driven model design strategy for YOLOs.
arXiv Detail & Related papers (2024-05-23T11:44:29Z)
- Investigating YOLO Models Towards Outdoor Obstacle Detection For Visually Impaired People [3.4628430044380973]
Seven different YOLO object detection models were implemented.
YOLOv8 was found to be the best model, which reached a precision of 80% and a recall of 68.2% on a well-known Obstacle dataset.
YOLO-NAS was found to be suboptimal for the obstacle detection task.
arXiv Detail & Related papers (2023-12-10T13:16:22Z)
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-time Object Detection [80.11152626362109]
We provide an efficient and performant object detector, termed YOLO-MS.
We train our YOLO-MS on the MS COCO dataset from scratch without relying on any other large-scale datasets.
Our work can also be used as a plug-and-play module for other YOLO models.
arXiv Detail & Related papers (2023-08-10T10:12:27Z)
- A lightweight and accurate YOLO-like network for small target detection in Aerial Imagery [94.78943497436492]
We present YOLO-S, a simple, fast and efficient network for small target detection.
YOLO-S exploits a small feature extractor based on Darknet20, together with skip connections via both bypass and concatenation.
YOLO-S has 87% fewer parameters and almost half the FLOPs of YOLOv3, making deployment practical for low-power industrial applications.
arXiv Detail & Related papers (2022-04-05T16:29:49Z)