Related papers: Real-Time Onboard Object Detection for Augmented Reality: Enhancing Head-Mounted Display with YOLOv8

Gaze Prediction in Virtual Reality Without Eye Tracking Using Visual and Head Motion Cues [3.4383905541567583]
We present a novel gaze prediction framework that combines Head-Mounted Display (HMD) motion signals with visual saliency cues derived from video frames. Our method employs UniSal, a lightweight saliency encoder, to extract visual features, which are then fused with HMD motion data and processed through a time-series prediction module. Experiments on the EHTask dataset, along with deployment on commercial VR hardware, show that our approach consistently outperforms baselines such as Center-of-HMD and Mean Gaze.
arXiv Detail & Related papers (2026-01-26T11:26:27Z)
YOLOA: Real-Time Affordance Detection via LLM Adapter [96.61111291833544]
Affordance detection aims to jointly address the fundamental "what-where-how" challenge in embodied AI. We introduce YOLO Affordance (YOLOA), a real-time affordance detection model that jointly handles object detection and affordance learning. Experiments on our relabeled ADG-Det and IIT-Heat benchmarks demonstrate that YOLOA achieves state-of-the-art accuracy while maintaining real-time performance.
arXiv Detail & Related papers (2025-12-03T03:53:31Z)
Barcode and QR Code Object Detection: An Experimental Study on YOLOv8 Models [2.0847503603392927]
This research work dives into an in-depth evaluation of the YOLOv8 (You Only Look Once) algorithm's efficiency in object detection. Our goal was to optimize YOLOv8's overall performance throughout numerous situations and environments. We achieved an accuracy of 88.95% for the nano model, 97.10% for the small model, and 94.10% for the medium version.
arXiv Detail & Related papers (2025-11-28T07:26:28Z)
Sim2Real Transfer for Vision-Based Grasp Verification [7.9471205712560264]
We present a vision-based approach for grasp verification to determine whether the robotic gripper has successfully grasped an object. Our method employs a two-stage architecture; first, a YOLO-based object detection model detects and locates the robot's gripper. To address the limitations of real-world data capture, we introduce HSR-Grasp Synth, a synthetic dataset designed to simulate diverse grasping scenarios.
arXiv Detail & Related papers (2025-05-05T22:04:12Z)
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning [50.33337482489673]
This paper aims to enhance video perception with Reinforcement Fine-Tuning (RFT). We develop VideoChat-R1, a powerful video MLLM that achieves state-of-the-art performance on spatio-temporal tasks without sacrificing chat ability. Our findings underscore the potential of RFT for specialized task enhancement of Video MLLMs.
arXiv Detail & Related papers (2025-04-09T15:09:27Z)
DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning [57.285435980459205]
Compositional visual reasoning approaches have shown promise as more effective strategies than end-to-end VR methods. We propose DWIM: Discrepancy-aware Workflow generation, which assesses tool usage and extracts more viable workflows for training, and Instruct-Masking fine-tuning, which guides the model to only clone effective actions, enabling the generation of more practical solutions.
arXiv Detail & Related papers (2025-03-25T01:57:59Z)
A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning [67.72413262980272]
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. We develop SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck. Our approach achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations.
arXiv Detail & Related papers (2025-03-10T06:18:31Z)
What is YOLOv9: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector [0.0]
This study examines the YOLOv9 object detection model, focusing on its architectural innovations, training methodologies, and performance improvements. Key advancements, such as the Generalized Efficient Layer Aggregation Network (GELAN) and Programmable Gradient Information (PGI), significantly enhance feature extraction and gradient flow. This paper provides the first in-depth exploration of YOLOv9's internal features and their real-world applicability, establishing it as a state-of-the-art solution for real-time object detection.
arXiv Detail & Related papers (2024-09-12T07:46:58Z)
Lightweight Object Detection: A Study Based on YOLOv7 Integrated with ShuffleNetv2 and Vision Transformer [0.0]
This study zeroes in on optimizing the YOLOv7 algorithm to boost its operational efficiency and speed on mobile platforms. The experimental outcomes reveal that the refined YOLO model demonstrates exceptional performance.
arXiv Detail & Related papers (2024-03-04T05:29:32Z)
YOLO-World: Real-Time Open-Vocabulary Object Detection [87.08732047660058]
We introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed.
arXiv Detail & Related papers (2024-01-30T18:59:38Z)
DM-VTON: Distilled Mobile Real-time Virtual Try-On [16.35842298296878]
Distilled Mobile Real-time Virtual Try-On (DM-VTON) is a novel virtual try-on framework designed to achieve simplicity and efficiency. We introduce an efficient Mobile Generative Module within the Student network, significantly reducing the runtime. Experimental results show that the proposed method can achieve 40 frames per second on a single Nvidia Tesla T4 GPU.
arXiv Detail & Related papers (2023-08-26T07:46:27Z)
YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-time Object Detection [80.11152626362109]
We provide an efficient and performant object detector, termed YOLO-MS. We train our YOLO-MS on the MS COCO dataset from scratch without relying on any other large-scale datasets. Our work can also be used as a plug-and-play module for other YOLO models.
arXiv Detail & Related papers (2023-08-10T10:12:27Z)
Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning. We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z)
Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data. We first train a scale-aware disparity network using both monocular real images and stereo virtual data. The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z)
HMD-EgoPose: Head-Mounted Display-Based Egocentric Marker-Less Tool and Hand Pose Estimation for Augmented Surgical Guidance [0.0]
We present HMD-EgoPose, a single-shot learning-based approach to hand and object pose estimation. We demonstrate state-of-the-art performance on a benchmark dataset for marker-less hand and surgical instrument pose tracking.
arXiv Detail & Related papers (2022-02-24T04:07:34Z)
Analysis of voxel-based 3D object detection methods efficiency for real-time embedded systems [93.73198973454944]
Two popular voxel-based 3D object detection methods are studied in this paper. Our experiments show that these methods mostly fail to detect distant small objects due to the sparsity of the input point clouds at large distances. Our findings suggest that a considerable part of the computations of existing methods is focused on locations of the scene that do not contribute to successful detection.
arXiv Detail & Related papers (2021-05-21T12:40:59Z)
Learnable Online Graph Representations for 3D Multi-Object Tracking [156.58876381318402]
We propose a unified and learning-based approach to the 3D MOT problem. We employ a Neural Message Passing network for data association that is fully trainable. We show the merit of the proposed approach on the publicly available nuScenes dataset by achieving state-of-the-art performance of 65.6% AMOTA and 58% fewer ID-switches.
arXiv Detail & Related papers (2021-04-23T17:59:28Z)
A Markerless Deep Learning-based 6 Degrees of Freedom PoseEstimation for with Mobile Robots using RGB Data [3.4806267677524896]
We propose a method to deploy state-of-the-art neural networks for real-time 3D object localization on augmented reality devices. We focus on fast 2D detection approaches, which extract the 3D pose of the object quickly and accurately using only 2D input. For the 6D annotation of 2D images, we developed an annotation tool, which is, to our knowledge, the first open-source tool to be available.
arXiv Detail & Related papers (2020-01-16T09:13:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.