Cross-Paradigm Evaluation of Gaze-Based Semantic Object Identification for Intelligent Vehicles
- URL: http://arxiv.org/abs/2602.01452v1
- Date: Sun, 01 Feb 2026 21:43:02 GMT
- Title: Cross-Paradigm Evaluation of Gaze-Based Semantic Object Identification for Intelligent Vehicles
- Authors: Penghao Deng, Jidong J. Yang, Jiachen Bian
- Abstract summary: This paper tackles this challenge as a semantic identification task from the road scenes captured by a vehicle's front-view camera. Three vision-based approaches are investigated: direct object detection (YOLOv13), segmentation-assisted classification (SAM2 paired with EfficientNetV2 versus YOLOv13), and query-based Vision-Language Models, VLMs. The results demonstrate that the direct object detection (YOLOv13) and Qwen2.5-VL-32b significantly outperform other approaches, achieving Macro F1-Scores over 0.84.
- Score: 2.867517731896504
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding where drivers direct their visual attention during driving, as characterized by gaze behavior, is critical for developing next-generation advanced driver-assistance systems and improving road safety. This paper tackles this challenge as a semantic identification task from the road scenes captured by a vehicle's front-view camera. Specifically, the collocation of gaze points with object semantics is investigated using three distinct vision-based approaches: direct object detection (YOLOv13), segmentation-assisted classification (SAM2 paired with EfficientNetV2 versus YOLOv13), and query-based Vision-Language Models, VLMs (Qwen2.5-VL-7b versus Qwen2.5-VL-32b). The results demonstrate that the direct object detection (YOLOv13) and Qwen2.5-VL-32b significantly outperform other approaches, achieving Macro F1-Scores over 0.84. The large VLM (Qwen2.5-VL-32b), in particular, exhibited superior robustness and performance for identifying small, safety-critical objects such as traffic lights, especially in adverse nighttime conditions. Conversely, the segmentation-assisted paradigm suffers from a "part-versus-whole" semantic gap that led to large failure in recall. The results reveal a fundamental trade-off between the real-time efficiency of traditional detectors and the richer contextual understanding and robustness offered by large VLMs. These findings provide critical insights and practical guidance for the design of future human-aware intelligent driver monitoring systems.
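The abstract describes collocating gaze points with object semantics and scoring the result with Macro F1. A minimal sketch of that idea for the direct-detection paradigm follows; it is not the authors' implementation. The `Detection` record, the `label_at_gaze` helper, the "smallest enclosing box wins" tie-break, and the `background` fallback label are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical detection record: class label plus an axis-aligned box in pixels.
Detection = namedtuple("Detection", "label x1 y1 x2 y2")


def label_at_gaze(detections, gx, gy, background="background"):
    """Return the label of the smallest detected box containing the gaze point.

    Preferring the smallest enclosing box is an assumed tie-break, so that a
    small object (e.g. a traffic light) wins over a larger box behind it.
    """
    hits = [d for d in detections if d.x1 <= gx <= d.x2 and d.y1 <= gy <= d.y2]
    if not hits:
        return background
    return min(hits, key=lambda d: (d.x2 - d.x1) * (d.y2 - d.y1)).label


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Because Macro F1 averages per-class scores without weighting by frequency, rare classes such as traffic lights count as much as common ones, which is why it suits the small-object comparison the abstract reports.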
Related papers
- A Comparative Evaluation of Large Vision-Language Models for 2D Object Detection under SOTIF Conditions [2.7694879331630182]
This paper presents a systematic evaluation of Large Vision-Language Models (LVLMs) for safety-critical 2D object detection. The PeSOTIF dataset is a benchmark specifically curated for long-tail traffic scenarios and environmental degradations. Experimental results reveal a critical trade-off: top-performing LVLMs surpass the YOLO baseline in recall by over 25% in complex natural scenarios.
arXiv Detail & Related papers (2026-01-30T10:58:24Z)
- Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios [0.0]
We propose an efficient obstacle avoidance pipeline that leverages a camera-only perception module and a Frenet-Pure Pursuit-based planning strategy. By integrating advancements in computer vision, the system utilizes YOLOv11 for object detection and state-of-the-art monocular depth estimation models, such as Depth Anything V2, to estimate object distances. The system is evaluated in diverse scenarios on a university campus, demonstrating its effectiveness in handling various obstacles and enhancing autonomous navigation.
arXiv Detail & Related papers (2025-07-16T17:41:14Z)
- Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving [55.96227460521096]
Vision-Language Models (VLMs) have been integrated into autonomous driving systems to enhance reasoning capabilities. We propose a natural reflection-based backdoor attack targeting VLM systems in autonomous driving scenarios. Our findings uncover a new class of attacks that exploit the stringent real-time requirements of autonomous driving.
arXiv Detail & Related papers (2025-05-09T20:28:17Z)
- Salient Object Detection in Traffic Scene through the TSOD10K Dataset [22.615252113004402]
Traffic Salient Object Detection (TSOD) aims to segment the objects critical to driving safety by combining semantic (e.g., collision risks) and visual saliency. Our research establishes the first foundation for safety-aware saliency analysis in intelligent transportation systems.
arXiv Detail & Related papers (2025-03-21T07:21:24Z)
- Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives [56.528835143531694]
We introduce DriveBench, a benchmark dataset designed to evaluate Vision-Language Models (VLMs). Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding. We propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding.
arXiv Detail & Related papers (2025-01-07T18:59:55Z)
- Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models [61.899791071654654]
We introduce a benchmark, Q-Spatial Bench, with 271 questions across five categories designed for quantitative spatial reasoning.
We investigate the performance of state-of-the-art vision-language models (VLMs) on this task.
We develop a zero-shot prompting technique, SpatialPrompt, that encourages VLMs to answer quantitative spatial questions using reference objects as visual cues.
arXiv Detail & Related papers (2024-09-15T16:45:42Z)
- FENet: Focusing Enhanced Network for Lane Detection [0.0]
This research pioneers networks augmented with Focusing Sampling, Partial Field of View Evaluation, Enhanced FPN architecture and Directional IoU Loss.
Experiments demonstrate our Focusing Sampling strategy, emphasizing vital distant details unlike uniform approaches.
Future directions include collecting on-road data and integrating complementary dual frameworks to further breakthroughs guided by human perception principles.
arXiv Detail & Related papers (2023-12-28T17:52:09Z)
- DRUformer: Enhancing the driving scene Important object detection with driving relationship self-understanding [50.81809690183755]
Traffic accidents frequently lead to fatal injuries, contributing to over 50 million deaths until 2023.
Previous research primarily assessed the importance of individual participants, treating them as independent entities.
We introduce Driving scene Relationship self-Understanding transformer (DRUformer) to enhance the important object detection task.
arXiv Detail & Related papers (2023-11-11T07:26:47Z)
- OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping [84.65114565766596]
We present OpenLane-V2, the first dataset on topology reasoning for traffic scene structure.
OpenLane-V2 consists of 2,000 annotated road scenes that describe traffic elements and their correlation to the lanes.
We evaluate various state-of-the-art methods, and present their quantitative and qualitative results on OpenLane-V2 to indicate future avenues for investigating topology reasoning in traffic scenes.
arXiv Detail & Related papers (2023-04-20T16:31:22Z)
- Recurrent Vision Transformers for Object Detection with Event Cameras [62.27246562304705]
We present Recurrent Vision Transformers (RVTs), a novel backbone for object detection with event cameras.
RVTs can be trained from scratch to reach state-of-the-art performance on event-based object detection.
Our study brings new insights into effective design choices that can be fruitful for research beyond event-based vision.
arXiv Detail & Related papers (2022-12-11T20:28:59Z)
- Blind-Spot Collision Detection System for Commercial Vehicles Using Multi Deep CNN Architecture [0.17499351967216337]
Two convolutional neural networks (CNNs) based on high-level feature descriptors are proposed to detect blind-spot collisions for heavy vehicles.
A fusion approach is proposed to integrate two pre-trained networks for extracting high level features for blind-spot vehicle detection.
The fusion of features significantly improves the performance of Faster R-CNN and outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2022-08-17T11:10:37Z)