Zero-Shot Distracted Driver Detection via Vision Language Models with Double Decoupling
- URL: http://arxiv.org/abs/2601.08467v1
- Date: Tue, 13 Jan 2026 11:46:05 GMT
- Title: Zero-Shot Distracted Driver Detection via Vision Language Models with Double Decoupling
- Authors: Takamichi Miyata, Sumiko Miyata, Andrew Morris
- Abstract summary: Vision-language models (VLMs) enable strong zero-shot image classification, but existing VLM-based distracted driver detectors often underperform in real-world conditions. We identify subject-specific appearance variations as a key bottleneck, leading to decisions driven by who the driver is rather than what the driver is doing. We propose a subject decoupling framework that extracts a driver appearance embedding and removes its influence from the image embedding prior to zero-shot classification.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distracted driving is a major cause of traffic collisions, calling for robust and scalable detection methods. Vision-language models (VLMs) enable strong zero-shot image classification, but existing VLM-based distracted driver detectors often underperform in real-world conditions. We identify subject-specific appearance variations (e.g., clothing, age, and gender) as a key bottleneck: VLMs entangle these factors with behavior cues, leading to decisions driven by who the driver is rather than what the driver is doing. To address this, we propose a subject decoupling framework that extracts a driver appearance embedding and removes its influence from the image embedding prior to zero-shot classification, thereby emphasizing distraction-relevant evidence. We further orthogonalize text embeddings via metric projection onto the Stiefel manifold to improve separability while staying close to the original semantics. Experiments demonstrate consistent gains over prior baselines, indicating the promise of our approach for practical road-safety applications.
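The abstract's two ingredients can be illustrated with a small NumPy sketch. This is a minimal reading of the described pipeline, not the authors' implementation: the embedding extractors are stubbed with random vectors, subject decoupling is approximated as subtracting the projection of the image embedding onto the appearance embedding, and the text-embedding orthogonalization uses the Frobenius-nearest (metric) projection onto the Stiefel manifold, which for a matrix with thin SVD T = U S Vᵀ is U Vᵀ.

```python
import numpy as np

def decouple_subject(image_emb: np.ndarray, appearance_emb: np.ndarray) -> np.ndarray:
    """One simple linear reading of 'subject decoupling': subtract the
    projection of the image embedding onto the (unit-normalized) driver
    appearance embedding, then renormalize the residual."""
    a = appearance_emb / np.linalg.norm(appearance_emb)
    residual = image_emb - (image_emb @ a) * a
    return residual / np.linalg.norm(residual)

def orthogonalize_text(text_embs: np.ndarray) -> np.ndarray:
    """Metric projection of the stacked class-prompt embeddings
    (num_classes x dim, num_classes <= dim) onto the Stiefel manifold:
    the Frobenius-nearest matrix with orthonormal rows is U @ Vt from
    the thin SVD."""
    U, _, Vt = np.linalg.svd(text_embs, full_matrices=False)
    return U @ Vt

def zero_shot_classify(image_emb, appearance_emb, text_embs) -> int:
    """Dot-product zero-shot classification after both decoupling steps.
    Rows of the projected text matrix are unit-norm, so the dot product
    with the unit-norm decoupled image embedding is a cosine score."""
    z = decouple_subject(image_emb, appearance_emb)
    T = orthogonalize_text(text_embs)
    return int(np.argmax(T @ z))

# Toy demo with random stand-ins for real CLIP-style embeddings.
rng = np.random.default_rng(0)
dim, num_classes = 16, 5
image_emb = rng.normal(size=dim)
appearance_emb = rng.normal(size=dim)
text_embs = rng.normal(size=(num_classes, dim))
pred = zero_shot_classify(image_emb, appearance_emb, text_embs)
```

After the projection, the decoupled embedding is exactly orthogonal to the appearance direction, and the projected text embeddings form an orthonormal set, which is one way to read "improve separability while staying close to the original semantics" (the SVD-based projection is the closest such set in Frobenius norm).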
Related papers
- Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving [55.13109926181247]
We introduce ReflectDrive, a learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient computation. Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors.
arXiv Detail & Related papers (2025-09-24T13:35:15Z) - Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving [55.96227460521096]
Vision-Language Models (VLMs) have been integrated into autonomous driving systems to enhance reasoning capabilities. We propose a natural reflection-based backdoor attack targeting VLM systems in autonomous driving scenarios. Our findings uncover a new class of attacks that exploit the stringent real-time requirements of autonomous driving.
arXiv Detail & Related papers (2025-05-09T20:28:17Z) - An object detection approach for lane change and overtake detection from motion profiles [3.545178658731506]
In this paper, we address the identification of overtake and lane change maneuvers with a novel object detection approach applied to motion profiles. To train and test our model we created an internal dataset of motion profile images obtained from a heterogeneous set of dashcam videos. In addition to a standard object-detection approach, we show how the inclusion of CoordConvolution layers further improves the model performance.
arXiv Detail & Related papers (2025-02-06T17:36:35Z) - Black-Box Adversarial Attack on Vision Language Models for Autonomous Driving [65.61999354218628]
We take the first step toward designing black-box adversarial attacks specifically targeting vision-language models (VLMs) in autonomous driving systems. We propose Cascading Adversarial Disruption (CAD), which targets low-level reasoning breakdown by generating and injecting semantics. We present Risky Scene Induction, which addresses dynamic adaptation by leveraging a surrogate VLM to understand and construct high-level risky scenarios.
arXiv Detail & Related papers (2025-01-23T11:10:02Z) - Cross-Camera Distracted Driver Classification through Feature Disentanglement and Contrastive Learning [13.613407983544427]
The Driver Behavior Monitoring Network (DBMNet) relies on a lightweight backbone and integrates a disentanglement module to discard camera view information. DBMNet achieves an improvement of 7% in Top-1 accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-11-20T10:27:12Z) - Towards Infusing Auxiliary Knowledge for Distracted Driver Detection [11.816566371802802]
Distracted driving is a leading cause of road accidents globally.
We propose KiD3, a novel method for distracted driver detection (DDD) by infusing auxiliary knowledge about semantic relations between entities in a scene and the structural configuration of the driver's pose.
Specifically, we construct a unified framework that integrates the scene graphs, and driver pose information with the visual cues in video frames to create a holistic representation of the driver's actions.
arXiv Detail & Related papers (2024-08-29T15:28:42Z) - Text-Driven Traffic Anomaly Detection with Temporal High-Frequency Modeling in Driving Videos [22.16190711818432]
We introduce TTHF, a novel single-stage method aligning video clips with text prompts, offering a new perspective on traffic anomaly detection.
Unlike previous approaches, the supervised signal of our method is derived from languages rather than one-hot vectors, providing a more comprehensive representation.
It is shown that our proposed TTHF achieves promising performance, outperforming state-of-the-art competitors by +5.4% AUC on the DoTA dataset.
arXiv Detail & Related papers (2024-01-07T15:47:19Z) - PoseViNet: Distracted Driver Action Recognition Framework Using Multi-View Pose Estimation and Vision Transformer [1.319058156672392]
This paper introduces a novel method for detection of driver distraction using multi-view driver action images.
The proposed method is a vision transformer-based framework with pose estimation and action inference, namely PoseViNet.
The PoseViNet achieves 97.55% validation accuracy and 90.92% testing accuracy with the challenging dataset.
arXiv Detail & Related papers (2023-12-22T10:13:10Z) - DRUformer: Enhancing the driving scene Important object detection with driving relationship self-understanding [50.81809690183755]
Traffic accidents frequently lead to fatal injuries, contributing to over 50 million deaths as of 2023.
Previous research primarily assessed the importance of individual participants, treating them as independent entities.
We introduce Driving scene Relationship self-Understanding transformer (DRUformer) to enhance the important object detection task.
arXiv Detail & Related papers (2023-11-11T07:26:47Z) - FBLNet: FeedBack Loop Network for Driver Attention Prediction [50.936478241688114]
Non-objective driving experience is difficult to model, so a mechanism simulating the driver experience accumulation procedure is absent in existing methods. We propose a FeedBack Loop Network (FBLNet), which attempts to model the driving experience accumulation procedure. Our model exhibits a solid advantage over existing methods, achieving an outstanding performance improvement on two driver attention benchmark datasets.
arXiv Detail & Related papers (2022-12-05T08:25:09Z) - Driver Glance Classification In-the-wild: Towards Generalization Across Domains and Subjects [5.562102367018285]
Advanced driver-assistance systems (ADAS) with the ability to detect driver distraction can help prevent accidents and improve driver safety.
We propose a model that takes as input a patch of the driver's face along with a crop of the eye-region and classifies their glance into 6 coarse regions-of-interest (ROIs) in the vehicle.
arXiv Detail & Related papers (2020-12-05T00:23:01Z) - Studying Person-Specific Pointing and Gaze Behavior for Multimodal Referencing of Outside Objects from a Moving Vehicle [58.720142291102135]
Hand pointing and eye gaze have been extensively investigated in automotive applications for object selection and referencing.
Existing outside-the-vehicle referencing methods focus on a static situation, whereas the situation in a moving vehicle is highly dynamic and subject to safety-critical constraints.
We investigate the specific characteristics of each modality and the interaction between them when used in the task of referencing outside objects.
arXiv Detail & Related papers (2020-09-23T14:56:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.