Cross-View Cross-Modal Unsupervised Domain Adaptation for Driver Monitoring System
- URL: http://arxiv.org/abs/2511.12196v1
- Date: Sat, 15 Nov 2025 13:04:35 GMT
- Title: Cross-View Cross-Modal Unsupervised Domain Adaptation for Driver Monitoring System
- Authors: Aditi Bhalla, Christian Hellert, Enkelejda Kasneci
- Abstract summary: Driver distraction remains a leading cause of road traffic accidents, contributing to thousands of fatalities annually across the globe. Deep learning-based driver activity recognition methods have shown promise in detecting such distractions, but their effectiveness in real-world deployments is hindered by two critical challenges. We propose a novel two-phase cross-view, cross-modal unsupervised domain adaptation framework that addresses these challenges jointly on real-time driver monitoring data.
- Score: 11.688427092651914
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Driver distraction remains a leading cause of road traffic accidents, contributing to thousands of fatalities annually across the globe. While deep learning-based driver activity recognition methods have shown promise in detecting such distractions, their effectiveness in real-world deployments is hindered by two critical challenges: variations in camera viewpoints (cross-view) and domain shifts such as changes in sensor modality or environment. Existing methods typically address either cross-view generalization or unsupervised domain adaptation in isolation, leaving a gap in the robust and scalable deployment of models across diverse vehicle configurations. In this work, we propose a novel two-phase cross-view, cross-modal unsupervised domain adaptation framework that addresses these challenges jointly on real-time driver monitoring data. In the first phase, we learn view-invariant and action-discriminative features within a single modality using contrastive learning on multi-view data. In the second phase, we perform domain adaptation to a new modality using an information bottleneck loss, without requiring any labeled data from the new domain. We evaluate our approach using state-of-the-art video transformers (Video Swin, MViT) and the multi-modal driver activity dataset Drive&Act, demonstrating that our joint framework improves top-1 accuracy on RGB video data by almost 50% compared to a supervised contrastive learning-based cross-view method, and outperforms unsupervised domain adaptation-only methods by up to 5%, using the same video transformer backbone.
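The first phase pairs clips of the same driver action recorded from different camera views and pulls their embeddings together while pushing apart embeddings of other clips. The abstract does not specify the exact loss, so the following is a minimal, hypothetical sketch of one common choice, an InfoNCE-style cross-view contrastive objective over paired embeddings; the function names and the batch layout (row i of each view is the same clip) are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-9):
    """Project each row embedding onto the unit sphere."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def cross_view_contrastive_loss(emb_a, emb_b, temperature=0.1):
    """InfoNCE over paired embeddings of the same clip from two camera views.

    emb_a[i] and emb_b[i] come from the same action clip (positives);
    all cross pairs (i, j != i) act as negatives.
    """
    za, zb = l2_normalize(emb_a), l2_normalize(emb_b)
    logits = za @ zb.T / temperature              # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives sit on the diagonal

# Toy check: perfectly aligned view pairs give a lower loss than random pairs.
rng = np.random.default_rng(0)
paired = rng.normal(size=(8, 16))
loss_aligned = cross_view_contrastive_loss(paired, paired)
loss_random = cross_view_contrastive_loss(paired, rng.normal(size=(8, 16)))
print(loss_aligned < loss_random)
```

Minimizing this loss makes the embedding of an action clip depend as little as possible on the camera view it was recorded from, which is the view-invariance property the first phase targets before the second phase adapts to a new modality.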
Related papers
- Data Augmentation Strategies for Robust Lane Marking Detection [5.140388424678906]
This paper addresses the challenge of domain shift for side-mounted cameras used in lane-wheel monitoring by introducing a generative AI-based data enhancement pipeline. The approach combines geometric perspective transformation, AI-driven inpainting, and vehicle body overlays to simulate deployment-specific viewpoints. Trained with the augmented data, both models show improved robustness to different conditions, including shadows.
arXiv Detail & Related papers (2025-11-24T00:47:27Z) - Domain-Shared Learning and Gradual Alignment for Unsupervised Domain Adaptation Visible-Infrared Person Re-Identification [15.508360109601973]
Visible-Infrared person Re-Identification (VI-ReID) has achieved remarkable performance on public datasets. However, due to the discrepancies between public datasets and real-world data, most existing VI-ReID algorithms struggle in real-life applications. We aim to transfer the knowledge learned from public data to real-world data without compromising accuracy or requiring the annotation of new samples.
arXiv Detail & Related papers (2025-11-20T09:45:37Z) - DINO-CoDT: Multi-class Collaborative Detection and Tracking with Vision Foundation Models [11.34839442803445]
We propose a multi-class collaborative detection and tracking framework tailored for diverse road users. We first present a detector with a global spatial attention fusion (GSAF) module, enhancing multi-scale feature learning for objects of varying sizes. Next, we introduce a tracklet RE-IDentification (REID) module that leverages visual semantics with a vision foundation model to effectively reduce ID SWitch (IDSW) errors.
arXiv Detail & Related papers (2025-06-09T02:49:10Z) - Towards Full-scene Domain Generalization in Multi-agent Collaborative Bird's Eye View Segmentation for Connected and Autonomous Driving [49.03947018718156]
We propose a unified domain generalization framework to be utilized during the training and inference stages of collaborative perception.
We also introduce an intra-system domain alignment mechanism to reduce or potentially eliminate the domain discrepancy among connected and autonomous vehicles.
arXiv Detail & Related papers (2023-11-28T12:52:49Z) - Cross-Domain Car Detection Model with Integrated Convolutional Block Attention Mechanism [3.3843451892622576]
A cross-domain car detection model with an integrated convolutional block attention mechanism is proposed.
Experimental results show that the performance of the model improves by 40% over the model without our framework.
arXiv Detail & Related papers (2023-05-31T17:28:13Z) - M$^2$DAR: Multi-View Multi-Scale Driver Action Recognition with Vision Transformer [5.082919518353888]
We present a multi-view, multi-scale framework for naturalistic driving action recognition and localization in untrimmed videos.
Our system features a weight-sharing, multi-scale Transformer-based action recognition network that learns robust hierarchical representations.
arXiv Detail & Related papers (2023-05-13T02:38:15Z) - DualCross: Cross-Modality Cross-Domain Adaptation for Monocular BEV Perception [30.113617846516398]
DualCross is a cross-modality cross-domain adaptation framework to facilitate the learning of a more robust BEV perception model.
This work results in the first open analysis of cross-domain cross-sensor perception and adaptation for monocular 3D tasks in the wild.
arXiv Detail & Related papers (2023-05-05T17:58:45Z) - Stagewise Unsupervised Domain Adaptation with Adversarial Self-Training for Road Segmentation of Remote Sensing Images [93.50240389540252]
Road segmentation from remote sensing images is a challenging task with wide ranges of application potentials.
We propose a novel stagewise domain adaptation model called RoadDA to address the domain shift (DS) issue in this field.
Experiment results on two benchmarks demonstrate that RoadDA can efficiently reduce the domain gap and outperforms state-of-the-art methods.
arXiv Detail & Related papers (2021-08-28T09:29:14Z) - Learning Cross-modal Contrastive Features for Video Domain Adaptation [138.75196499580804]
We propose a unified framework for video domain adaptation, which simultaneously regularizes cross-modal and cross-domain feature representations.
Specifically, we treat each modality in a domain as a view and leverage the contrastive learning technique with properly designed sampling strategies.
arXiv Detail & Related papers (2021-08-26T18:14:18Z) - Unsupervised Domain Adaptive 3D Detection with Multi-Level Consistency [90.71745178767203]
Deep learning-based 3D object detection has achieved unprecedented success with the advent of large-scale autonomous driving datasets.
Existing 3D domain adaptive detection methods often assume prior access to the target domain annotations, which is rarely feasible in the real world.
We study a more realistic setting, unsupervised 3D domain adaptive detection, which only utilizes source domain annotations.
arXiv Detail & Related papers (2021-07-23T17:19:23Z) - Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z) - Adversarial Bipartite Graph Learning for Video Domain Adaptation [50.68420708387015]
Domain adaptation techniques, which focus on adapting models between distributionally different domains, are rarely explored in the video recognition area.
Recent visual domain adaptation works that leverage adversarial learning to unify source and target video representations remain only moderately effective on video data.
This paper proposes an Adversarial Bipartite Graph (ABG) learning framework which directly models the source-target interactions.
arXiv Detail & Related papers (2020-07-31T03:48:41Z) - Unsupervised Domain Adaptation in Person re-ID via k-Reciprocal Clustering and Large-Scale Heterogeneous Environment Synthesis [76.46004354572956]
We introduce an unsupervised domain adaptation approach for person re-identification.
Experimental results show that the proposed ktCUDA and SHRED approach achieves an average improvement of +5.7 mAP in re-identification performance.
arXiv Detail & Related papers (2020-01-14T17:43:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.