Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos
- URL: http://arxiv.org/abs/2306.10159v4
- Date: Thu, 21 Mar 2024 04:17:26 GMT
- Title: Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos
- Authors: Md Zahid Hasan, Jiajing Chen, Jiyang Wang, Mohammed Shaiqur Rahman, Ameya Joshi, Senem Velipasalar, Chinmay Hegde, Anuj Sharma, Soumik Sarkar
- Abstract summary: This paper proposes a CLIP-based driver activity recognition approach that identifies driver distraction from naturalistic driving images and videos.
Our results show that this framework offers state-of-the-art performance on zero-shot transfer and video-based CLIP for predicting the driver's state on two public datasets.
- Score: 29.529768377746194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognizing the activities causing distraction in real-world driving scenarios is critical for ensuring the safety and reliability of both drivers and pedestrians on the roadways. Conventional computer vision techniques are typically data-intensive and require a large volume of annotated training data to detect and classify various distracted driving behaviors, thereby limiting their efficiency and scalability. We aim to develop a generalized framework that showcases robust performance with access to limited or no annotated training data. Recently, vision-language models have offered large-scale visual-textual pretraining that can be adapted to task-specific learning like distracted driving activity recognition. Vision-language pretraining models, such as CLIP, have shown significant promise in learning natural language-guided visual representations. This paper proposes a CLIP-based driver activity recognition approach that identifies driver distraction from naturalistic driving images and videos. CLIP's vision embedding offers zero-shot transfer and task-based finetuning, which can classify distracted activities from driving video data. Our results show that this framework offers state-of-the-art performance on zero-shot transfer and video-based CLIP for predicting the driver's state on two public datasets. We propose both frame-based and video-based frameworks developed on top of the CLIP's visual representation for distracted driving detection and classification tasks and report the results.
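The zero-shot transfer described in the abstract boils down to comparing a CLIP image embedding against text embeddings of natural-language driver-state prompts and choosing the closest one. Below is a minimal sketch of that similarity-and-softmax logic, assuming placeholder NumPy arrays in place of real CLIP encoder outputs and hypothetical prompt strings; it illustrates the classification step only, not the paper's actual implementation.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=100.0):
    """Pick the driver-state prompt whose embedding is most similar
    to the image embedding (CLIP-style zero-shot classification)."""
    # L2-normalize so dot products equal cosine similarities
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)   # one similarity score per prompt
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax over prompts
    return int(np.argmax(probs)), probs

# Hypothetical prompts; real class labels would come from the dataset.
prompts = ["a photo of a driver texting on a phone",
           "a photo of a driver drinking a beverage",
           "a photo of a driver driving attentively"]

# Placeholder embeddings standing in for CLIP encoder outputs.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(len(prompts), 512))
image_emb = text_embs[0] + 0.1 * rng.normal(size=512)  # near prompt 0

idx, probs = zero_shot_classify(image_emb, text_embs)
print(prompts[idx])
```

In a real pipeline the placeholder arrays would be replaced by the outputs of CLIP's image and text encoders; the frame-based variant applies this per frame, while a video-based variant would first aggregate frame embeddings over time.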
Related papers
- Towards Infusing Auxiliary Knowledge for Distracted Driver Detection [11.816566371802802]
Distracted driving is a leading cause of road accidents globally.
We propose KiD3, a novel method for distracted driver detection (DDD) by infusing auxiliary knowledge about semantic relations between entities in a scene and the structural configuration of the driver's pose.
Specifically, we construct a unified framework that integrates the scene graphs, and driver pose information with the visual cues in video frames to create a holistic representation of the driver's actions.
arXiv Detail & Related papers (2024-08-29T15:28:42Z)
- Federated Learning for Drowsiness Detection in Connected Vehicles [0.19116784879310028]
Driver monitoring systems can assist in determining the driver's state.
Driver drowsiness detection presents a potential solution.
Transmitting the data to a central machine for model training is impractical due to the large data size and privacy concerns.
We propose a federated learning framework for drowsiness detection within a vehicular network, leveraging the YawDD dataset.
arXiv Detail & Related papers (2024-05-06T09:39:13Z)
- Driver Activity Classification Using Generalizable Representations from Vision-Language Models [0.0]
We present a novel approach leveraging generalizable representations from vision-language models for driver activity classification.
Our results suggest that vision-language representations offer a promising avenue for driver monitoring systems.
arXiv Detail & Related papers (2024-04-23T10:42:24Z)
- Policy Pre-training for End-to-end Autonomous Driving via Self-supervised Geometric Modeling [96.31941517446859]
We propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for the policy pretraining in visuomotor driving.
We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos.
In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input.
In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only.
arXiv Detail & Related papers (2023-01-03T08:52:49Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features effectively complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
arXiv Detail & Related papers (2022-04-20T04:47:01Z)
- TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration [31.908276711898548]
We present a vision-based framework for recognizing secondary driver behaviours based on visual transformers and an augmented feature distribution calibration module.
Our framework consistently leads to better recognition rates, surpassing previous state-of-the-art results on the public Drive&Act benchmark at all levels.
arXiv Detail & Related papers (2022-03-02T08:14:06Z)
- PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is a self-supervised learning (SSL) framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z)
- Self-Supervised Steering Angle Prediction for Vehicle Control Using Visual Odometry [55.11913183006984]
We show how a model can be trained to control a vehicle's trajectory using camera poses estimated through visual odometry methods.
We propose a scalable framework that leverages trajectory information from several different runs using a camera setup placed at the front of a car.
arXiv Detail & Related papers (2021-03-20T16:29:01Z)
- The Multimodal Driver Monitoring Database: A Naturalistic Corpus to Study Driver Attention [44.94118128276982]
A smart vehicle should be able to monitor the actions and behaviors of the human driver to provide critical warnings or intervene when necessary.
Recent advancements in deep learning and computer vision have shown great promise in monitoring human behaviors and activities.
A vast amount of in-domain data is required to train models that provide high performance in predicting driving related tasks.
arXiv Detail & Related papers (2020-12-23T16:37:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.