Exploration of VLMs for Driver Monitoring Systems Applications
- URL: http://arxiv.org/abs/2503.12281v1
- Date: Sat, 15 Mar 2025 22:37:36 GMT
- Title: Exploration of VLMs for Driver Monitoring Systems Applications
- Authors: Paola Natalia Cañas, Marcos Nieto, Oihana Otaegui, Igor Rodríguez
- Abstract summary: In recent years, we have witnessed significant progress in emerging deep learning models, particularly Large Language Models (LLMs) and Vision-Language Models (VLMs). This paper presents our initial approach to implementing VLMs in Driver Monitoring Systems (DMS).
- Score: 3.59361692183907
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In recent years, we have witnessed significant progress in emerging deep learning models, particularly Large Language Models (LLMs) and Vision-Language Models (VLMs). These models have demonstrated promising results, indicating a new era of Artificial Intelligence (AI) that surpasses previous methodologies. Their extensive knowledge and zero-shot capabilities suggest a paradigm shift in developing deep learning solutions, moving from data capturing and algorithm training to just writing appropriate prompts. While the application of these technologies has been explored across various industries, including automotive, there is a notable gap in the scientific literature regarding their use in Driver Monitoring Systems (DMS). This paper presents our initial approach to implementing VLMs in this domain, utilising the Driver Monitoring Dataset to evaluate their performance and discussing their advantages and challenges when implemented in real-world scenarios.
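The "just writing appropriate prompts" workflow described in the abstract can be made concrete with a minimal sketch: a single in-cabin frame is sent to an off-the-shelf VLM together with a closed-set classification prompt, and the reply is read back as a label, with no data capture or training step. The paper does not specify this interface; the OpenAI Python client, the gpt-4o model name, the frame.jpg path, and the label set below are illustrative assumptions, not details from the paper.

```python
# Minimal zero-shot sketch: classify one driver-camera frame with an
# off-the-shelf VLM via the OpenAI Python client. Model name, image path,
# and label set are illustrative assumptions, not taken from the paper.
import base64

from openai import OpenAI

# Hypothetical label set, loosely inspired by typical DMS distraction classes.
LABELS = ["safe_driving", "texting", "phone_call", "drinking", "reaching_behind"]

PROMPT = (
    "You are a driver monitoring assistant. Look at the in-cabin image and "
    f"answer with exactly one label from this list: {', '.join(LABELS)}."
)

def classify_frame(image_path: str, model: str = "gpt-4o") -> str:
    """Send one frame plus a closed-set prompt to the VLM; no training involved."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=10,  # the expected answer is a single label
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    print(classify_frame("frame.jpg"))  # "frame.jpg" is a placeholder path
```

Constraining the model to a fixed label list is one simple way to make free-form VLM output comparable against ground-truth annotations such as those in the Driver Monitoring Dataset.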
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, reflecting their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques [6.783762650831429]
We review the fundamental theories related to visual language models (VLMs) and the datasets constructed for them in remote sensing. We categorize the improvement methods into three main parts according to the core components of VLMs and provide a detailed introduction and comparison of these methods.
arXiv Detail & Related papers (2024-10-15T13:28:55Z)
- AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving [68.73885845181242]
We propose an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios.
We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
arXiv Detail & Related papers (2024-03-26T04:27:56Z)
- Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions [11.786387517781328]
Vision-Language Models (VLMs) are advanced models that can tackle more intricate tasks such as image captioning and visual question answering.
Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs, and models that both accept and produce multimodal inputs and outputs.
We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, as well as its strengths and limitations wherever possible.
arXiv Detail & Related papers (2024-02-20T18:57:34Z)
- Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
We present a systematic analysis of multi-modal multi-task visual understanding foundation models (MM-VUFMs) specifically designed for road scenes.
Our objective is to provide a comprehensive overview of common practices, covering task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques.
We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
arXiv Detail & Related papers (2024-02-05T12:47:09Z)
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems [14.355768064425598]
Generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data.
However, the computational intensity and memory consumption of deploying these models present substantial challenges in terms of serving efficiency.
This survey addresses the imperative need for efficient LLM serving methodologies from a machine learning system (MLSys) research perspective.
arXiv Detail & Related papers (2023-12-23T11:57:53Z)
- Voila-A: Aligning Vision-Language Models with User's Gaze Attention [56.755993500556734]
We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs).
We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
arXiv Detail & Related papers (2023-12-22T17:34:01Z)
- A Survey of Serverless Machine Learning Model Inference [0.0]
Recent developments in Generative AI, Computer Vision, and Natural Language Processing have led to increased integration of AI models into various products.
This survey aims to summarize and categorize the emerging challenges and optimization opportunities for large-scale deep learning serving systems.
arXiv Detail & Related papers (2023-11-22T18:46:05Z)
- Vision Language Models in Autonomous Driving: A Survey and Outlook [26.70381732289961]
Vision-Language Models (VLMs) have attracted widespread attention due to their outstanding performance and their ability to leverage Large Language Models (LLMs).
We present a comprehensive and systematic survey of the advances in vision language models in this domain, encompassing perception and understanding, navigation and planning, decision-making and control, end-to-end autonomous driving, and data generation.
arXiv Detail & Related papers (2023-10-22T21:06:10Z)
- Recommender Systems in the Era of Large Language Models (LLMs) [62.0129013439038]
Large Language Models (LLMs) have revolutionized the fields of Natural Language Processing (NLP) and Artificial Intelligence (AI).
We conduct a comprehensive review of LLM-empowered recommender systems from various aspects including Pre-training, Fine-tuning, and Prompting.
arXiv Detail & Related papers (2023-07-05T06:03:40Z)
- A Survey on Large Language Models for Recommendation [77.91673633328148]
Large Language Models (LLMs) have emerged as powerful tools in the field of Natural Language Processing (NLP).
This survey presents a taxonomy that categorizes these models into two major paradigms: Discriminative LLMs for Recommendation (DLLM4Rec) and Generative LLMs for Recommendation (GLLM4Rec).
arXiv Detail & Related papers (2023-05-31T13:51:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.