Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving
- URL: http://arxiv.org/abs/2501.06680v1
- Date: Sun, 12 Jan 2025 01:31:07 GMT
- Title: Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving
- Authors: Haoxiang Gao, Yu Zhao,
- Abstract summary: We analyze effective knowledge distillation of semantic labels to smaller Vision networks.<n>This can be used for the semantic representation of complex scenes for downstream decision-making for planning and control.
- Score: 2.0122032639916485
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autonomous driving (AD) has experienced significant improvements in recent years and achieved promising 3D detection, classification, and localization results. However, many challenges remain, e.g. semantic understanding of pedestrians' behaviors, and downstream handling for pedestrian interactions. Recent studies in applications of Large Language Models (LLM) and Vision-Language Models (VLM) have achieved promising results in scene understanding and high-level maneuver planning in diverse traffic scenarios. However, deploying the billion-parameter LLMs to vehicles requires significant computation and memory resources. In this paper, we analyzed effective knowledge distillation of semantic labels to smaller Vision networks, which can be used for the semantic representation of complex scenes for downstream decision-making for planning and control.
Related papers
- V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving [2.3302708486956454]
We introduce V3LMA, a novel approach that enhances 3D scene understanding by integrating Large Language Models (LLMs) with LVLMs.
V3LMA leverages textual descriptions generated from object detections and video inputs, significantly boosting performance without requiring fine-tuning.
Our method improves situational awareness and decision-making in complex traffic scenarios, achieving a score of 0.56 on the LingoQA benchmark.
arXiv Detail & Related papers (2025-04-30T20:00:37Z) - OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning [68.45848423501927]
We propose a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning.
Our approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions.
arXiv Detail & Related papers (2025-04-06T03:54:21Z) - NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models [11.184459657989914]
We introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding.
We also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs.
Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives.
arXiv Detail & Related papers (2025-03-17T03:12:39Z) - Scenario Understanding of Traffic Scenes Through Large Visual Language Models [2.3302708486956454]
Large Visual Language Models (LVLMs) present a compelling solution by automating image analysis and categorization through contextual queries.
In this study, we evaluate the capabilities of LVLMs to understand and classify urban traffic scenes on both an in-house dataset and the BDD100K.
We propose a scalable captioning pipeline that integrates state-of-the-art models, enabling a flexible deployment on new datasets.
arXiv Detail & Related papers (2025-01-28T18:23:12Z) - Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving [20.33096710167997]
generative planning with 3D-vision language pre-training model named GPVL is proposed for end-to-end autonomous driving.
Cross-modal language model is introduced to generate holistic driving decisions and fine-grained trajectories.
It is believed that the effective, robust and efficient performance of GPVL is crucial for the practical application of future autonomous driving systems.
arXiv Detail & Related papers (2025-01-15T15:20:46Z) - VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision [20.43366384946928]
Vision-language models (VLMs) as teachers to enhance training.<n>VLM-AD achieves significant improvements in planning accuracy and reduced collision rates on the nuScenes dataset.
arXiv Detail & Related papers (2024-12-19T01:53:36Z) - Probing Multimodal LLMs as World Models for Driving [72.18727651074563]
We look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving.
Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored.
arXiv Detail & Related papers (2024-05-09T17:52:42Z) - OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning [68.45848423501927]
We propose a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning.
Our approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions.
arXiv Detail & Related papers (2024-05-02T17:59:24Z) - Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving [0.0]
We develop an efficient, lightweight, multi-frame vision language model which performs Visual Question Answering for autonomous driving.
In comparison to previous approaches, EM-VLM4AD requires at least 10 times less memory and floating point operations.
arXiv Detail & Related papers (2024-03-28T21:18:33Z) - VLP: Vision Language Planning for Autonomous Driving [52.640371249017335]
This paper presents a novel Vision-Language-Planning framework that exploits language models to bridge the gap between linguistic understanding and autonomous driving.
It achieves state-of-the-art end-to-end planning performance on the NuScenes dataset by achieving 35.9% and 60.5% reduction in terms of average L2 error and collision rates, respectively.
arXiv Detail & Related papers (2024-01-10T23:00:40Z) - Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data
Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z) - Detecting 32 Pedestrian Attributes for Autonomous Vehicles [103.87351701138554]
In this paper, we address the problem of jointly detecting pedestrians and recognizing 32 pedestrian attributes.
We introduce a Multi-Task Learning (MTL) model relying on a composite field framework, which achieves both goals in an efficient way.
We show competitive detection and attribute recognition results, as well as a more stable MTL training.
arXiv Detail & Related papers (2020-12-04T15:10:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.