Embodied Understanding of Driving Scenarios
- URL: http://arxiv.org/abs/2403.04593v1
- Date: Thu, 7 Mar 2024 15:39:18 GMT
- Title: Embodied Understanding of Driving Scenarios
- Authors: Yunsong Zhou, Linyan Huang, Qingwen Bu, Jia Zeng, Tianyu Li, Hang Qiu,
Hongzi Zhu, Minyi Guo, Yu Qiao, Hongyang Li
- Abstract summary: Embodied scene understanding serves as the cornerstone for autonomous agents to perceive, interpret, and respond to open driving scenarios.
Here, we introduce the Embodied Language Model (ELM), a comprehensive framework tailored for agents' understanding of driving scenes with large spatial and temporal spans.
ELM incorporates space-aware pre-training to endow the agent with robust spatial localization capabilities.
- Score: 44.21311841582762
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Embodied scene understanding serves as the cornerstone for autonomous agents
to perceive, interpret, and respond to open driving scenarios. Such
understanding is typically founded upon Vision-Language Models (VLMs).
Nevertheless, existing VLMs are restricted to the 2D domain, devoid of spatial
awareness and long-horizon extrapolation proficiencies. We revisit the key
aspects of autonomous driving and formulate appropriate rubrics. Hereby, we
introduce the Embodied Language Model (ELM), a comprehensive framework tailored
for agents' understanding of driving scenes with large spatial and temporal
spans. ELM incorporates space-aware pre-training to endow the agent with robust
spatial localization capabilities. Besides, the model employs time-aware token
selection to accurately inquire about temporal cues. We instantiate ELM on the
reformulated multi-faceted benchmark, and it surpasses previous state-of-the-art
approaches in all aspects. All code, data, and models will be publicly shared.
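The abstract only names the time-aware token selection mechanism without detailing it; below is a minimal sketch, with assumed shapes and a hypothetical select_temporal_tokens helper, of what such a step could look like: score each frame's visual tokens against a temporal query and keep only the top-k frames before passing them to the language model. This is an illustrative assumption, not the authors' implementation.
```python
# Minimal sketch of a time-aware token selection step (assumed, for illustration only).
import torch
import torch.nn.functional as F


def select_temporal_tokens(frame_tokens: torch.Tensor,
                           query: torch.Tensor,
                           k: int = 4) -> torch.Tensor:
    """Pick the k frames most relevant to a temporal query.

    frame_tokens: (T, N, D) visual tokens for T timestamps, N tokens each.
    query:        (D,) embedding of the temporal question (e.g. "what happened 2s ago?").
    Returns:      (k, N, D) tokens from the k highest-scoring frames.
    """
    # Summarize each frame by mean-pooling its tokens: (T, D)
    frame_summary = frame_tokens.mean(dim=1)
    # Cosine similarity between each frame summary and the query: (T,)
    scores = F.cosine_similarity(frame_summary, query.unsqueeze(0), dim=-1)
    # Keep the top-k frames, preserving temporal order for the language model.
    topk = scores.topk(k).indices.sort().values
    return frame_tokens[topk]


# Toy usage: 16 frames, 64 tokens per frame, 256-dim features.
tokens = torch.randn(16, 64, 256)
q = torch.randn(256)
selected = select_temporal_tokens(tokens, q, k=4)
print(selected.shape)  # torch.Size([4, 64, 256])
```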
Related papers
- STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding? [15.419733591210514]
Multimodal Large Language Models (MLLMs) are increasingly used as an end-to-end solution for Embodied AI and Autonomous Driving.
We introduce STI-Bench, a benchmark designed to evaluate MLLMs' spatial-temporal understanding.
Our benchmark encompasses a wide range of robot and vehicle operations across desktop, indoor, and outdoor scenarios.
arXiv Detail & Related papers (2025-03-31T06:30:35Z) - InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving [3.8737986316149775]
We propose a novel end-to-end autonomous driving method called InsightDrive.
It organizes perception by language-guided scene representation.
In experiments, InsightDrive achieves state-of-the-art performance in end-to-end autonomous driving.
arXiv Detail & Related papers (2025-03-17T10:52:32Z) - Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning [24.511628941825116]
We introduce Sce2DriveX, a human-like driving chain-of-thought (CoT) reasoning framework.
It reconstructs the implicit cognitive chain inherent in human driving, covering scene understanding, meta-action reasoning, behavior interpretation analysis, motion planning and control.
It achieves state-of-the-art performance from scene understanding to end-to-end driving, as well as robust generalization on the CARLA Bench2Drive benchmark.
arXiv Detail & Related papers (2025-02-19T09:50:44Z) - HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation [59.675030933810106]
We present a unified Driving World Model named HERMES.
We seamlessly integrate 3D scene understanding and future scene evolution (generation) through a unified framework in driving scenarios.
HERMES achieves state-of-the-art performance, reducing generation error by 32.4% and improving understanding metrics such as CIDEr by 8.0%.
arXiv Detail & Related papers (2025-01-24T18:59:51Z) - VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision [20.43366384946928]
VLM-AD leverages vision-language models (VLMs) as teachers to enhance training.
It achieves significant improvements in planning accuracy and reductions in collision rates on the nuScenes dataset.
arXiv Detail & Related papers (2024-12-19T01:53:36Z) - OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving [12.004183122121042]
OccLLaMA is an occupancy-language-action generative world model.
We build a unified multi-modal vocabulary for vision, language and action.
OccLLaMA achieves competitive performance across multiple tasks.
arXiv Detail & Related papers (2024-09-05T06:30:01Z) - OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving [62.54220021308464]
We propose a diffusion-based 4D occupancy generation model, OccSora, to simulate the development of the 3D world for autonomous driving.
OccSora can generate 16s-videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes.
arXiv Detail & Related papers (2024-05-30T17:59:42Z) - QuAD: Query-based Interpretable Neural Motion Planning for Autonomous Driving [33.609780917199394]
Self-driving vehicles must understand their environment to determine appropriate actions.
Traditional systems rely on object detection to find agents in the scene.
We present a unified, interpretable, and efficient autonomy framework that moves away from cascading perception-prediction-planning modules and instead queries a learned occupancy representation at the spatio-temporal points relevant to planning.
arXiv Detail & Related papers (2024-04-01T21:11:43Z) - On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving [37.617793990547625]
This report provides an exhaustive evaluation of the latest state-of-the-art VLM, GPT-4V.
We explore the model's abilities to understand and reason about driving scenes, make decisions, and ultimately act in the capacity of a driver.
Our findings reveal that GPT-4V demonstrates superior performance in scene understanding and causal reasoning compared to existing autonomous systems.
arXiv Detail & Related papers (2023-11-09T12:58:37Z) - LLM4Drive: A Survey of Large Language Models for Autonomous Driving [62.10344445241105]
Large language models (LLMs) have demonstrated abilities such as understanding context, performing logical reasoning, and generating answers.
In this paper, we systematically review the research line of Large Language Models for Autonomous Driving (LLM4AD).
arXiv Detail & Related papers (2023-11-02T07:23:33Z) - Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models [114.69732301904419]
We present an approach for end-to-end, open-set (any environment/scene) autonomous driving that can provide driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z) - Context-Aware Timewise VAEs for Real-Time Vehicle Trajectory Prediction [4.640835690336652]
We present ContextVAE, a context-aware approach for multi-modal vehicle trajectory prediction.
Our approach takes into account both the social features exhibited by agents on the scene and the physical environment constraints.
In all tested datasets, ContextVAE models are fast to train and provide high-quality multi-modal predictions in real-time.
arXiv Detail & Related papers (2023-02-21T18:42:24Z) - A Spatio-Temporal Multilayer Perceptron for Gesture Recognition [70.34489104710366]
We propose a multilayer state-weighted perceptron for gesture recognition in the context of autonomous vehicles.
An evaluation on the TCG and Drive&Act datasets is provided to showcase the promising performance of our approach.
We deploy our model to our autonomous vehicle to show its real-time capability and stable execution.
arXiv Detail & Related papers (2022-04-25T08:42:47Z) - Learning to Move with Affordance Maps [57.198806691838364]
The ability to autonomously explore and navigate a physical space is a fundamental requirement for virtually any mobile autonomous agent.
Traditional SLAM-based approaches for exploration and navigation largely focus on leveraging scene geometry.
We show that learned affordance maps can be used to augment traditional approaches for both exploration and navigation, providing significant improvements in performance.
arXiv Detail & Related papers (2020-01-08T04:05:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.