Embodied Understanding of Driving Scenarios
- URL: http://arxiv.org/abs/2403.04593v1
- Date: Thu, 7 Mar 2024 15:39:18 GMT
- Title: Embodied Understanding of Driving Scenarios
- Authors: Yunsong Zhou, Linyan Huang, Qingwen Bu, Jia Zeng, Tianyu Li, Hang Qiu,
Hongzi Zhu, Minyi Guo, Yu Qiao, Hongyang Li
- Abstract summary: Embodied scene understanding serves as the cornerstone for autonomous agents to perceive, interpret, and respond to open driving scenarios.
Here, we introduce the Embodied Language Model (ELM), a comprehensive framework tailored for agents' understanding of driving scenes with large spatial and temporal spans.
ELM incorporates space-aware pre-training to endow the agent with robust spatial localization capabilities.
- Score: 44.21311841582762
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Embodied scene understanding serves as the cornerstone for autonomous agents
to perceive, interpret, and respond to open driving scenarios. Such
understanding is typically founded upon Vision-Language Models (VLMs).
Nevertheless, existing VLMs are restricted to the 2D domain, devoid of spatial
awareness and long-horizon extrapolation proficiencies. We revisit the key
aspects of autonomous driving and formulate appropriate rubrics. Hereby, we
introduce the Embodied Language Model (ELM), a comprehensive framework tailored
for agents' understanding of driving scenes with large spatial and temporal
spans. ELM incorporates space-aware pre-training to endow the agent with robust
spatial localization capabilities. Besides, the model employs time-aware token
selection to accurately inquire about temporal cues. We instantiate ELM on the
reformulated multi-faced benchmark, and it surpasses previous state-of-the-art
approaches in all aspects. All code, data, and models will be publicly shared.
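The abstract names space-aware pre-training and time-aware token selection without further detail. As an illustration only, the minimal sketch below shows one plausible reading of time-aware token selection: pool each past frame's visual tokens, score the pooled descriptors against a temporal query embedding, and keep the top-k frames. All function names, shapes, and the scoring rule are assumptions made for this sketch, not ELM's actual implementation.

```python
import torch
import torch.nn.functional as F

def select_time_aware_tokens(frame_tokens, temporal_query, k=4):
    """Hypothetical sketch of time-aware token selection (not ELM's module).

    frame_tokens:   (T, N, D) visual tokens for T past frames, N tokens each
    temporal_query: (D,) embedding of the temporal cue in the question
                    (e.g. "what happened a few seconds ago?")
    Returns the tokens of the k frames whose pooled descriptors best match
    the query, in temporal order.
    """
    # Pool each frame's tokens into a single descriptor: (T, D)
    frame_descriptors = frame_tokens.mean(dim=1)

    # Cosine similarity between every frame descriptor and the query: (T,)
    scores = F.cosine_similarity(frame_descriptors, temporal_query.unsqueeze(0), dim=-1)

    # Keep the k best-matching frames and restore their temporal order
    top_idx = scores.topk(k).indices.sort().values
    return frame_tokens[top_idx]           # (k, N, D)

if __name__ == "__main__":
    tokens = torch.randn(16, 64, 256)      # 16 frames, 64 tokens each, dim 256
    query = torch.randn(256)
    print(select_time_aware_tokens(tokens, query, k=4).shape)  # (4, 64, 256)
```

Restricting the language model to a handful of query-relevant frames is one simple way to cover long temporal spans without attending over every past token; the paper's own selection mechanism may differ.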
Related papers
- HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation [59.675030933810106]
We present a unified Driving World Model named HERMES.
We seamlessly integrate 3D scene understanding and future scene evolution (generation) through a unified framework in driving scenarios.
HERMES achieves state-of-the-art performance, reducing generation error by 32.4% and improving understanding metrics such as CIDEr by 8.0%.
arXiv Detail & Related papers (2025-01-24T18:59:51Z) - VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision [20.43366384946928]
VLM-AD leverages vision-language models (VLMs) as teachers to enhance training.
VLM-AD achieves significant improvements in planning accuracy and reduced collision rates on the nuScenes dataset.
arXiv Detail & Related papers (2024-12-19T01:53:36Z) - OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving [12.004183122121042]
OccLLaMA is an occupancy-language-action generative world model.
We build a unified multi-modal vocabulary for vision, language and action.
OccLLaMA achieves competitive performance across multiple tasks.
arXiv Detail & Related papers (2024-09-05T06:30:01Z) - QuAD: Query-based Interpretable Neural Motion Planning for Autonomous Driving [33.609780917199394]
Self-driving vehicles must understand their environment to determine appropriate actions.
Traditional systems rely on object detection to find agents in the scene.
We present a unified, interpretable, and efficient autonomy framework that moves away from cascading modules that first perceive, then predict, and finally plan; instead, the planner queries occupancy at relevant spatio-temporal points.
arXiv Detail & Related papers (2024-04-01T21:11:43Z) - On the Road with GPT-4V(ision): Early Explorations of Visual-Language
Model on Autonomous Driving [37.617793990547625]
This report provides an exhaustive evaluation of the latest state-of-the-art VLM, GPT-4V.
We explore the model's abilities to understand and reason about driving scenes, make decisions, and ultimately act in the capacity of a driver.
Our findings reveal that GPT-4V demonstrates superior performance in scene understanding and causal reasoning compared to existing autonomous systems.
arXiv Detail & Related papers (2023-11-09T12:58:37Z) - LLM4Drive: A Survey of Large Language Models for Autonomous Driving [62.10344445241105]
Large language models (LLMs) have demonstrated abilities including understanding context, logical reasoning, and generating answers.
In this paper, we systematically review the research line on Large Language Models for Autonomous Driving (LLM4AD).
arXiv Detail & Related papers (2023-11-02T07:23:33Z) - Drive Anywhere: Generalizable End-to-end Autonomous Driving with
Multi-modal Foundation Models [114.69732301904419]
We present an approach to apply end-to-end open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z) - A Spatio-Temporal Multilayer Perceptron for Gesture Recognition [70.34489104710366]
We propose a multilayer state-weighted perceptron for gesture recognition in the context of autonomous vehicles.
An evaluation on the TCG and Drive&Act datasets is provided to showcase the promising performance of our approach.
We deploy our model to our autonomous vehicle to show its real-time capability and stable execution.
arXiv Detail & Related papers (2022-04-25T08:42:47Z) - Learning to Move with Affordance Maps [57.198806691838364]
The ability to autonomously explore and navigate a physical space is a fundamental requirement for virtually any mobile autonomous agent.
Traditional SLAM-based approaches for exploration and navigation largely focus on leveraging scene geometry.
We show that learned affordance maps can be used to augment traditional approaches for both exploration and navigation, providing significant improvements in performance.
arXiv Detail & Related papers (2020-01-08T04:05:11Z)
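As a concrete reading of the last entry (Learning to Move with Affordance Maps), the short sketch below fuses a learned affordance map with a geometric occupancy grid into a single planning cost map. The grid layout, threshold, and weights are assumptions for illustration, not the paper's method.

```python
import numpy as np

def fuse_costmap(occupancy, affordance, alpha=0.5, obstacle_cost=1e6):
    """Hypothetical fusion of geometric occupancy with a learned affordance map.

    occupancy:  (H, W) array in [0, 1], probability a cell is occupied
                (e.g. from a SLAM or lidar pipeline)
    affordance: (H, W) array in [0, 1], learned probability the cell is
                safely traversable
    Returns a cost map where hard obstacles dominate and, elsewhere,
    low affordance raises the traversal cost. All weights are illustrative.
    """
    # Cells confidently occupied become near-impassable
    hard_block = (occupancy > 0.65).astype(float) * obstacle_cost
    # Elsewhere, blend geometric occupancy with (1 - affordance)
    soft_cost = alpha * occupancy + (1.0 - alpha) * (1.0 - affordance)
    return hard_block + soft_cost

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    occ = rng.random((64, 64))
    aff = rng.random((64, 64))
    cost = fuse_costmap(occ, aff)
    print(cost.shape, float(cost.min()), float(cost.max()))
```

A planner (A*, RRT*, or a sampling-based trajectory optimizer) can then consume the fused cost map in place of a purely geometric one; this is only a sketch of the general idea the entry describes.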