ADAPT: Action-aware Driving Caption Transformer
- URL: http://arxiv.org/abs/2302.00673v1
- Date: Wed, 1 Feb 2023 18:59:19 GMT
- Title: ADAPT: Action-aware Driving Caption Transformer
- Authors: Bu Jin, Xinyu Liu, Yupeng Zheng, Pengfei Li, Hao Zhao, Tong Zhang,
Yuhang Zheng, Guyue Zhou and Jingjing Liu
- Abstract summary: We propose an end-to-end transformer-based architecture, ADAPT, which provides user-friendly natural language narrations and reasoning for each decision-making step of autonomous vehicular control and action.
Experiments on BDD-X dataset demonstrate state-of-the-art performance of the ADAPT framework on both automatic metrics and human evaluation.
To illustrate the feasibility of the proposed framework in real-world applications, we build a novel deployable system that takes raw car videos as input and outputs the action narrations and reasoning in real time.
- Score: 24.3857045947027
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: End-to-end autonomous driving has great potential in the transportation
industry. However, the lack of transparency and interpretability of the
automatic decision-making process hinders its industrial adoption in practice.
There have been some early attempts to use attention maps or cost volumes for
better model explainability, but these are difficult for ordinary passengers to
understand. To bridge the gap, we propose an end-to-end transformer-based
architecture, ADAPT (Action-aware Driving cAPtion Transformer), which provides
user-friendly natural language narrations and reasoning for each
decision-making step of autonomous vehicular control and action. ADAPT jointly trains
both the driving caption task and the vehicular control prediction task,
through a shared video representation. Experiments on BDD-X (Berkeley DeepDrive
eXplanation) dataset demonstrate state-of-the-art performance of the ADAPT
framework on both automatic metrics and human evaluation. To illustrate the
feasibility of the proposed framework in real-world applications, we build a
novel deployable system that takes raw car videos as input and outputs the
action narrations and reasoning in real time. The code, models and data are
available at https://github.com/jxbbb/ADAPT.
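The abstract describes ADAPT's key design choice: the driving-caption task and the vehicular-control prediction task are trained jointly on a shared video representation. The snippet below is a minimal sketch of that joint-training idea, not the authors' implementation (the released code at the repository above is authoritative); all module names, dimensions, output heads, and the unweighted loss sum are illustrative assumptions.

```python
# Minimal sketch (illustrative only): a shared video encoder feeds both a
# captioning head and a control-prediction head; the two losses are summed.
import torch
import torch.nn as nn

class SharedVideoEncoder(nn.Module):
    """Toy stand-in for the video encoder: per-frame features -> shared tokens."""
    def __init__(self, frame_dim=512, hidden=256):
        super().__init__()
        self.proj = nn.Linear(frame_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, frames):                   # frames: (B, T, frame_dim)
        return self.encoder(self.proj(frames))   # shared tokens: (B, T, hidden)

class AdaptLikeModel(nn.Module):
    """Assumed two-head layout: caption logits per token, control values per clip."""
    def __init__(self, vocab_size=10000, hidden=256, n_control=2):
        super().__init__()
        self.backbone = SharedVideoEncoder(hidden=hidden)
        self.caption_head = nn.Linear(hidden, vocab_size)    # word logits
        self.control_head = nn.Linear(hidden, n_control)     # e.g. speed, steering

    def forward(self, frames):
        feats = self.backbone(frames)                        # shared representation
        caption_logits = self.caption_head(feats)            # (B, T, vocab_size)
        control_pred = self.control_head(feats.mean(dim=1))  # (B, n_control)
        return caption_logits, control_pred

# Joint objective: captioning cross-entropy + control regression (MSE), summed
# with equal weight here purely for illustration.
model = AdaptLikeModel()
frames = torch.randn(2, 8, 512)                  # 2 clips, 8 frames, toy features
caption_targets = torch.randint(0, 10000, (2, 8))
control_targets = torch.randn(2, 2)

caption_logits, control_pred = model(frames)
loss = nn.functional.cross_entropy(caption_logits.flatten(0, 1), caption_targets.flatten()) \
     + nn.functional.mse_loss(control_pred, control_targets)
loss.backward()
```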
Related papers
- Transfer Your Perspective: Controllable 3D Generation from Any Viewpoint in a Driving Scene [56.73568220959019]
Collaborative autonomous driving (CAV) seems like a promising direction, but collecting data for development is non-trivial.
We introduce a novel surrogate: generating realistic perception from different viewpoints in a driving scene.
We present the very first solution, using a combination of simulated collaborative data and real ego-car data.
arXiv Detail & Related papers (2025-02-10T17:07:53Z)
- Doe-1: Closed-Loop Autonomous Driving with Large World Model [63.99937807085461]
We propose a large Driving wOrld modEl (Doe-1) for unified perception, prediction, and planning.
We use free-form texts for perception and generate future predictions directly in the RGB space with image tokens.
For planning, we employ a position-aware tokenizer to effectively encode action into discrete tokens.
arXiv Detail & Related papers (2024-12-12T18:59:59Z)
- GPD-1: Generative Pre-training for Driving [77.06803277735132]
We propose a unified Generative Pre-training for Driving (GPD-1) model to accomplish all these tasks.
We represent each scene with ego, agent, and map tokens and formulate autonomous driving as a unified token generation problem.
Our GPD-1 successfully generalizes to various tasks without finetuning, including scene generation, traffic simulation, closed-loop simulation, map prediction, and motion planning.
arXiv Detail & Related papers (2024-12-11T18:59:51Z)
- Pedestrian motion prediction evaluation for urban autonomous driving [0.0]
We analyze selected publications with provided open-source solutions to determine the value of traditional motion prediction metrics.
This perspective should be valuable to any autonomous driving or robotics engineer looking for the real-world performance of existing state-of-the-art pedestrian motion prediction methods.
arXiv Detail & Related papers (2024-10-22T10:06:50Z)
- DriveCoT: Integrating Chain-of-Thought Reasoning with End-to-End Driving [81.04174379726251]
This paper collects a comprehensive end-to-end driving dataset named DriveCoT.
It contains sensor data, control decisions, and chain-of-thought labels to indicate the reasoning process.
We propose a baseline model called DriveCoT-Agent, trained on our dataset, to generate chain-of-thought predictions and final decisions.
arXiv Detail & Related papers (2024-03-25T17:59:01Z)
- DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving [69.82743399946371]
DriveMLM is a framework that can perform closed-loop autonomous driving in realistic simulators.
We employ a multi-modal LLM (MLLM) to model the behavior planning module of a modular AD system.
This model can be plugged into existing AD systems such as Apollo for closed-loop driving.
arXiv Detail & Related papers (2023-12-14T18:59:05Z)
- On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving [37.617793990547625]
This report provides an exhaustive evaluation of the latest state-of-the-art VLM, GPT-4V.
We explore the model's abilities to understand and reason about driving scenes, make decisions, and ultimately act in the capacity of a driver.
Our findings reveal that GPT-4V demonstrates superior performance in scene understanding and causal reasoning compared to existing autonomous systems.
arXiv Detail & Related papers (2023-11-09T12:58:37Z)
- Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models [114.69732301904419]
We present an approach to apply end-to-end open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z)
- Fully End-to-end Autonomous Driving with Semantic Depth Cloud Mapping and Multi-Agent [2.512827436728378]
We propose a novel deep learning model trained in an end-to-end, multi-task learning manner to perform both perception and control tasks simultaneously.
The model is evaluated on the CARLA simulator with various scenarios composed of normal and adversarial situations and different weather conditions to mimic the real world.
arXiv Detail & Related papers (2022-04-12T03:57:01Z)