On the Road with GPT-4V(ision): Early Explorations of Visual-Language
Model on Autonomous Driving
- URL: http://arxiv.org/abs/2311.05332v2
- Date: Tue, 28 Nov 2023 09:47:57 GMT
- Title: On the Road with GPT-4V(ision): Early Explorations of Visual-Language
Model on Autonomous Driving
- Authors: Licheng Wen, Xuemeng Yang, Daocheng Fu, Xiaofeng Wang, Pinlong Cai,
Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng Zhu, Shaoyan Sun,
Yeqi Bai, Xinyu Cai, Min Dou, Shuanglu Hu, Botian Shi, Yu Qiao
- Abstract summary: This report provides an exhaustive evaluation of the latest state-of-the-art VLM, GPT-4V.
We explore the model's abilities to understand and reason about driving scenes, make decisions, and ultimately act in the capacity of a driver.
Our findings reveal that GPT-4V demonstrates superior performance in scene understanding and causal reasoning compared to existing autonomous systems.
- Score: 37.617793990547625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The pursuit of autonomous driving technology hinges on the sophisticated
integration of perception, decision-making, and control systems. Traditional
approaches, both data-driven and rule-based, have been hindered by their
inability to grasp the nuance of complex driving environments and the
intentions of other road users. This has been a significant bottleneck,
particularly in the development of common sense reasoning and nuanced scene
understanding necessary for safe and reliable autonomous driving. The advent of
Visual Language Models (VLM) represents a novel frontier in realizing fully
autonomous vehicle driving. This report provides an exhaustive evaluation of
the latest state-of-the-art VLM, GPT-4V(ision), and its application in
autonomous driving scenarios. We explore the model's abilities to understand
and reason about driving scenes, make decisions, and ultimately act in the
capacity of a driver. Our comprehensive tests span from basic scene recognition
to complex causal reasoning and real-time decision-making under varying
conditions. Our findings reveal that GPT-4V demonstrates superior performance
in scene understanding and causal reasoning compared to existing autonomous
systems. It showcases the potential to handle out-of-distribution scenarios,
recognize intentions, and make informed decisions in real driving contexts.
However, challenges remain, particularly in direction discernment, traffic
light recognition, vision grounding, and spatial reasoning tasks. These
limitations underscore the need for further research and development. The
project is now available on GitHub for interested parties to access and use:
https://github.com/PJLab-ADG/GPT4V-AD-Exploration
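The evaluation described above is prompt-based: GPT-4V is shown driving-scene images and asked to describe, reason about, and act on them. As a rough illustration of that style of query (not the authors' exact harness), here is a minimal sketch using the OpenAI Python SDK; the model name, prompt, and image path are assumptions:

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_gpt4v_about_scene(image_path: str, question: str) -> str:
    """Send one driving-scene image plus a question to a GPT-4V-style model."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed vision-capable model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=300,
    )
    return response.choices[0].message.content


# Hypothetical usage, mirroring the scene-understanding probes in the report:
# print(ask_gpt4v_about_scene(
#     "front_camera.jpg",
#     "Describe the traffic scene and state whether it is safe to change lanes."))
```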
Related papers
- Pedestrian motion prediction evaluation for urban autonomous driving [0.0]
We analyze selected publications that provide open-source solutions to assess the value of traditional motion prediction metrics.
This perspective should be valuable to any autonomous driving or robotics engineer seeking the real-world performance of existing state-of-the-art pedestrian motion prediction methods.
arXiv Detail & Related papers (2024-10-22T10:06:50Z) - Exploring the Causality of End-to-End Autonomous Driving [57.631400236930375]
We propose a comprehensive approach to explore and analyze the causality of end-to-end autonomous driving.
Our work is the first to unveil the mystery of end-to-end autonomous driving and turn the black box into a white one.
arXiv Detail & Related papers (2024-07-09T04:56:11Z) - Applications of Computer Vision in Autonomous Vehicles: Methods, Challenges and Future Directions [2.693342141713236]
This paper reviews publications on computer vision and autonomous driving published over the last ten years.
In particular, we first investigate the development of autonomous driving systems and summarize the systems developed by major automotive manufacturers from different countries.
Then, a comprehensive overview of computer vision applications for autonomous driving, such as depth estimation, object detection, lane detection, and traffic sign recognition, is presented.
arXiv Detail & Related papers (2023-11-15T16:41:18Z) - LLM4Drive: A Survey of Large Language Models for Autonomous Driving [62.10344445241105]
Large language models (LLMs) have demonstrated abilities including understanding context, logical reasoning, and generating answers.
In this paper, we systematically review the research line on Large Language Models for Autonomous Driving (LLM4AD).
arXiv Detail & Related papers (2023-11-02T07:23:33Z) - Drive Anywhere: Generalizable End-to-end Autonomous Driving with
Multi-modal Foundation Models [114.69732301904419]
We present an approach for end-to-end, open-set (any environment/scene) autonomous driving that can provide driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z) - DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [84.29836263441136]
This study introduces DriveGPT4, a novel interpretable end-to-end autonomous driving system based on multimodal large language models (MLLMs).
DriveGPT4 facilitates the interpretation of vehicle actions, offers pertinent reasoning, and effectively addresses a diverse range of questions posed by users.
arXiv Detail & Related papers (2023-10-02T17:59:52Z) - Drive Like a Human: Rethinking Autonomous Driving with Large Language
Models [28.957124302293966]
We explore the potential of using a large language model (LLM) to understand the driving environment in a human-like manner.
Our experiments show that the LLM exhibits the impressive ability to reason and solve long-tailed cases.
arXiv Detail & Related papers (2023-07-14T05:18:34Z) - Exploring Contextual Representation and Multi-Modality for End-to-End
Autonomous Driving [58.879758550901364]
Recent perception systems enhance spatial understanding with sensor fusion but often lack full environmental context.
We introduce a framework that integrates three cameras to emulate the human field of view, coupled with top-down bird's-eye-view semantic data, to enhance contextual representation.
Our method achieves a displacement error of 0.67 m in open-loop settings, surpassing current methods by 6.9% on the nuScenes dataset; a minimal sketch of this displacement-error metric is given after this list.
arXiv Detail & Related papers (2022-10-13T05:56:20Z) - Explainability of vision-based autonomous driving systems: Review and
challenges [33.720369945541805]
The need for explainability is strong in driving, a safety-critical application.
This survey gathers contributions from several research fields, namely computer vision, deep learning, autonomous driving, and explainable AI (X-AI).
arXiv Detail & Related papers (2021-01-13T19:09:38Z)
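For context on the displacement-error figure quoted for the contextual-representation paper above: open-loop planning benchmarks such as nuScenes typically report the average L2 distance between predicted and ground-truth future trajectories. A minimal sketch of that computation (array shapes and names are assumptions, not the paper's code):

```python
import numpy as np


def average_displacement_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean L2 distance between predicted and ground-truth waypoints.

    pred, gt: arrays of shape (num_samples, num_timesteps, 2) holding
    (x, y) positions in metres.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


# Toy example: a prediction offset by 0.67 m at every waypoint
# yields an average displacement error of 0.67 m.
gt = np.zeros((1, 6, 2))
pred = gt + np.array([0.67, 0.0])
print(average_displacement_error(pred, gt))  # -> 0.67
```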