On the Road with GPT-4V(ision): Early Explorations of Visual-Language
Model on Autonomous Driving
- URL: http://arxiv.org/abs/2311.05332v2
- Date: Tue, 28 Nov 2023 09:47:57 GMT
- Title: On the Road with GPT-4V(ision): Early Explorations of Visual-Language
Model on Autonomous Driving
- Authors: Licheng Wen, Xuemeng Yang, Daocheng Fu, Xiaofeng Wang, Pinlong Cai,
Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng Zhu, Shaoyan Sun,
Yeqi Bai, Xinyu Cai, Min Dou, Shuanglu Hu, Botian Shi, Yu Qiao
- Abstract summary: This report provides an exhaustive evaluation of the latest state-of-the-art VLM, GPT-4V.
We explore the model's abilities to understand and reason about driving scenes, make decisions, and ultimately act in the capacity of a driver.
Our findings reveal that GPT-4V demonstrates superior performance in scene understanding and causal reasoning compared to existing autonomous systems.
- Score: 37.617793990547625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The pursuit of autonomous driving technology hinges on the sophisticated
integration of perception, decision-making, and control systems. Traditional
approaches, both data-driven and rule-based, have been hindered by their
inability to grasp the nuance of complex driving environments and the
intentions of other road users. This has been a significant bottleneck,
particularly in the development of common sense reasoning and nuanced scene
understanding necessary for safe and reliable autonomous driving. The advent of
Visual Language Models (VLM) represents a novel frontier in realizing fully
autonomous vehicle driving. This report provides an exhaustive evaluation of
the latest state-of-the-art VLM, GPT-4V(ision), and its application in
autonomous driving scenarios. We explore the model's abilities to understand
and reason about driving scenes, make decisions, and ultimately act in the
capacity of a driver. Our comprehensive tests span from basic scene recognition
to complex causal reasoning and real-time decision-making under varying
conditions. Our findings reveal that GPT-4V demonstrates superior performance
in scene understanding and causal reasoning compared to existing autonomous
systems. It showcases the potential to handle out-of-distribution scenarios,
recognize intentions, and make informed decisions in real driving contexts.
However, challenges remain, particularly in direction discernment, traffic
light recognition, vision grounding, and spatial reasoning tasks. These
limitations underscore the need for further research and development. The
project is now available on GitHub for interested parties to access and use:
https://github.com/PJLab-ADG/GPT4V-AD-Exploration
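The evaluation described above is prompt-based: GPT-4V is shown driving-scene images and asked to describe, reason about, and act on them. As a rough illustration of that style of query (not the authors' exact harness), here is a minimal sketch using the OpenAI Python SDK; the model name, prompt, and image path are assumptions:

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_gpt4v_about_scene(image_path: str, question: str) -> str:
    """Send one driving-scene image plus a question to a GPT-4V-style model."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed vision-capable model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=300,
    )
    return response.choices[0].message.content


# Hypothetical usage, mirroring the scene-understanding probes in the report:
# print(ask_gpt4v_about_scene(
#     "front_camera.jpg",
#     "Describe the traffic scene and state whether it is safe to change lanes."))
```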
Related papers
- Pedestrian motion prediction evaluation for urban autonomous driving [0.0]
We analyze selected publications that provide open-source solutions to assess the value of traditional motion prediction metrics.
This perspective should be valuable to any autonomous driving or robotics engineer seeking the real-world performance of existing state-of-the-art pedestrian motion prediction methods.
arXiv Detail & Related papers (2024-10-22T10:06:50Z) - Exploring the Causality of End-to-End Autonomous Driving [57.631400236930375]
We propose a comprehensive approach to explore and analyze the causality of end-to-end autonomous driving.
Our work is the first to unveil the mystery of end-to-end autonomous driving and turn the black box into a white one.
arXiv Detail & Related papers (2024-07-09T04:56:11Z) - Applications of Computer Vision in Autonomous Vehicles: Methods, Challenges and Future Directions [2.693342141713236]
This paper reviews publications on computer vision and autonomous driving published over the last ten years.
In particular, we first investigate the development of autonomous driving systems and summarize the systems developed by major automotive manufacturers from different countries.
Then, a comprehensive overview of computer vision applications for autonomous driving, such as depth estimation, object detection, lane detection, and traffic sign recognition, is presented.
arXiv Detail & Related papers (2023-11-15T16:41:18Z) - LLM4Drive: A Survey of Large Language Models for Autonomous Driving [62.10344445241105]
Large language models (LLMs) have demonstrated abilities including understanding context, logical reasoning, and generating answers.
In this paper, we systematically review the research line on Large Language Models for Autonomous Driving (LLM4AD).
arXiv Detail & Related papers (2023-11-02T07:23:33Z) - Drive Anywhere: Generalizable End-to-end Autonomous Driving with
Multi-modal Foundation Models [114.69732301904419]
We present an approach for end-to-end, open-set (any environment/scene) autonomous driving that can provide driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z) - DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [84.29836263441136]
This study introduces DriveGPT4, a novel interpretable end-to-end autonomous driving system based on multimodal large language models (MLLMs).
DriveGPT4 facilitates the interpretation of vehicle actions, offers pertinent reasoning, and effectively addresses a diverse range of questions posed by users.
arXiv Detail & Related papers (2023-10-02T17:59:52Z) - Drive Like a Human: Rethinking Autonomous Driving with Large Language
Models [28.957124302293966]
We explore the potential of using a large language model (LLM) to understand the driving environment in a human-like manner.
Our experiments show that the LLM exhibits the impressive ability to reason and solve long-tailed cases.
arXiv Detail & Related papers (2023-07-14T05:18:34Z) - Exploring Contextual Representation and Multi-Modality for End-to-End
Autonomous Driving [58.879758550901364]
Recent perception systems enhance spatial understanding with sensor fusion but often lack full environmental context.
We introduce a framework that integrates three cameras to emulate the human field of view, coupled with top-down bird's-eye-view semantic data, to enhance contextual representation.
Our method achieves a displacement error of 0.67 m in open-loop settings, surpassing current methods by 6.9% on the nuScenes dataset; a minimal sketch of this displacement-error metric is given after this list.
arXiv Detail & Related papers (2022-10-13T05:56:20Z) - Explainability of vision-based autonomous driving systems: Review and
challenges [33.720369945541805]
The need for explainability is strong in driving, a safety-critical application.
This survey gathers contributions from several research fields, namely computer vision, deep learning, autonomous driving, and explainable AI (X-AI).
arXiv Detail & Related papers (2021-01-13T19:09:38Z)
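For context on the displacement-error figure quoted for the contextual-representation paper above: open-loop planning benchmarks such as nuScenes typically report the average L2 distance between predicted and ground-truth future trajectories. A minimal sketch of that computation (array shapes and names are assumptions, not the paper's code):

```python
import numpy as np


def average_displacement_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean L2 distance between predicted and ground-truth waypoints.

    pred, gt: arrays of shape (num_samples, num_timesteps, 2) holding
    (x, y) positions in metres.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


# Toy example: a prediction offset by 0.67 m at every waypoint
# yields an average displacement error of 0.67 m.
gt = np.zeros((1, 6, 2))
pred = gt + np.array([0.67, 0.0])
print(average_displacement_error(pred, gt))  # -> 0.67
```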