Exploring Contextual Representation and Multi-Modality for End-to-End
Autonomous Driving
- URL: http://arxiv.org/abs/2210.06758v2
- Date: Tue, 16 Jan 2024 23:54:43 GMT
- Title: Exploring Contextual Representation and Multi-Modality for End-to-End
Autonomous Driving
- Authors: Shoaib Azam, Farzeen Munir, Ville Kyrki, Moongu Jeon, and Witold
Pedrycz
- Abstract summary: Recent perception systems enhance spatial understanding with sensor fusion but often lack full environmental context.
We introduce a framework that integrates three cameras to emulate the human field of view, coupled with top-down bird's-eye-view semantic data to enhance contextual representation.
Our method achieves a displacement error of 0.67 m in the open-loop setting, surpassing current methods by 6.9% on the nuScenes dataset.
- Score: 58.879758550901364
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning contextual and spatial environmental representations enhances
an autonomous vehicle's hazard anticipation and decision-making in complex
scenarios. Recent perception systems enhance spatial understanding with sensor
fusion but often lack full environmental context. Humans, when driving,
naturally employ neural maps that integrate various factors such as historical
data, situational subtleties, and behavioral predictions of other road users to
form a rich contextual understanding of their surroundings. This neural
map-based comprehension is integral to making informed decisions on the road.
In contrast, even with their significant advancements, autonomous systems have
yet to fully harness this depth of human-like contextual understanding.
Motivated by this, our work draws inspiration from human driving patterns and
seeks to formalize the sensor fusion approach within an end-to-end autonomous
driving framework. We introduce a framework that integrates three cameras
(left, right, and center) to emulate the human field of view, coupled with
top-down bird's-eye-view semantic data to enhance contextual representation. The
sensor data is fused and encoded using a self-attention mechanism, leading to
an auto-regressive waypoint prediction module. We treat feature representation
as a sequential problem, employing a vision transformer to distill the
contextual interplay between sensor modalities. The efficacy of the proposed
method is experimentally evaluated in both open-loop and closed-loop settings. Our
method achieves a displacement error of 0.67 m in the open-loop setting, surpassing
current methods by 6.9% on the nuScenes dataset. In closed-loop evaluations on
CARLA's Town05 Long and Longest6 benchmarks, the proposed method improves
driving performance and route completion while reducing infractions.
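As a purely illustrative aid (not the authors' released code), the sketch below shows one minimal way the pipeline described above could be wired up in PyTorch: small patch-embedding stems stand in for the per-camera and BEV backbones, a transformer encoder performs the self-attention fusion over the concatenated tokens, and a GRU cell decodes waypoints auto-regressively. All module names, layer sizes, and the mean-pooling of the fused tokens are assumptions made for brevity; positional encodings, goal/measurement inputs, and the real backbones are omitted.

```python
import torch
import torch.nn as nn


class FusionWaypointNet(nn.Module):
    """Illustrative stand-in: camera + BEV tokens -> self-attention fusion -> GRU waypoints."""

    def __init__(self, d_model=128, n_heads=4, n_layers=2, n_waypoints=4):
        super().__init__()
        # Tiny patch-embedding stems stand in for the real image/BEV backbones.
        self.cam_stem = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.bev_stem = nn.Conv2d(1, d_model, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)  # self-attention over all tokens
        self.gru = nn.GRUCell(input_size=2, hidden_size=d_model)
        self.delta = nn.Linear(d_model, 2)  # per-step (dx, dy) offset
        self.n_waypoints = n_waypoints

    def forward(self, cams, bev):
        # cams: (B, 3 views, 3, H, W) for left/center/right; bev: (B, 1, H, W) semantic map
        b = cams.shape[0]
        tokens = [self.cam_stem(cams[:, v]).flatten(2).transpose(1, 2)
                  for v in range(cams.shape[1])]             # each (B, h*w, D)
        tokens.append(self.bev_stem(bev).flatten(2).transpose(1, 2))
        fused = self.fusion(torch.cat(tokens, dim=1))         # (B, N, D) fused context
        h = fused.mean(dim=1)                                 # pooled context initialises the GRU
        wp, out = cams.new_zeros(b, 2), []
        for _ in range(self.n_waypoints):                     # auto-regressive waypoint decoding
            h = self.gru(wp, h)
            wp = wp + self.delta(h)                           # cumulative (x, y) waypoint
            out.append(wp)
        return torch.stack(out, dim=1)                        # (B, n_waypoints, 2)


# Example: a batch of two samples, three 256x256 RGB views plus a 1-channel BEV map.
model = FusionWaypointNet()
cams = torch.randn(2, 3, 3, 256, 256)      # (batch, view, channel, H, W)
bev = torch.randn(2, 1, 256, 256)          # single-channel semantic BEV stand-in
pred = model(cams, bev)                    # (2, 4, 2) future (x, y) waypoints
gt = torch.randn(2, 4, 2)                  # dummy ground-truth waypoints
ade = (pred - gt).norm(dim=-1).mean()      # average displacement error (open-loop metric)
print(pred.shape, ade.item())
```

The usage example also computes an average displacement error against dummy ground-truth waypoints, the open-loop metric quoted above; in practice this would be measured against recorded expert trajectories.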
Related papers
- RainSD: Rain Style Diversification Module for Image Synthesis Enhancement using Feature-Level Style Distribution [5.500457283114346]
This paper presents a synthetic road dataset with sensor blockage, generated from the real road dataset BDD100K.
Using this dataset, the degradation of diverse multi-task networks for autonomous driving has been thoroughly evaluated and analyzed.
The tendency of performance degradation in deep neural network-based perception systems for autonomous vehicles has been analyzed in depth.
arXiv Detail & Related papers (2023-12-31T11:30:42Z)
- Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models [114.69732301904419]
We present an approach for end-to-end, open-set (any environment/scene) autonomous driving that can provide driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z)
- Decision Making for Autonomous Driving in Interactive Merge Scenarios via Learning-based Prediction [39.48631437946568]
This paper focuses on the complex task of merging into moving traffic where uncertainty emanates from the behavior of other drivers.
We frame the problem as a partially observable Markov decision process (POMDP) and solve it online with Monte Carlo tree search.
The solution to the POMDP is a policy that performs high-level driving maneuvers, such as giving way to an approaching car, keeping a safe distance from the vehicle in front, or merging into traffic (a toy Monte Carlo sketch of this high-level maneuver selection appears after this list).
arXiv Detail & Related papers (2023-03-29T16:12:45Z)
- Penalty-Based Imitation Learning With Cross Semantics Generation Sensor Fusion for Autonomous Driving [1.2749527861829049]
In this paper, we provide a penalty-based imitation learning approach to integrate multiple modalities of information.
We observe a remarkable increase of more than 12% in the driving score compared to the state-of-the-art (SOTA) model, InterFuser.
Our model achieves this performance enhancement while achieving a 7-fold increase in inference speed and reducing the model size by approximately 30%.
arXiv Detail & Related papers (2023-03-21T14:29:52Z)
- Policy Pre-training for End-to-end Autonomous Driving via Self-supervised Geometric Modeling [96.31941517446859]
We propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward, fully self-supervised framework for policy pretraining in visuomotor driving.
We aim to learn policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos.
In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input.
In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only.
arXiv Detail & Related papers (2023-01-03T08:52:49Z)
- Towards Optimal Strategies for Training Self-Driving Perception Models in Simulation [98.51313127382937]
We focus on the use of labels in the synthetic domain alone.
Our approach introduces both a way to learn neural-invariant representations and a theoretically inspired view on how to sample the data from the simulator.
We showcase our approach on the bird's-eye-view vehicle segmentation task with multi-sensor data.
arXiv Detail & Related papers (2021-11-15T18:37:43Z)
- Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [59.60483620730437]
We propose TransFuser, a novel Multi-Modal Fusion Transformer, to integrate image and LiDAR representations using attention.
Our approach achieves state-of-the-art driving performance while reducing collisions by 76% compared to geometry-based fusion.
arXiv Detail & Related papers (2021-04-19T11:48:13Z)
- IntentNet: Learning to Predict Intention from Raw Sensor Data [86.74403297781039]
In this paper, we develop a one-stage detector and forecaster that exploits both 3D point clouds produced by a LiDAR sensor and dynamic maps of the environment.
Our multi-task model achieves better accuracy than the respective separate modules while saving computation, which is critical to reducing reaction time in self-driving applications.
arXiv Detail & Related papers (2021-01-20T00:31:52Z)
- End-to-end Autonomous Driving Perception with Sequential Latent Representation Learning [34.61415516112297]
An end-to-end approach can simplify the system and avoid extensive human engineering effort.
A latent space is introduced to capture all relevant features useful for perception, which is learned through sequential latent representation learning.
The learned end-to-end perception model can solve the detection, tracking, localization, and mapping problems jointly with minimal human engineering effort.
arXiv Detail & Related papers (2020-03-21T05:37:44Z)
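As referenced in the merge-scenario entry above, the following is a toy, self-contained sketch of the Monte Carlo flavour of that planner: it estimates the value of three high-level maneuvers by sampling the other driver's hidden intent from a given belief and averaging simulated returns. This is a deliberate one-step simplification of the online MCTS-over-POMDP approach described in that paper; the reward numbers and the yield-probability belief are made-up assumptions.

```python
import random

ACTIONS = ["give_way", "keep_distance", "merge"]


def rollout_reward(action, other_driver_yields):
    # Hypothetical rewards: merging pays off only if the other driver yields;
    # the cautious maneuvers are safe but delay the merge.
    if action == "merge":
        return 2.0 if other_driver_yields else -10.0  # -10.0 ~ collision / near-miss
    if action == "keep_distance":
        return -0.2
    return -0.5  # "give_way"


def plan(belief_yield_prob, n_rollouts=1000):
    # Monte Carlo estimate of each maneuver's value: sample the other driver's
    # hidden intent from the current belief and average the simulated returns.
    values = {}
    for action in ACTIONS:
        total = 0.0
        for _ in range(n_rollouts):
            yields = random.random() < belief_yield_prob
            total += rollout_reward(action, yields)
        values[action] = total / n_rollouts
    return max(values, key=values.get), values


print(plan(belief_yield_prob=0.9))  # merging is expected to score highest
print(plan(belief_yield_prob=0.3))  # a cautious maneuver is expected to win
```

With an optimistic belief that the other driver will yield, merging scores highest; with a pessimistic belief, the cautious maneuvers win, mirroring the give-way / keep-distance / merge policy described in that entry.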