Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
- URL: http://arxiv.org/abs/2104.09224v1
- Date: Mon, 19 Apr 2021 11:48:13 GMT
- Title: Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
- Authors: Aditya Prakash, Kashyap Chitta, Andreas Geiger
- Abstract summary: We propose TransFuser, a novel Multi-Modal Fusion Transformer, to integrate image and LiDAR representations using attention.
Our approach achieves state-of-the-art driving performance while reducing collisions by 76% compared to geometry-based fusion.
- Score: 59.60483620730437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How should representations from complementary sensors be integrated for
autonomous driving? Geometry-based sensor fusion has shown great promise for
perception tasks such as object detection and motion forecasting. However, for
the actual driving task, the global context of the 3D scene is key, e.g. a
change in traffic light state can affect the behavior of a vehicle
geometrically distant from that traffic light. Geometry alone may therefore be
insufficient for effectively fusing representations in end-to-end driving
models. In this work, we demonstrate that imitation learning policies based on
existing sensor fusion methods under-perform in the presence of a high density
of dynamic agents and complex scenarios, which require global contextual
reasoning, such as handling traffic oncoming from multiple directions at
uncontrolled intersections. Therefore, we propose TransFuser, a novel
Multi-Modal Fusion Transformer, to integrate image and LiDAR representations
using attention. We experimentally validate the efficacy of our approach in
urban settings involving complex scenarios using the CARLA urban driving
simulator. Our approach achieves state-of-the-art driving performance while
reducing collisions by 76% compared to geometry-based fusion.
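The abstract's central mechanism, fusing image and LiDAR feature maps through attention, can be illustrated with a short sketch. The module below is not the authors' implementation: the feature-map shapes, layer counts, and the use of a standard PyTorch transformer encoder over concatenated image and LiDAR tokens are illustrative assumptions only.

```python
# Minimal sketch of attention-based image/LiDAR feature fusion (not the
# authors' code). Assumes two CNN branches have already produced feature
# maps; shapes, dimensions, and names below are illustrative only.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, image_feat, lidar_feat):
        # image_feat: (B, C, H, W) from a camera branch
        # lidar_feat: (B, C, H, W) from a LiDAR bird's-eye-view branch
        b, c, h, w = image_feat.shape
        img_tokens = image_feat.flatten(2).transpose(1, 2)     # (B, H*W, C)
        lidar_tokens = lidar_feat.flatten(2).transpose(1, 2)   # (B, H*W, C)
        tokens = torch.cat([img_tokens, lidar_tokens], dim=1)  # joint token set
        fused = self.encoder(tokens)            # self-attention across both modalities
        img_fused, lidar_fused = fused.split(h * w, dim=1)
        # reshape back to spatial maps and add as residuals
        img_out = image_feat + img_fused.transpose(1, 2).reshape(b, c, h, w)
        lidar_out = lidar_feat + lidar_fused.transpose(1, 2).reshape(b, c, h, w)
        return img_out, lidar_out

# usage with dummy features
fusion = AttentionFusion()
img = torch.randn(2, 256, 8, 8)
lidar = torch.randn(2, 256, 8, 8)
img_f, lidar_f = fusion(img, lidar)
```

In this sketch every token can attend to tokens from both modalities, which is the kind of global contextual exchange the abstract argues geometry-only fusion lacks.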
Related papers
- GITSR: Graph Interaction Transformer-based Scene Representation for Multi Vehicle Collaborative Decision-making [9.910230703889956]
This study focuses on efficient scene representation and the modeling of spatial interaction behaviors of traffic states.
In this study, we propose GITSR, an effective framework for Graph Interaction Transformer-based Scene Representation.
arXiv Detail & Related papers (2024-11-03T15:27:26Z)
- Graph-Based Interaction-Aware Multimodal 2D Vehicle Trajectory Prediction using Diffusion Graph Convolutional Networks [17.989423104706397]
This study presents the Graph-based Interaction-aware Multi-modal Trajectory Prediction framework.
Within this framework, vehicles' motions are conceptualized as nodes in a time-varying graph, and the traffic interactions are represented by a dynamic adjacency matrix (a minimal illustration of this representation appears after this list).
We employ a driving intention-specific feature fusion, enabling the adaptive integration of historical and future embeddings.
arXiv Detail & Related papers (2023-09-05T06:28:13Z)
- Penalty-Based Imitation Learning With Cross Semantics Generation Sensor Fusion for Autonomous Driving [1.2749527861829049]
In this paper, we provide a penalty-based imitation learning approach to integrate multiple modalities of information.
We observe a remarkable increase of more than 12% in the driving score compared to the state-of-the-art (SOTA) model, InterFuser.
Our model delivers this gain while running roughly 7 times faster at inference and reducing the model size by approximately 30%.
arXiv Detail & Related papers (2023-03-21T14:29:52Z)
- Social Occlusion Inference with Vectorized Representation for Autonomous Driving [0.0]
This paper introduces a novel social occlusion inference approach that learns a mapping from agent trajectories and scene context to an occupancy grid map (OGM) representing the view of the ego vehicle.
To verify the performance of the vectorized representation, we design a baseline based on a fully transformer-based encoder-decoder architecture.
We evaluate our approach on an unsignalized intersection in the INTERACTION dataset, where it outperforms state-of-the-art results.
arXiv Detail & Related papers (2023-03-18T10:44:39Z)
- Generative AI-empowered Simulation for Autonomous Driving in Vehicular Mixed Reality Metaverses [130.15554653948897]
In the vehicular mixed reality (MR) Metaverse, the distance between physical and virtual entities can be overcome.
Large-scale traffic and driving simulation via realistic data collection and fusion from the physical world is difficult and costly.
We propose an autonomous driving architecture, where generative AI is leveraged to synthesize unlimited conditioned traffic and driving data in simulations.
arXiv Detail & Related papers (2023-02-16T16:54:10Z)
- Exploring Contextual Representation and Multi-Modality for End-to-End Autonomous Driving [58.879758550901364]
Recent perception systems enhance spatial understanding with sensor fusion but often lack full environmental context.
We introduce a framework that integrates three cameras to emulate the human field of view, coupled with top-down bird's-eye-view semantic data to enhance contextual representation.
Our method achieves a displacement error of 0.67 m in open-loop settings, surpassing current methods by 6.9% on the nuScenes dataset.
arXiv Detail & Related papers (2022-10-13T05:56:20Z)
- TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving [46.409930329699336]
We propose TransFuser, a mechanism to integrate image and LiDAR representations using self-attention.
Our approach uses transformer modules at multiple resolutions to fuse perspective view and bird's eye view feature maps.
We experimentally validate its efficacy on a challenging new benchmark with long routes and dense traffic, as well as the official leaderboard of the CARLA urban driving simulator.
arXiv Detail & Related papers (2022-05-31T17:57:19Z)
- TrafficSim: Learning to Simulate Realistic Multi-Agent Behaviors [74.67698916175614]
We propose TrafficSim, a multi-agent behavior model for realistic traffic simulation.
In particular, we leverage an implicit latent variable model to parameterize a joint actor policy.
We show that TrafficSim generates significantly more realistic and diverse traffic scenarios compared to a diverse set of baselines.
arXiv Detail & Related papers (2021-01-17T00:29:30Z)
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)
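The graph-based trajectory-prediction entry above describes vehicles as nodes of a time-varying graph whose interactions are encoded in a dynamic adjacency matrix. The snippet below is a minimal illustration of that representation, not the paper's code: the distance threshold, row normalization, and single graph-convolution step are assumptions made for the example.

```python
# Illustrative sketch of "vehicles as nodes, interactions as a dynamic
# adjacency matrix". The threshold and normalization are assumptions.
import numpy as np

def dynamic_adjacency(positions, radius=30.0):
    """positions: (T, N, 2) array of N vehicles' xy positions over T steps.
    Returns a (T, N, N) row-normalized adjacency matrix per timestep."""
    T, N, _ = positions.shape
    adj = np.zeros((T, N, N))
    for t in range(T):
        dists = np.linalg.norm(
            positions[t, :, None, :] - positions[t, None, :, :], axis=-1)
        a = (dists < radius).astype(float)           # connect nearby vehicles
        adj[t] = a / a.sum(axis=-1, keepdims=True)   # row-normalize
    return adj

def graph_conv(features, adj, weight):
    """One simple graph-convolution step per timestep.
    features: (T, N, F), adj: (T, N, N), weight: (F, F_out)."""
    return adj @ features @ weight

# toy example: 4 vehicles, 5 timesteps, 2D positions used as features
pos = np.random.rand(5, 4, 2) * 50.0
A = dynamic_adjacency(pos)
W = np.random.rand(2, 8)
out = graph_conv(pos, A, W)   # (5, 4, 8) interaction-aware features
```

The toy call only shows the shapes involved; in the paper this kind of representation feeds a diffusion graph convolutional network rather than a single linear graph convolution.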