TransFuser: Imitation with Transformer-Based Sensor Fusion for
Autonomous Driving
- URL: http://arxiv.org/abs/2205.15997v1
- Date: Tue, 31 May 2022 17:57:19 GMT
- Title: TransFuser: Imitation with Transformer-Based Sensor Fusion for
Autonomous Driving
- Authors: Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin
Renz, Andreas Geiger
- Abstract summary: We propose TransFuser, a mechanism to integrate image and LiDAR representations using self-attention.
Our approach uses transformer modules at multiple resolutions to fuse perspective view and bird's eye view feature maps.
We experimentally validate its efficacy on a challenging new benchmark with long routes and dense traffic, as well as the official leaderboard of the CARLA urban driving simulator.
- Score: 46.409930329699336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How should we integrate representations from complementary sensors for
autonomous driving? Geometry-based fusion has shown promise for perception
(e.g. object detection, motion forecasting). However, in the context of
end-to-end driving, we find that imitation learning based on existing sensor
fusion methods underperforms in complex driving scenarios with a high density
of dynamic agents. Therefore, we propose TransFuser, a mechanism to integrate
image and LiDAR representations using self-attention. Our approach uses
transformer modules at multiple resolutions to fuse perspective view and bird's
eye view feature maps. We experimentally validate its efficacy on a challenging
new benchmark with long routes and dense traffic, as well as the official
leaderboard of the CARLA urban driving simulator. At the time of submission,
TransFuser outperforms all prior work on the CARLA leaderboard in terms of
driving score by a large margin. Compared to geometry-based fusion, TransFuser
reduces the average collisions per kilometer by 48%.
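As a concrete illustration of the fusion mechanism described in the abstract, the following PyTorch sketch shows one plausible single-resolution fusion block: the perspective-view and bird's-eye-view feature maps are flattened into tokens, concatenated, and processed with self-attention before being reshaped back. This is a minimal approximation, not the authors' released implementation; the class name FusionBlock, the layer counts, and the omission of positional embeddings and downsampling are assumptions.

```python
# Minimal sketch (not the authors' exact code) of a single-resolution
# transformer fusion block between a perspective-view image feature map
# and a bird's-eye-view LiDAR feature map, as described in the abstract.
# Token dimension, head count, and layer count are illustrative assumptions;
# positional embeddings and downsampling are omitted for brevity.
import torch
import torch.nn as nn


class FusionBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, img_feat: torch.Tensor, bev_feat: torch.Tensor):
        # img_feat: [B, C, Hi, Wi] perspective-view features
        # bev_feat: [B, C, Hb, Wb] bird's-eye-view LiDAR features
        b, c, hi, wi = img_feat.shape
        _, _, hb, wb = bev_feat.shape
        # Flatten both maps into token sequences and concatenate them,
        # so self-attention can mix information across the two views.
        img_tokens = img_feat.flatten(2).transpose(1, 2)   # [B, Hi*Wi, C]
        bev_tokens = bev_feat.flatten(2).transpose(1, 2)   # [B, Hb*Wb, C]
        tokens = torch.cat([img_tokens, bev_tokens], dim=1)
        fused = self.encoder(tokens)
        # Split the fused tokens back and restore the spatial layouts.
        img_out = fused[:, : hi * wi].transpose(1, 2).reshape(b, c, hi, wi)
        bev_out = fused[:, hi * wi :].transpose(1, 2).reshape(b, c, hb, wb)
        return img_out, bev_out


if __name__ == "__main__":
    img = torch.randn(2, 64, 8, 8)    # toy perspective-view features
    bev = torch.randn(2, 64, 16, 16)  # toy bird's-eye-view features
    img_out, bev_out = FusionBlock(channels=64)(img, bev)
    print(img_out.shape, bev_out.shape)
```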
Related papers
- Condition-Aware Multimodal Fusion for Robust Semantic Perception of Driving Scenes [56.52618054240197]
We propose a novel, condition-aware multimodal fusion approach for robust semantic perception of driving scenes.
Our method, CAFuser, uses an RGB camera input to classify environmental conditions and generate a Condition Token that guides the fusion of multiple sensor modalities.
We set the new state of the art with CAFuser on the MUSES dataset with 59.7 PQ for multimodal panoptic segmentation and 78.2 mIoU for semantic segmentation, ranking first on the public benchmarks.
arXiv Detail & Related papers (2024-10-14T17:56:20Z)
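The CAFuser entry above describes classifying environmental conditions from the RGB camera and using the resulting Condition Token to guide multimodal fusion. The sketch below is a hypothetical reading of that idea, assuming a lightweight condition classifier and a simple token-derived gating over modality features; the class ConditionGuidedFusion, the condition set, and the gating scheme are illustrative, not the paper's actual design.

```python
# Hypothetical sketch of condition-token-guided fusion in the spirit of the
# CAFuser entry above. The condition classes, module names, and the gating
# scheme are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn


class ConditionGuidedFusion(nn.Module):
    def __init__(self, channels: int, num_modalities: int, num_conditions: int = 4):
        super().__init__()
        # Lightweight classifier over RGB that predicts the environmental
        # condition (e.g. clear / rain / fog / night) and emits logits.
        self.condition_net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, num_conditions),
        )
        self.token_proj = nn.Linear(num_conditions, channels)  # condition token
        self.gate = nn.Linear(channels, num_modalities)        # per-modality gates

    def forward(self, rgb: torch.Tensor, modality_feats: list[torch.Tensor]):
        # rgb: [B, 3, H, W]; modality_feats: list of [B, C, h, w] feature maps.
        cond_logits = self.condition_net(rgb)             # [B, num_conditions]
        token = self.token_proj(cond_logits.softmax(-1))  # [B, C]
        weights = self.gate(token).softmax(-1)            # [B, num_modalities]
        stacked = torch.stack(modality_feats, dim=1)      # [B, M, C, h, w]
        # Weight each modality by its condition-dependent gate and sum.
        fused = (weights[:, :, None, None, None] * stacked).sum(dim=1)
        return fused, cond_logits
```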
- M2DA: Multi-Modal Fusion Transformer Incorporating Driver Attention for Autonomous Driving [11.36165122994834]
We propose a Multi-Modal fusion transformer incorporating Driver Attention (M2DA) for autonomous driving.
By incorporating driver attention, we endow autonomous vehicles with human-like scene understanding, enabling them to identify crucial areas precisely and ensure safety.
arXiv Detail & Related papers (2024-03-19T08:54:52Z)
- G-MEMP: Gaze-Enhanced Multimodal Ego-Motion Prediction in Driving [71.9040410238973]
We focus on inferring the ego trajectory of a driver's vehicle using their gaze data.
Next, we develop G-MEMP, a novel multimodal ego-trajectory prediction network that combines GPS and video input with gaze data.
The results show that G-MEMP significantly outperforms state-of-the-art methods in both benchmarks.
arXiv Detail & Related papers (2023-12-13T23:06:30Z)
- Sensor Fusion by Spatial Encoding for Autonomous Driving [1.319058156672392]
We introduce a method for fusing data from camera and LiDAR.
By employing Transformer modules at multiple resolutions, the proposed method effectively combines local and global contextual relationships.
The proposed method outperforms previous approaches on the most challenging benchmarks.
arXiv Detail & Related papers (2023-08-17T04:12:02Z)
- Penalty-Based Imitation Learning With Cross Semantics Generation Sensor Fusion for Autonomous Driving [1.2749527861829049]
In this paper, we provide a penalty-based imitation learning approach to integrate multiple modalities of information.
We observe a remarkable increase of more than 12% in the driving score compared to the state-of-the-art (SOTA) model, InterFuser.
Our model achieves this performance enhancement with a 7-fold increase in inference speed and an approximately 30% reduction in model size.
arXiv Detail & Related papers (2023-03-21T14:29:52Z)
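The penalty-based entry above augments imitation learning with a penalty term. The snippet below is a generic, hedged illustration of such an objective, assuming an L1 waypoint imitation loss plus a weighted per-sample penalty; the function name, shapes, and penalty source are placeholders rather than the paper's actual formulation.

```python
# Generic illustration of a penalty-augmented imitation objective, loosely in
# the spirit of the penalty-based entry above. The penalty definition and the
# weighting factor are placeholders; the paper's formulation may differ.
import torch
import torch.nn.functional as F


def imitation_loss_with_penalty(pred_waypoints: torch.Tensor,
                                expert_waypoints: torch.Tensor,
                                penalty: torch.Tensor,
                                penalty_weight: float = 1.0) -> torch.Tensor:
    """pred/expert waypoints: [B, T, 2]; penalty: [B] per-sample penalty term
    (e.g. derived from predictions overlapping unsafe regions)."""
    imitation = F.l1_loss(pred_waypoints, expert_waypoints)  # behaviour cloning term
    return imitation + penalty_weight * penalty.mean()
```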
- Exploring Contextual Representation and Multi-Modality for End-to-End Autonomous Driving [58.879758550901364]
Recent perception systems enhance spatial understanding with sensor fusion but often lack full environmental context.
We introduce a framework that integrates three cameras to emulate the human field of view, coupled with top-down bird's-eye-view semantic data to enhance contextual representation.
Our method achieves a displacement error of 0.67 m in open-loop settings, surpassing current methods by 6.9% on the nuScenes dataset.
arXiv Detail & Related papers (2022-10-13T05:56:20Z)
- Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [59.60483620730437]
We propose TransFuser, a novel Multi-Modal Fusion Transformer, to integrate image and LiDAR representations using attention.
Our approach achieves state-of-the-art driving performance while reducing collisions by 76% compared to geometry-based fusion.
arXiv Detail & Related papers (2021-04-19T11:48:13Z)
- Towards Autonomous Driving: a Multi-Modal 360$^{\circ}$ Perception Proposal [87.11988786121447]
This paper presents a framework for 3D object detection and tracking for autonomous vehicles.
The solution, based on a novel sensor fusion configuration, provides accurate and reliable road environment detection.
A variety of tests of the system, deployed in an autonomous vehicle, have successfully assessed the suitability of the proposed perception stack.
arXiv Detail & Related papers (2020-08-21T20:36:21Z)