LeTFuser: Light-weight End-to-end Transformer-Based Sensor Fusion for
Autonomous Driving with Multi-Task Learning
- URL: http://arxiv.org/abs/2310.13135v3
- Date: Fri, 1 Dec 2023 19:59:29 GMT
- Title: LeTFuser: Light-weight End-to-end Transformer-Based Sensor Fusion for
Autonomous Driving with Multi-Task Learning
- Authors: Pedram Agand, Mohammad Mahdavian, Manolis Savva, Mo Chen
- Abstract summary: We introduce LeTFuser, an algorithm for fusing multiple RGB-D camera representations.
To perform perception and control tasks simultaneously, we utilize multi-task learning.
- Score: 16.241116794114525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In end-to-end autonomous driving, the utilization of existing sensor fusion
techniques and navigational control methods for imitation learning proves
inadequate in challenging situations that involve numerous dynamic agents. To
address this issue, we introduce LeTFuser, a lightweight transformer-based
algorithm for fusing multiple RGB-D camera representations. To perform
perception and control tasks simultaneously, we utilize multi-task learning.
Our model comprises two modules: the first is the perception module,
responsible for encoding the observation data obtained from the RGB-D
cameras. Our approach employs the Convolutional vision Transformer (CvT)
\cite{wu2021cvt} to better extract and fuse features from multiple RGB cameras,
owing to the local and global feature extraction capabilities of its
convolution and transformer modules, respectively. The encoded features,
combined with static and dynamic environment information, are later used by
our control module to predict waypoints and vehicular controls (e.g. steering,
throttle, and brake). We use two methods to generate the vehicular control
commands. The first method uses a
PID algorithm to follow the waypoints on the fly, whereas the second one
directly predicts the control policy using the measurement features and
environmental state. We evaluate the model and conduct a comparative analysis
with recent models on the CARLA simulator using various scenarios, ranging from
normal to adversarial conditions, to simulate real-world scenarios. Our method
demonstrates better or comparable results with respect to our baselines in
terms of driving ability. The code is available at
\url{https://github.com/pagand/e2etransfuser/tree/cvpr-w} to facilitate future
studies.
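As an illustration of the first control method described in the abstract (a PID algorithm that follows the predicted waypoints on the fly), the sketch below shows how ego-frame waypoints and the current speed could be turned into steering, throttle, and brake commands. This is a minimal sketch, not the LeTFuser repository code: the class names (PIDController, WaypointFollower), the gains, and the aim-point and target-speed heuristics are assumptions made for illustration only.

import math
from collections import deque


class PIDController:
    """Discrete PID controller with a bounded error window for the integral term."""

    def __init__(self, k_p=1.0, k_i=0.0, k_d=0.0, window=20):
        self.k_p, self.k_i, self.k_d = k_p, k_i, k_d
        self.errors = deque(maxlen=window)

    def step(self, error):
        self.errors.append(error)
        integral = sum(self.errors) / len(self.errors)
        derivative = self.errors[-1] - self.errors[-2] if len(self.errors) > 1 else 0.0
        return self.k_p * error + self.k_i * integral + self.k_d * derivative


class WaypointFollower:
    """Turns predicted ego-frame waypoints (x right, y forward, metres) and the
    current speed (m/s) into (steer, throttle, brake). Gains and heuristics are
    placeholders, not the paper's tuned values."""

    def __init__(self, max_throttle=0.75, brake_ratio=0.95):
        self.turn_pid = PIDController(k_p=1.25, k_i=0.75, k_d=0.3)
        self.speed_pid = PIDController(k_p=5.0, k_i=0.5, k_d=1.0)
        self.max_throttle = max_throttle
        self.brake_ratio = brake_ratio

    def step(self, waypoints, speed):
        # Target speed inferred from the spacing of the first two predicted waypoints.
        desired_speed = 2.0 * math.hypot(waypoints[1][0] - waypoints[0][0],
                                         waypoints[1][1] - waypoints[0][1])
        # Lateral control: steer towards the midpoint of the first two waypoints.
        aim_x = (waypoints[0][0] + waypoints[1][0]) / 2.0
        aim_y = (waypoints[0][1] + waypoints[1][1]) / 2.0
        heading_error = math.atan2(aim_x, aim_y)  # zero when the aim point is dead ahead
        steer = max(-1.0, min(1.0, self.turn_pid.step(heading_error)))
        # Longitudinal control: PID on the speed error; brake when clearly too fast.
        brake = desired_speed < speed * self.brake_ratio
        throttle = 0.0 if brake else max(0.0, min(self.max_throttle,
                                                  self.speed_pid.step(desired_speed - speed)))
        return steer, throttle, float(brake)


if __name__ == "__main__":
    follower = WaypointFollower()
    # Two waypoints roughly 2 m apart, drifting slightly right, while driving at 5 m/s.
    print(follower.step([(0.2, 2.0), (0.5, 4.0)], speed=5.0))

In a closed-loop CARLA rollout this kind of follower would be called once per simulation step with the freshly predicted waypoints, so the PID error windows accumulate across time.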
Related papers
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Tactics2D: A Highly Modular and Extensible Simulator for Driving Decision-making [24.795867304772404]
Existing simulators often fall short in providing diverse scenarios or interactive behavior models for traffic participants.
Tactics2D adopts a modular approach to traffic scenario construction, encompassing road elements, traffic regulations, behavior models, physics simulations for vehicles, and event detection mechanisms.
Users can effectively evaluate the performance of driving decision-making models across various scenarios by leveraging both public datasets and user-collected real-world data.
arXiv Detail & Related papers (2023-11-18T12:31:34Z)
- Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models [114.69732301904419]
We present an approach to apply end-to-end open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z)
- M$^2$DAR: Multi-View Multi-Scale Driver Action Recognition with Vision Transformer [5.082919518353888]
We present a multi-view, multi-scale framework for naturalistic driving action recognition and localization in untrimmed videos.
Our system features a weight-sharing, multi-scale Transformer-based action recognition network that learns robust hierarchical representations.
arXiv Detail & Related papers (2023-05-13T02:38:15Z)
- PSNet: Parallel Symmetric Network for Video Salient Object Detection [85.94443548452729]
We propose a VSOD network with up and down parallel symmetry, named PSNet.
Two parallel branches with different dominant modalities are set to achieve complete video saliency decoding.
arXiv Detail & Related papers (2022-10-12T04:11:48Z)
- Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z)
- Fully End-to-end Autonomous Driving with Semantic Depth Cloud Mapping and Multi-Agent [2.512827436728378]
We propose a novel deep learning model trained with end-to-end and multi-task learning manners to perform both perception and control tasks simultaneously.
The model is evaluated on CARLA simulator with various scenarios made of normal-adversarial situations and different weathers to mimic real-world conditions.
arXiv Detail & Related papers (2022-04-12T03:57:01Z)
- Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [59.60483620730437]
We propose TransFuser, a novel Multi-Modal Fusion Transformer, to integrate image and LiDAR representations using attention.
Our approach achieves state-of-the-art driving performance while reducing collisions by 76% compared to geometry-based fusion.
arXiv Detail & Related papers (2021-04-19T11:48:13Z)
- A Driving Behavior Recognition Model with Bi-LSTM and Multi-Scale CNN [59.57221522897815]
We propose a neural network model based on trajectories information for driving behavior recognition.
We evaluate the proposed model on the public BLVD dataset, achieving a satisfying performance.
arXiv Detail & Related papers (2021-03-01T06:47:29Z)