Cognitive TransFuser: Semantics-guided Transformer-based Sensor Fusion
for Improved Waypoint Prediction
- URL: http://arxiv.org/abs/2308.02126v2
- Date: Wed, 31 Jan 2024 10:36:02 GMT
- Title: Cognitive TransFuser: Semantics-guided Transformer-based Sensor Fusion
for Improved Waypoint Prediction
- Authors: Hwan-Soo Choi, Jongoh Jeong, Young Hoo Cho, Kuk-Jin Yoon, and
Jong-Hwan Kim
- Abstract summary: The RGB-LIDAR-based multi-task feature fusion network, coined Cognitive TransFuser, augments the baseline network and exceeds it by a significant margin for safer and more complete road navigation.
We validate the proposed network on the Town05 Short and Town05 Long benchmarks through extensive experiments, achieving real-time inference at up to 44.2 FPS.
- Score: 38.971222477695214
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sensor fusion approaches for intelligent self-driving agents remain key to
driving scene understanding given visual global contexts acquired from input
sensors. Specifically, for the local waypoint prediction task, single-modality
networks are still limited by a strong dependency on the sensitivity of the
input sensor, and thus recent works promote fusing multiple sensors at the
feature level in practice. While it is well known that multiple data modalities
encourage mutual contextual exchange, deployment to practical driving scenarios
requires global 3D scene understanding in real time with minimal computation,
thereby placing greater significance on the training strategy given a limited
number of practically usable sensors. In this
light, we exploit carefully selected auxiliary tasks that are highly correlated
with the target task of interest (e.g., traffic light recognition and semantic
segmentation) by fusing auxiliary task features and also using auxiliary heads
for waypoint prediction based on imitation learning. Our RGB-LIDAR-based
multi-task feature fusion network, coined Cognitive TransFuser, augments the
baseline network and exceeds it by a significant margin, enabling safer and more
complete road navigation in the CARLA simulator. We validate the proposed
network on the Town05 Short and Town05 Long benchmarks through extensive
experiments, achieving real-time inference at up to 44.2 FPS.
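
As an illustration of the described approach, a minimal sketch of an RGB-LiDAR multi-task fusion model with auxiliary heads and a waypoint head follows. The module names, feature dimensions, fusion scheme, and head designs are assumptions for illustration only, not the paper's exact architecture.

import torch
import torch.nn as nn

class MultiTaskFusionSketch(nn.Module):
    # Hypothetical RGB + LiDAR-BEV fusion backbone with auxiliary heads.
    # Layer sizes, the fusion scheme, and the waypoint decoder are illustrative
    # assumptions, not the Cognitive TransFuser's exact design.

    def __init__(self, feat_dim=256, num_waypoints=4, num_seg_classes=7, num_light_states=4):
        super().__init__()
        # Lightweight per-modality encoders (stand-ins for the real backbones).
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lidar_encoder = nn.Sequential(
            nn.Conv2d(2, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Token-level fusion with a small transformer encoder.
        fusion_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)
        # Auxiliary heads highly correlated with the waypoint task.
        self.light_head = nn.Linear(feat_dim, num_light_states)  # traffic light state
        self.seg_head = nn.Linear(feat_dim, num_seg_classes)     # coarse semantic summary
        # Waypoint head regresses future (x, y) offsets from the fused feature.
        self.waypoint_head = nn.Linear(feat_dim, num_waypoints * 2)
        self.num_waypoints = num_waypoints

    def forward(self, rgb, lidar_bev):
        tokens = torch.stack(
            [self.rgb_encoder(rgb), self.lidar_encoder(lidar_bev)], dim=1)  # (B, 2, feat_dim)
        fused = self.fusion(tokens).mean(dim=1)                             # (B, feat_dim)
        waypoints = self.waypoint_head(fused).view(-1, self.num_waypoints, 2)
        return waypoints, self.light_head(fused), self.seg_head(fused)

Under an imitation-learning setup, a plausible training objective would combine an L1 loss on the predicted waypoints with cross-entropy losses on the auxiliary heads; the exact loss weighting used in the paper is not reproduced here.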
Related papers
- Semantic Communication for Cooperative Perception using HARQ [51.148203799109304]
We introduce a cooperative perception semantic communication framework that leverages an importance map to distill critical semantic information.
To counter the challenges posed by time-varying multipath fading, our approach incorporates orthogonal frequency-division multiplexing (OFDM) along with channel estimation and equalization strategies.
We introduce a novel semantic error detection method that is integrated with our semantic communication framework in the spirit of hybrid automatic repeat request (HARQ).
arXiv Detail & Related papers (2024-08-29T08:53:26Z)
- Deep Learning-Based Robust Multi-Object Tracking via Fusion of mmWave Radar and Camera Sensors [6.166992288822812]
Multi-Object Tracking plays a critical role in ensuring safer and more efficient navigation through complex traffic scenarios.
This paper presents a novel deep learning-based method that integrates radar and camera data to enhance the accuracy and robustness of Multi-Object Tracking in autonomous driving systems.
arXiv Detail & Related papers (2024-07-10T21:09:09Z)
- Efficient Fusion and Task Guided Embedding for End-to-end Autonomous Driving [1.3149617027696827]
We introduce a compact yet potent solution named EfficientFuser to address the challenges of sensor fusion and safety risk prediction.
Evaluated on the CARLA simulation platform, EfficientFuser demonstrates remarkable efficiency, utilizing merely 37.6% of the parameters.
Its safety score nears that of the leading safety-enhanced method, showcasing its efficacy and potential for practical deployment in autonomous driving systems.
arXiv Detail & Related papers (2024-07-03T07:45:58Z)
- Unsupervised Domain Adaptation for Self-Driving from Past Traversal Features [69.47588461101925]
We propose a method to adapt 3D object detectors to new driving environments.
Our approach enhances LiDAR-based detection models using spatially quantized historical features.
Experiments on real-world datasets demonstrate significant improvements.
arXiv Detail & Related papers (2023-09-21T15:00:31Z)
- Exploring Contextual Representation and Multi-Modality for End-to-End Autonomous Driving [58.879758550901364]
Recent perception systems enhance spatial understanding with sensor fusion but often lack full environmental context.
We introduce a framework that integrates three cameras to emulate the human field of view, coupled with top-down bird's-eye-view semantic data to enhance contextual representation.
Our method achieves a displacement error of 0.67 m in open-loop settings, surpassing current methods by 6.9% on the nuScenes dataset.
arXiv Detail & Related papers (2022-10-13T05:56:20Z)
- Safety-Enhanced Autonomous Driving Using Interpretable Sensor Fusion Transformer [28.15612357340141]
We propose a safety-enhanced autonomous driving framework, named Interpretable Sensor Fusion Transformer (InterFuser).
We process and fuse information from multi-modal multi-view sensors for achieving comprehensive scene understanding and adversarial event detection.
Our framework provides richer semantics, which are exploited to better constrain actions within the safe sets.
arXiv Detail & Related papers (2022-07-28T11:36:21Z)
- Bayesian Imitation Learning for End-to-End Mobile Manipulation [80.47771322489422]
Augmenting policies with additional sensor inputs, such as RGB + depth cameras, is a straightforward approach to improving robot perception capabilities.
We show that using the Variational Information Bottleneck to regularize convolutional neural networks improves generalization to held-out domains (a minimal illustrative sketch of this regularizer appears after this list).
We demonstrate that our method is able to help close the sim-to-real gap and successfully fuse RGB and depth modalities.
arXiv Detail & Related papers (2022-02-15T17:38:30Z)
- Plants Don't Walk on the Street: Common-Sense Reasoning for Reliable Semantic Segmentation [0.7696728525672148]
We propose to use a partly human-designed, partly learned set of rules to describe relations between objects of a traffic scene on a high level of abstraction.
In doing so, we improve and robustify existing deep neural networks consuming low-level sensor information.
arXiv Detail & Related papers (2021-04-19T12:51:06Z)
- Lite-HDSeg: LiDAR Semantic Segmentation Using Lite Harmonic Dense Convolutions [2.099922236065961]
We present Lite-HDSeg, a novel real-time convolutional neural network for semantic segmentation of full 3D LiDAR point clouds.
Our experimental results show that the proposed method outperforms state-of-the-art semantic segmentation approaches that run in real time.
arXiv Detail & Related papers (2021-03-16T04:54:57Z)
- IntentNet: Learning to Predict Intention from Raw Sensor Data [86.74403297781039]
In this paper, we develop a one-stage detector and forecaster that exploits both 3D point clouds produced by a LiDAR sensor as well as dynamic maps of the environment.
Our multi-task model achieves better accuracy than the respective separate modules while saving computation, which is critical to reducing reaction time in self-driving applications.
arXiv Detail & Related papers (2021-01-20T00:31:52Z)
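
As a concrete illustration of the Variational Information Bottleneck regularizer mentioned in the Bayesian Imitation Learning entry above, the following minimal sketch shows how a bottleneck head and its KL penalty could be attached to CNN features; the latent size, action space, and beta weight are hypothetical assumptions, not that paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBHead(nn.Module):
    # Sketch of a Variational Information Bottleneck head on top of CNN features.
    # The feature dimension, latent size, and action space below are illustrative.

    def __init__(self, feat_dim=512, latent_dim=64, num_actions=8):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, latent_dim)       # posterior mean
        self.to_logvar = nn.Linear(feat_dim, latent_dim)    # posterior log-variance
        self.policy = nn.Linear(latent_dim, num_actions)    # imitation policy head

    def forward(self, features):
        mu, logvar = self.to_mu(features), self.to_logvar(features)
        # Reparameterization trick: sample z ~ N(mu, sigma^2) differentiably.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL divergence from the unit Gaussian prior, averaged over the batch.
        kl = (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()
        return self.policy(z), kl

def vib_loss(logits, expert_actions, kl, beta=1e-3):
    # Behavior-cloning cross-entropy plus the weighted bottleneck penalty.
    return F.cross_entropy(logits, expert_actions) + beta * kl

Forcing information through a narrow stochastic latent in this way is what the entry credits with improved generalization to held-out domains.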