Explainable Multi-Camera 3D Object Detection with Transformer-Based Saliency Maps
- URL: http://arxiv.org/abs/2312.14606v1
- Date: Fri, 22 Dec 2023 11:03:12 GMT
- Title: Explainable Multi-Camera 3D Object Detection with Transformer-Based Saliency Maps
- Authors: Till Beemelmanns, Wassim Zahr, Lutz Eckstein
- Abstract summary: Vision Transformers (ViTs) have achieved state-of-the-art results on various computer vision tasks, including 3D object detection.
However, their end-to-end implementation makes ViTs less explainable, which can be a challenge for deploying them in safety-critical applications.
We propose a novel method for generating saliency maps for a DETR-like ViT with multiple camera inputs used for 3D object detection.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have achieved state-of-the-art results on various
computer vision tasks, including 3D object detection. However, their end-to-end
implementation also makes ViTs less explainable, which can be a challenge for
deploying them in safety-critical applications, such as autonomous driving,
where it is important for authorities, developers, and users to understand the
model's reasoning behind its predictions. In this paper, we propose a novel
method for generating saliency maps for a DETR-like ViT with multiple camera
inputs used for 3D object detection. Our method is based on the raw attention
and is more efficient than gradient-based methods. We evaluate the proposed
method on the nuScenes dataset using extensive perturbation tests and show that
it outperforms other explainability methods in terms of visual quality and
quantitative metrics. We also demonstrate the importance of aggregating
attention across different layers of the transformer. Our work contributes to
the development of explainable AI for ViTs, which can help increase trust in AI
applications by establishing more transparency regarding the inner workings of
AI models.
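To make the idea concrete, the sketch below shows one plausible way to turn raw decoder cross-attention into per-camera saliency maps by aggregating attention across layers (an attention-rollout-style product). The tensor layout, the product-based layer aggregation, and the function name are illustrative assumptions; the abstract confirms only that the method uses raw attention, aggregates it across layers, and avoids gradient computation.

```python
import torch

def camera_saliency_from_attention(cross_attn, num_cams, feat_h, feat_w, query_idx=0):
    """Aggregate raw decoder cross-attention into per-camera saliency maps.

    Minimal sketch, not the authors' exact procedure:
      cross_attn: list of per-layer tensors of shape
                  (num_heads, num_queries, num_tokens), where the image
                  tokens of all cameras are concatenated along the last axis.
      query_idx:  index of the object query (detection) to explain.
    """
    agg = None
    for layer_attn in cross_attn:
        attn = layer_attn.mean(dim=0)              # average heads -> (queries, tokens)
        agg = attn if agg is None else agg * attn  # rollout-style aggregation across layers
    sal = agg[query_idx]                           # attention of one detection -> (tokens,)
    sal = sal.reshape(num_cams, feat_h, feat_w)    # split tokens back into camera feature maps
    # Normalize each camera map to [0, 1] for visualization / perturbation tests.
    lo = sal.amin(dim=(1, 2), keepdim=True)
    hi = sal.amax(dim=(1, 2), keepdim=True)
    return (sal - lo) / (hi - lo + 1e-8)           # (num_cams, feat_h, feat_w)
```

Because only attention weights already produced during the forward pass are reused, no extra backward passes are needed, which is what makes attention-based saliency cheaper than gradient-based attribution.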
Related papers
- Divide and Conquer: Improving Multi-Camera 3D Perception with 2D Semantic-Depth Priors and Input-Dependent Queries [30.17281824826716]
Existing techniques often neglect the synergistic effects of semantic and depth cues, leading to classification and position estimation errors.
We propose an input-aware Transformer framework that leverages Semantics and Depth as priors.
Our approach involves the use of an S-D Encoder that explicitly models semantic and depth priors, thereby disentangling the learning process of object categorization and position estimation.
arXiv Detail & Related papers (2024-08-13T13:51:34Z)
- Towards Unified 3D Object Detection via Algorithm and Data Unification [70.27631528933482]
We build the first unified multi-modal 3D object detection benchmark MM-Omni3D and extend the aforementioned monocular detector to its multi-modal version.
We name the designed monocular and multi-modal detectors as UniMODE and MM-UniMODE, respectively.
arXiv Detail & Related papers (2024-02-28T18:59:31Z)
- FusionViT: Hierarchical 3D Object Detection via LiDAR-Camera Vision Transformer Fusion [8.168523242105763]
We introduce FusionViT, a novel vision transformer-based 3D object detection model.
Our FusionViT model can achieve state-of-the-art performance and outperforms existing baseline methods.
arXiv Detail & Related papers (2023-11-07T00:12:01Z)
- HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous Driving [95.42203932627102]
3D human pose estimation is an emerging technology that can enable autonomous vehicles to perceive and understand the subtle and complex behaviors of pedestrians.
Our method efficiently makes use of these complementary signals in a semi-supervised fashion and outperforms existing methods by a large margin.
Specifically, we embed LiDAR points into pixel-aligned multi-modal features, which we pass through a sequence of Transformer refinement stages.
arXiv Detail & Related papers (2022-12-15T11:15:14Z)
- Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges.
MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par or even better than single-task models.
arXiv Detail & Related papers (2022-09-27T04:49:19Z)
- 3D Vision with Transformers: A Survey [114.86385193388439]
The success of the transformer architecture in natural language processing has attracted attention in the computer vision field.
We present a systematic and thorough review of more than 100 transformer methods for different 3D vision tasks.
We discuss transformer designs in 3D vision that allow the architecture to process data with various 3D representations.
arXiv Detail & Related papers (2022-08-08T17:59:11Z)
- Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics [13.7258515433446]
Self-supervised monocular depth estimation is an important task in 3D scene understanding.
We show how to adapt vision transformers for self-supervised monocular depth estimation.
Our study demonstrates how a transformer-based architecture achieves comparable performance while being more robust and generalizable.
arXiv Detail & Related papers (2022-02-07T13:17:29Z) - Learnable Online Graph Representations for 3D Multi-Object Tracking [156.58876381318402]
We propose a unified, learning-based approach to the 3D MOT problem.
We employ a Neural Message Passing network for data association that is fully trainable.
We show the merit of the proposed approach on the publicly available nuScenes dataset by achieving state-of-the-art performance of 65.6% AMOTA and 58% fewer ID-switches.
arXiv Detail & Related papers (2021-04-23T17:59:28Z)
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.