Spatial-temporal Concept based Explanation of 3D ConvNets
- URL: http://arxiv.org/abs/2206.05275v1
- Date: Thu, 9 Jun 2022 08:04:46 GMT
- Title: Spatial-temporal Concept based Explanation of 3D ConvNets
- Authors: Ying Ji, Yu Wang, Kensaku Mori, Jien Kato
- Abstract summary: We present a 3D ACE (Automatic Concept-based Explanation) framework for interpreting 3D ConvNets.
In our approach, videos are represented using high-level supervoxels, which are straightforward for humans to understand.
Experiments show that our method can discover spatial-temporal concepts at different importance levels.
- Score: 5.461115214431218
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies have achieved outstanding success in explaining 2D image
recognition ConvNets. On the other hand, due to the computational cost and
complexity of video data, the explanation of 3D video recognition ConvNets is
relatively less studied. In this paper, we present a 3D ACE (Automatic
Concept-based Explanation) framework for interpreting 3D ConvNets. In our
approach: (1) videos are represented using high-level supervoxels, which are
straightforward for humans to understand; and (2) the interpreting framework
estimates a score for each voxel, which reflects its importance in the decision
procedure. Experiments show that our method can discover spatial-temporal
concepts at different importance levels, and can thus explore in depth the
influence of these concepts on a target task, such as action classification.
The code is publicly available.
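To make the scoring idea concrete, below is a minimal, hypothetical sketch of an occlusion-style importance score over supervoxels, consistent with the abstract but not the authors' released code; the callables `model` and the array `supervoxel_labels` are assumptions introduced for illustration.

```python
import numpy as np

# Hypothetical sketch (not the authors' implementation).
# Assumes: `model(clip)` returns class probabilities for a video clip of
# shape (T, H, W, C), and `supervoxel_labels` has shape (T, H, W), assigning
# each space-time position a supervoxel id from a prior segmentation step.
def supervoxel_importance(model, clip, supervoxel_labels, target_class,
                          fill_value=0.0):
    """Score each supervoxel by the probability drop when it is masked out."""
    baseline = model(clip)[target_class]
    scores = {}
    for sv_id in np.unique(supervoxel_labels):
        masked = clip.copy()
        # Occlude the supervoxel's space-time region with a neutral value;
        # the boolean mask broadcasts over the channel dimension.
        masked[supervoxel_labels == sv_id] = fill_value
        scores[sv_id] = baseline - model(masked)[target_class]
    return scores  # larger drop => more important spatial-temporal region
```

Under this reading, a supervoxel whose removal causes a large drop in the target-class probability marks a spatial-temporal concept the network relies on for its decision.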
Related papers
- Escaping Plato's Cave: Robust Conceptual Reasoning through Interpretable 3D Neural Object Volumes [65.63534641857476]
We introduce CAVE - Concept Aware Volumes for Explanations - a new direction that unifies interpretability and robustness in image classification.
We design an inherently interpretable and robust classifier by extending existing 3D-aware classifiers with concepts extracted from their volumetric representations for classification.
In an array of quantitative metrics for interpretability, we compare against different concept-based approaches across the explainable AI literature and show that CAVE discovers well-grounded concepts that are used consistently across images, while achieving superior robustness.
arXiv Detail & Related papers (2025-03-17T17:55:15Z)
- Learning-based Multi-View Stereo: A Survey [55.3096230732874]
Multi-View Stereo (MVS) algorithms synthesize a comprehensive 3D representation, enabling precise reconstruction in complex environments.
With the success of deep learning, many learning-based MVS methods have been proposed, achieving impressive performance against traditional methods.
arXiv Detail & Related papers (2024-08-27T17:53:18Z)
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- A Review and A Robust Framework of Data-Efficient 3D Scene Parsing with Traditional/Learned 3D Descriptors [10.497309421830671]
Existing state-of-the-art 3D point cloud understanding methods perform well only in a fully supervised manner.
This work presents a general and simple framework to tackle point cloud understanding when labels are limited.
arXiv Detail & Related papers (2023-12-03T02:51:54Z)
- Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- ALSTER: A Local Spatio-Temporal Expert for Online 3D Semantic Reconstruction [62.599588577671796]
We propose an online 3D semantic segmentation method that incrementally reconstructs a 3D semantic map from a stream of RGB-D frames.
Unlike offline methods, ours is directly applicable to scenarios with real-time constraints, such as robotics or mixed reality.
arXiv Detail & Related papers (2023-11-29T20:30:18Z)
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
- 3D Concept Grounding on Neural Fields [99.33215488324238]
Existing visual reasoning approaches typically utilize supervised methods to extract 2D segmentation masks on which concepts are grounded.
Humans, in contrast, are capable of grounding concepts on the underlying 3D representation of images.
We propose to leverage the continuous, differentiable nature of neural fields to segment and learn concepts.
arXiv Detail & Related papers (2022-07-13T17:59:33Z)
- Unsupervised Learning of Visual 3D Keypoints for Control [104.92063943162896]
Learning sensorimotor control policies from high-dimensional images crucially relies on the quality of the underlying visual representations.
We propose a framework to learn such a 3D geometric structure, in the form of 3D keypoints, directly from images in an end-to-end unsupervised manner.
These discovered 3D keypoints tend to meaningfully capture robot joints as well as object movements in a consistent manner across both time and 3D space.
arXiv Detail & Related papers (2021-06-14T17:59:59Z)
- Making a Case for 3D Convolutions for Object Segmentation in Videos [16.167397418720483]
We show that 3D convolutional networks can be effectively applied to dense video prediction tasks such as salient object segmentation.
We propose a 3D decoder architecture that comprises novel 3D Global Convolution layers and 3D Refinement modules.
Our approach outperforms the existing state of the art by a large margin on the DAVIS'16 Unsupervised, FBMS, and ViSal benchmarks.
arXiv Detail & Related papers (2020-08-26T12:24:23Z)
- Interpreting video features: a comparison of 3D convolutional networks and convolutional LSTM networks [1.462434043267217]
We compare how 3D convolutional networks and convolutional LSTM networks learn features across temporally dependent frames.
Our findings indicate that the 3D convolutional model concentrates on shorter events in the input sequence, and places its spatial focus on fewer, contiguous areas.
arXiv Detail & Related papers (2020-02-02T11:27:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.