Explaining Motion Relevance for Activity Recognition in Video Deep
Learning Models
- URL: http://arxiv.org/abs/2003.14285v1
- Date: Tue, 31 Mar 2020 15:19:04 GMT
- Title: Explaining Motion Relevance for Activity Recognition in Video Deep
Learning Models
- Authors: Liam Hiley and Alun Preece and Yulia Hicks and Supriyo Chakraborty and
Prudhvi Gurram and Richard Tomsett
- Abstract summary: A small subset of explainability techniques has been applied for interpretability of 3D Convolutional Neural Network models in activity recognition tasks.
We propose a selective relevance method for adapting the 2D explanation techniques to provide motion-specific explanations.
Our results show that the selective relevance method can not only provide insight on the role played by motion in the model's decision -- in effect, revealing and quantifying the model's spatial bias -- but the method also simplifies the resulting explanations for human consumption.
- Score: 12.807049446839507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A small subset of explainability techniques developed initially for image
recognition models has recently been applied for interpretability of 3D
Convolutional Neural Network models in activity recognition tasks. Much like
the models themselves, the techniques require little or no modification to be
compatible with 3D inputs. However, these explanation techniques regard spatial
and temporal information jointly. Therefore, using such explanation techniques,
a user cannot explicitly distinguish the role of motion in a 3D model's
decision. In fact, it has been shown that these models do not appropriately
factor motion information into their decision. We propose a selective relevance
method for adapting the 2D explanation techniques to provide motion-specific
explanations, better aligning them with the human understanding of motion as
conceptually separate from static spatial features. We demonstrate the utility
of our method in conjunction with several widely-used 2D explanation methods,
and show that it improves explanation selectivity for motion. Our results show
that the selective relevance method can not only provide insight on the role
played by motion in the model's decision -- in effect, revealing and
quantifying the model's spatial bias -- but the method also simplifies the
resulting explanations for human consumption.
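To make the core idea concrete, here is a minimal sketch of motion-selective masking, assuming the selection is done by thresholding the frame-to-frame change of the input; the function name and the threshold `tau` are illustrative stand-ins, not the authors' released code:

```python
import numpy as np

def selective_relevance(relevance: np.ndarray,
                        clip: np.ndarray,
                        tau: float = 0.1) -> np.ndarray:
    """Keep only the relevance that coincides with temporal change.

    relevance: (T, H, W) saliency from any 2D explanation method
               (e.g., LRP or grad*input) applied to a 3D CNN input.
    clip:      (T, H, W) grayscale input clip in [0, 1].
    tau:       illustrative threshold on frame-to-frame change.
    """
    # Temporal derivative of the input: where does the clip change?
    motion = np.abs(np.diff(clip, axis=0, prepend=clip[:1]))
    # Discard relevance at static locations; what survives is the
    # motion-specific part of the explanation.
    return np.where(motion > tau, relevance, 0.0)
```

The fraction of relevance that survives the mask then gives a rough, quantitative handle on how much of the decision rests on motion rather than static appearance, which is how the abstract frames the model's spatial bias.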
Related papers
- Local Agnostic Video Explanations: a Study on the Applicability of
Removal-Based Explanations to Video [0.6906005491572401]
We present a unified framework for local explanations in the video domain.
Our contributions include: (1) Extending a fine-grained explanation framework tailored for computer vision data, (2) Adapting six existing explanation techniques to work on video data, and (3) Conducting an evaluation and comparison of the adapted explanation methods.
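As a rough illustration of what a removal-based explanation means once adapted to video, here is a generic spatio-temporal occlusion sketch (not the paper's framework; `model_fn` and the patch size are assumptions):

```python
import numpy as np

def occlusion_saliency(model_fn, clip, target, patch=(4, 16, 16)):
    """Generic removal-based saliency for a video clip.

    model_fn: callable mapping a (T, H, W, C) clip to class scores.
    clip:     input video, shape (T, H, W, C).
    target:   index of the class being explained.
    patch:    size of the spatio-temporal region removed at a time.
    """
    T, H, W, _ = clip.shape
    base = model_fn(clip)[target]
    sal = np.zeros((T, H, W))
    pt, ph, pw = patch
    for t in range(0, T, pt):
        for y in range(0, H, ph):
            for x in range(0, W, pw):
                occluded = clip.copy()
                occluded[t:t+pt, y:y+ph, x:x+pw] = 0.0  # remove region
                drop = base - model_fn(occluded)[target]
                sal[t:t+pt, y:y+ph, x:x+pw] = drop  # score drop = importance
    return sal
```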
arXiv Detail & Related papers (2024-01-22T09:53:20Z)
- Manipulating Feature Visualizations with Gradient Slingshots [54.31109240020007]
We introduce a novel method for manipulating Feature Visualization (FV) without significantly impacting the model's decision-making process.
We evaluate the effectiveness of our method on several neural network models and demonstrate its capabilities to hide the functionality of arbitrarily chosen neurons.
arXiv Detail & Related papers (2024-01-11T18:57:17Z)
- Unsupervised 3D Pose Estimation with Non-Rigid Structure-from-Motion Modeling [83.76377808476039]
We propose a new modeling method for human pose deformations and design an accompanying diffusion-based motion prior.
Inspired by the field of non-rigid structure-from-motion, we divide the task of reconstructing 3D human skeletons in motion into the estimation of a 3D reference skeleton and a frame-by-frame skeleton deformation.
A mixed spatial-temporal NRSfMformer is used to simultaneously estimate the 3D reference skeleton and the per-frame skeleton deformation from a sequence of 2D observations.
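The split the summary describes reduces to a simple composition rule; a hedged sketch, with shapes and names assumed rather than taken from the paper:

```python
import numpy as np

def reconstruct_skeletons(reference: np.ndarray,
                          deformations: np.ndarray) -> np.ndarray:
    """Compose per-frame 3D skeletons from the NRSfM-style split.

    reference:    (J, 3) reference skeleton shared across the clip.
    deformations: (T, J, 3) per-frame offsets predicted from the
                  2D observation sequence.
    Returns:      (T, J, 3) skeleton for each frame.
    """
    # Each frame's pose = shared reference + that frame's deformation.
    return reference[None, :, :] + deformations
```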
arXiv Detail & Related papers (2023-08-18T16:41:57Z)
- Learning Scene Flow With Skeleton Guidance For 3D Action Recognition [1.5954459915735735]
This work demonstrates the use of 3D flow sequences by a deep spatio-temporal model for 3D action recognition.
An extended deep skeleton model is also introduced to learn the most discriminative action motion dynamics.
A late fusion scheme is adopted between the two models for learning the high level cross-modal correlations.
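For context, a late fusion scheme in its simplest form just combines the two streams' class scores; the fixed weight below is an illustrative stand-in for the learned cross-modal combination the summary mentions:

```python
import numpy as np

def late_fusion(flow_logits: np.ndarray,
                skeleton_logits: np.ndarray,
                alpha: float = 0.5) -> np.ndarray:
    """Blend class scores from the flow and skeleton streams.

    alpha is an illustrative mixing weight; the paper instead learns
    the high-level cross-modal combination.
    """
    return alpha * flow_logits + (1.0 - alpha) * skeleton_logits
```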
arXiv Detail & Related papers (2023-06-23T04:14:25Z)
- OCTET: Object-aware Counterfactual Explanations [29.532969342297086]
We propose an object-centric framework for counterfactual explanation generation.
Our method, inspired by recent generative modeling works, encodes the query image into a latent space that is structured to ease object-level manipulations.
We conduct a set of experiments on counterfactual explanation benchmarks for driving scenes, and we show that our method can be adapted beyond classification.
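To make the mechanism concrete, a generic latent-space counterfactual search looks roughly like this (a sketch assuming a pretrained encoder/decoder pair; none of the names are from the paper):

```python
import torch

def latent_counterfactual(encoder, decoder, classifier, x, target,
                          steps: int = 200, lr: float = 0.05,
                          lam: float = 0.1) -> torch.Tensor:
    """Search the latent space for a minimal edit that flips the class.

    encoder/decoder: pretrained generative model (assumed available).
    classifier:      model being explained; maps images to logits.
    x:               query image batch; target: desired class index.
    """
    z0 = encoder(x).detach()
    z = z0.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    labels = torch.full((x.shape[0],), target,
                        dtype=torch.long, device=x.device)
    for _ in range(steps):
        opt.zero_grad()
        logits = classifier(decoder(z))
        # Push toward the target class while staying close to the query.
        loss = (torch.nn.functional.cross_entropy(logits, labels)
                + lam * (z - z0).pow(2).mean())
        loss.backward()
        opt.step()
    return decoder(z).detach()
```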
arXiv Detail & Related papers (2022-11-22T16:23:12Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- MoCaNet: Motion Retargeting in-the-wild via Canonicalization Networks [77.56526918859345]
We present a novel framework that brings the 3D motion task from controlled environments to in-the-wild scenarios.
It is capable of retargeting body motion from a character in a 2D monocular video to a 3D character without using any motion capture system or 3D reconstruction procedure.
arXiv Detail & Related papers (2021-12-19T07:52:05Z)
- Gradient Frequency Modulation for Visually Explaining Video Understanding Models [39.70146574042422]
We propose Frequency-based Extremal Perturbation (FEP) to explain a video understanding model's decisions.
We show in a range of experiments that FEP more faithfully represents the model's decisions than existing state-of-the-art methods.
arXiv Detail & Related papers (2021-11-01T19:07:58Z)
- HuMoR: 3D Human Motion Model for Robust Pose Estimation [100.55369985297797]
HuMoR is a 3D Human Motion Model for Robust Estimation of temporal pose and shape.
We introduce a conditional variational autoencoder, which learns a distribution of the change in pose at each step of a motion sequence.
We demonstrate that our model generalizes to diverse motions and body shapes after training on a large motion capture dataset.
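A compact sketch of the component the summary names, a conditional VAE over the change in pose conditioned on the previous pose (PyTorch; layer sizes are illustrative, not HuMoR's actual architecture):

```python
import torch
import torch.nn as nn

class PoseDeltaCVAE(nn.Module):
    """Conditional VAE over the change in pose between time steps.

    Encodes delta_t = pose_t - pose_{t-1}, conditioned on pose_{t-1};
    sampling z at test time yields plausible next-step transitions.
    """
    def __init__(self, pose_dim: int = 69, z_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(pose_dim * 2, 256), nn.ReLU(),
            nn.Linear(256, z_dim * 2),          # -> mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(z_dim + pose_dim, 256), nn.ReLU(),
            nn.Linear(256, pose_dim),           # -> predicted pose change
        )

    def forward(self, prev_pose, delta):
        stats = self.encoder(torch.cat([prev_pose, delta], dim=-1))
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.decoder(torch.cat([z, prev_pose], dim=-1))
        return recon, mu, logvar
```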
arXiv Detail & Related papers (2021-05-10T21:04:55Z)
- Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatio-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
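A minimal message-passing layer of the general kind described (the mean aggregation and GRU-style update below are illustrative assumptions, not the paper's design):

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One round of message passing over scene entities."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.message = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, node_feats, adj):
        """node_feats: (N, D) entity features.
        adj: (N, N) adjacency encoding spatio-temporal relations."""
        msgs = self.message(node_feats)                 # per-node messages
        # Mean over neighbors (clamp avoids division by zero).
        agg = adj @ msgs / adj.sum(-1, keepdim=True).clamp(min=1)
        return self.update(agg, node_feats)             # GRU-style node update
```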
arXiv Detail & Related papers (2021-03-29T14:37:35Z)
- Residual Frames with Efficient Pseudo-3D CNN for Human Action Recognition [10.185425416255294]
We propose to use residual frames as an alternative "lightweight" motion representation.
We also develop a new pseudo-3D convolution module which decouples 3D convolution into 2D and 1D convolution.
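Both ideas here are easy to make concrete: residual frames are plain frame differences, and a pseudo-3D block factorizes a 3D convolution into a 2D spatial convolution followed by a 1D temporal one (a sketch; channel sizes are arbitrary):

```python
import torch
import torch.nn as nn

def residual_frames(clip: torch.Tensor) -> torch.Tensor:
    """Frame differences as a lightweight motion representation.

    clip: (B, C, T, H, W) video tensor; returns (B, C, T-1, H, W).
    """
    return clip[:, :, 1:] - clip[:, :, :-1]

class Pseudo3DBlock(nn.Module):
    """3D convolution decoupled into 2D (spatial) + 1D (temporal)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Kernel (1, 3, 3): convolve over H and W only.
        self.spatial = nn.Conv3d(in_ch, out_ch, (1, 3, 3), padding=(0, 1, 1))
        # Kernel (3, 1, 1): convolve over T only.
        self.temporal = nn.Conv3d(out_ch, out_ch, (3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):
        return self.temporal(torch.relu(self.spatial(x)))
```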
arXiv Detail & Related papers (2020-08-03T17:40:17Z)