Adaptive occlusion sensitivity analysis for visually explaining video recognition networks
- URL: http://arxiv.org/abs/2207.12859v2
- Date: Thu, 17 Aug 2023 08:36:46 GMT
- Title: Adaptive occlusion sensitivity analysis for visually explaining video recognition networks
- Authors: Tomoki Uchiyama, Naoya Sogi, Satoshi Iizuka, Koichiro Niinuma, Kazuhiro Fukui
- Abstract summary: Occlusion sensitivity analysis is commonly used to analyze single-image classification. This paper proposes a method for visually explaining the decision-making process of video recognition networks.
- Score: 12.75077781554099
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a method for visually explaining the decision-making process of video recognition networks through a temporal extension of occlusion sensitivity analysis, called Adaptive Occlusion Sensitivity Analysis (AOSA). The key idea is to occlude a specific volume of the input spatio-temporal (3D) data with a 3D mask and measure how much the output score changes; an occluded volume that produces a larger change is regarded as more critical for classification. While occlusion sensitivity analysis is commonly used to analyze single-image classification, applying the idea to video classification is not straightforward, because a simple fixed cuboid cannot follow complicated motions. To solve this issue, we adaptively set the shape of the 3D occlusion mask according to motion, exploiting the temporal continuity and spatial co-occurrence of the optical flows extracted from the input video. We further propose a novel method to reduce the computational cost of our approach, based on a first-order approximation of the output score with respect to the input video. We demonstrate the effectiveness of our method through extensive comparisons with conventional methods on the deletion/insertion and pointing metrics, using the UCF101, Kinetics-400, and Kinetics-700 datasets.
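
As a rough sketch of the core idea (not the authors' released implementation), the Python snippet below scores a set of 3D occlusion masks by the drop each one causes in a video classifier's predicted-class score, and also illustrates the kind of first-order, gradient-based shortcut the abstract mentions, which estimates every mask's effect from a single backward pass. The classifier interface, the mask set, and the zero fill value are illustrative assumptions.

```python
import torch

def occlusion_scores(model, video, masks, fill=0.0):
    """Exact occlusion sensitivity: one forward pass per 3D mask.
    video: (1, C, T, H, W) clip; masks: list of (T, H, W) bool tensors,
    True where the input is occluded. Returns the score drop per mask.
    Hypothetical interface, for illustration only."""
    model.eval()
    with torch.no_grad():
        logits = model(video)
        cls = logits.argmax(dim=1).item()     # explain the predicted class
        base = logits[0, cls].item()
        drops = []
        for m in masks:
            occluded = video.clone()
            occluded[0, :, m] = fill          # blank out the masked voxels
            drops.append(base - model(occluded)[0, cls].item())
    return drops                              # larger drop = more critical

def occlusion_scores_first_order(model, video, masks, fill=0.0):
    """Approximate all mask scores with one backward pass, using the
    first-order Taylor expansion f(x + d) ~ f(x) + grad_f(x) . d, so the
    score drop for a mask is roughly -(grad * d) summed over the mask."""
    model.eval()
    x = video.clone().requires_grad_(True)
    logits = model(x)
    cls = logits.argmax(dim=1).item()
    logits[0, cls].backward()
    delta = (fill - video[0]).detach()        # perturbation occlusion applies
    contrib = (x.grad[0] * delta).sum(dim=0)  # per-voxel (T, H, W) effect
    return [-contrib[m].sum().item() for m in masks]
```

AOSA's masks are shaped adaptively from optical flow rather than being fixed cuboids. A crude stand-in for that adaptation (assuming precomputed per-frame flows; not the paper's exact procedure, which also uses spatial co-occurrence) is to grow a 2D seed region into a 3D mask by pushing its pixels along the flow from frame to frame:

```python
import torch

def flow_propagated_mask(seed, flows):
    """Grow a (H, W) bool seed region into a (T, H, W) bool mask by
    following optical flow; flows: (T-1, 2, H, W) with (dx, dy) per pixel."""
    T = flows.shape[0] + 1
    H, W = seed.shape
    mask = torch.zeros(T, H, W, dtype=torch.bool)
    mask[0] = seed
    for t in range(T - 1):
        ys, xs = torch.nonzero(mask[t], as_tuple=True)
        dx = flows[t, 0, ys, xs].round().long()
        dy = flows[t, 1, ys, xs].round().long()
        nx = (xs + dx).clamp(0, W - 1)        # warp pixels to the next frame
        ny = (ys + dy).clamp(0, H - 1)
        mask[t + 1, ny, nx] = True
    return mask
```

The first-order variant trades exactness for speed: it replaces one forward pass per mask with a single gradient computation, which matches the cost-reduction idea described in the abstract.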
Related papers
- NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows [60.291277312569285]
We present a method for automatically modifying a NeRF representation based on a single observation.
Our method defines the transformation as a 3D flow, specifically as a weighted linear blending of rigid transformations.
We also introduce a new dataset for exploring the problem of modifying a NeRF scene through a single observation.
arXiv Detail & Related papers (2024-06-15T07:58:08Z)
- Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection [19.643936110623653]
Video Anomaly Detection (VAD) aims to identify abnormalities within a specific context and timeframe.
Recent deep learning-based VAD models have shown promising results by generating high-resolution frames.
We propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task.
arXiv Detail & Related papers (2024-03-28T03:07:16Z)
- Variance-insensitive and Target-preserving Mask Refinement for Interactive Image Segmentation [68.16510297109872]
Point-based interactive image segmentation can ease the burden of mask annotation in applications such as semantic segmentation and image editing.
We introduce a novel method, Variance-Insensitive and Target-Preserving Mask Refinement, to enhance segmentation quality with fewer user inputs.
Experiments on the GrabCut, Berkeley, SBD, and DAVIS datasets demonstrate our method's state-of-the-art performance in interactive image segmentation.
arXiv Detail & Related papers (2023-12-22T02:31:31Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Match and Locate: low-frequency monocular odometry based on deep feature matching [0.65268245109828]
We introduce a novel approach to robotic odometry that requires only a single camera.
The approach is based on matching image features between consecutive frames of the video stream using deep feature matching models.
We evaluate the performance of the approach in the AISG-SLA Visual Localisation Challenge and find that, while being computationally efficient and easy to implement, our method shows competitive results.
arXiv Detail & Related papers (2023-11-16T17:32:58Z)
- Differentiable Frequency-based Disentanglement for Aerial Video Action Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z)
- Mixed Reality Depth Contour Occlusion Using Binocular Similarity Matching and Three-dimensional Contour Optimisation [3.9692358105634384]
Mixed reality applications often require virtual objects that are partly occluded by real objects.
Previous research and commercial products have limitations in terms of performance and efficiency.
arXiv Detail & Related papers (2022-03-04T13:16:40Z)
- Weakly Supervised Instance Segmentation using Motion Information via Optical Flow [3.0763099528432263]
We propose a two-stream encoder that leverages appearance and motion features extracted from images and optical flows.
Our results demonstrate that the proposed method improves the Average Precision of the state-of-the-art method by 3.1.
arXiv Detail & Related papers (2022-02-25T22:41:54Z)
- SiamPolar: Semi-supervised Realtime Video Object Segmentation with Polar Representation [6.108508667949229]
We propose a semi-supervised real-time method based on a Siamese network using a new polar representation.
The polar representation reduces the parameters needed for encoding masks, with only a subtle accuracy loss.
An asymmetric Siamese network is also developed to extract features from different spatial scales.
arXiv Detail & Related papers (2021-10-27T21:10:18Z)
- Weakly-supervised Learning For Catheter Segmentation in 3D Frustum Ultrasound [74.22397862400177]
We propose a novel frustum-ultrasound-based catheter segmentation method.
The proposed method achieved state-of-the-art performance with a runtime of 0.25 seconds per volume.
arXiv Detail & Related papers (2020-10-19T13:56:22Z)
- Reinforced Axial Refinement Network for Monocular 3D Object Detection [160.34246529816085]
Monocular 3D object detection aims to extract the 3D position and properties of objects from a 2D input image.
Conventional approaches sample 3D bounding boxes from the space and infer the relationship between the target object and each of them; however, the probability of effective samples is relatively small in the 3D space.
We propose to start with an initial prediction and refine it gradually towards the ground truth, changing only one 3D parameter in each step.
This requires a policy that receives a reward only after several steps, so we adopt reinforcement learning to optimize it.
arXiv Detail & Related papers (2020-08-31T17:10:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.