A Deeper Dive Into What Deep Spatiotemporal Networks Encode: Quantifying
Static vs. Dynamic Information
- URL: http://arxiv.org/abs/2206.02846v1
- Date: Mon, 6 Jun 2022 18:39:37 GMT
- Title: A Deeper Dive Into What Deep Spatiotemporal Networks Encode: Quantifying
Static vs. Dynamic Information
- Authors: Matthew Kowal, Mennatullah Siam, Md Amirul Islam, Neil D. B. Bruce,
Richard P. Wildes, Konstantinos G. Derpanis
- Abstract summary: We analyse two widely studied tasks, action recognition and video object segmentation.
Most examined models are biased toward static information.
Certain two-stream architectures with cross-connections show a better balance between the static and dynamic information captured.
- Score: 34.595367958746856
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep spatiotemporal models are used in a variety of computer vision tasks,
such as action recognition and video object segmentation. Currently, there is a
limited understanding of what information is captured by these models in their
intermediate representations. For example, while it has been observed that
action recognition algorithms are heavily influenced by visual appearance in
single static frames, there is no quantitative methodology for evaluating such
static bias in the latent representation compared to bias toward dynamic
information (e.g. motion). We tackle this challenge by proposing a novel
approach for quantifying the static and dynamic biases of any spatiotemporal
model. To show the efficacy of our approach, we analyse two widely studied
tasks, action recognition and video object segmentation. Our key findings are
threefold: (i) Most examined spatiotemporal models are biased toward static
information, although certain two-stream architectures with cross-connections
show a better balance between the static and dynamic information captured. (ii)
Some datasets that are commonly assumed to be biased toward dynamics are
actually biased toward static information. (iii) Individual units (channels) in
an architecture can be biased toward static, dynamic or a combination of the
two.
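The abstract does not spell out the quantification procedure, but the core idea of per-unit bias estimation can be illustrated with paired clips that share only static content (appearance) or only dynamic content (motion). The following is a minimal, hypothetical PyTorch sketch, not the authors' released code: the pairing strategy, the `encoder` interface, and the helper names (`channel_correlation`, `static_dynamic_bias`) are assumptions made for illustration. A channel is counted as static-biased if its responses correlate more strongly across shared-appearance pairs than across shared-motion pairs.

```python
# Hypothetical sketch of per-channel static vs. dynamic bias estimation.
# Assumes an encoder mapping batched clips (B, C, T, H, W) to pooled
# features (B, channels), and iterables of paired clip batches that share
# either static content (appearance) or dynamic content (motion).
import torch


def channel_correlation(feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
    """Per-channel Pearson correlation between two (num_pairs, channels) batches."""
    a = feats_a - feats_a.mean(dim=0, keepdim=True)
    b = feats_b - feats_b.mean(dim=0, keepdim=True)
    cov = (a * b).mean(dim=0)
    denom = a.pow(2).mean(dim=0).sqrt() * b.pow(2).mean(dim=0).sqrt() + 1e-8
    return cov / denom


@torch.no_grad()
def static_dynamic_bias(encoder, static_pairs, dynamic_pairs):
    """Fraction of channels responding more consistently to shared static
    content than to shared dynamic content."""

    def paired_corr(pairs):
        feats_a = torch.cat([encoder(a) for a, _ in pairs], dim=0)
        feats_b = torch.cat([encoder(b) for _, b in pairs], dim=0)
        return channel_correlation(feats_a, feats_b)

    corr_static = paired_corr(static_pairs)    # high where a channel tracks appearance
    corr_dynamic = paired_corr(dynamic_pairs)  # high where a channel tracks motion
    static_fraction = (corr_static > corr_dynamic).float().mean().item()
    return {"static_fraction": static_fraction,
            "dynamic_fraction": 1.0 - static_fraction}
```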
Related papers
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes.
By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously only used for static scenes, to dynamic scenes.
We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z)
- EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone.
We show significant improvements in terms of both dynamic object and background reconstruction quality compared to the state-of-the-art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z)
- Quantifying and Learning Static vs. Dynamic Information in Deep Spatiotemporal Networks [29.47784194895489]
Action recognition, automatic video object segmentation (AVOS) and video instance segmentation (VIS) are studied.
Most examined models are biased toward static information.
Some datasets that are assumed to be biased toward dynamics are actually biased toward static information.
arXiv Detail & Related papers (2022-11-03T13:17:53Z)
- DyTed: Disentangled Representation Learning for Discrete-time Dynamic Graph [59.583555454424]
We propose a novel disenTangled representation learning framework for discrete-time Dynamic graphs, namely DyTed.
We specially design a temporal-clips contrastive learning task together with a structure contrastive learning task to effectively identify the time-invariant and time-varying representations, respectively.
arXiv Detail & Related papers (2022-10-19T14:34:12Z)
- STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding [68.96574451918458]
We propose a framework named STVGFormer, which models visual-linguistic dependencies with a static branch and a dynamic branch.
Both the static and dynamic branches are designed as cross-modal transformers.
Our proposed method achieved 39.6% vIoU and won first place in the HC-STVG track of the Person in Context Challenge.
arXiv Detail & Related papers (2022-07-06T15:48:58Z)
- Learning Interacting Dynamical Systems with Latent Gaussian Process ODEs [13.436770170612295]
We study for the first time uncertainty-aware modeling of continuous-time dynamics of interacting objects.
Our model infers both independent dynamics and their interactions with reliable uncertainty estimates.
arXiv Detail & Related papers (2022-05-24T08:36:25Z)
- MoCo-Flow: Neural Motion Consensus Flow for Dynamic Humans in Stationary Monocular Cameras [98.40768911788854]
We introduce MoCo-Flow, a representation that models the dynamic scene using a 4D continuous time-variant function.
At the heart of our work lies a novel optimization formulation, which is constrained by a motion consensus regularization on the motion flow.
We extensively evaluate MoCo-Flow on several datasets that contain human motions of varying complexity.
arXiv Detail & Related papers (2021-06-08T16:03:50Z)
- TCL: Transformer-based Dynamic Graph Modelling via Contrastive Learning [87.38675639186405]
We propose a novel graph neural network approach, called TCL, which deals with the dynamically-evolving graph in a continuous-time fashion.
To the best of our knowledge, this is the first attempt to apply contrastive learning to representation learning on dynamic graphs.
arXiv Detail & Related papers (2021-05-17T15:33:25Z)
- A Gated Fusion Network for Dynamic Saliency Prediction [16.701214795454536]
The Gated Fusion Network for dynamic saliency (GFSalNet) is the first deep saliency model capable of making predictions in a dynamic way via a gated fusion mechanism.
We show that it has a good generalization ability, and moreover, exploits temporal information more effectively via its adaptive fusion scheme.
arXiv Detail & Related papers (2021-02-15T17:18:37Z)