Learning Self-Similarity in Space and Time as Generalized Motion for
Action Recognition
- URL: http://arxiv.org/abs/2102.07092v1
- Date: Sun, 14 Feb 2021 07:32:55 GMT
- Title: Learning Self-Similarity in Space and Time as Generalized Motion for
Action Recognition
- Authors: Heeseung Kwon, Manjin Kim, Suha Kwak, Minsu Cho
- Abstract summary: We propose a rich motion representation based on video spatio-temporal self-similarity (STSS).
We leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it.
The proposed neural block, dubbed SELFY, can be easily inserted into neural architectures and trained end-to-end without additional supervision.
- Score: 42.175450800733785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatio-temporal convolution often fails to learn motion dynamics in videos
and thus an effective motion representation is required for video understanding
in the wild. In this paper, we propose a rich and robust motion representation
based on spatio-temporal self-similarity (STSS). Given a sequence of frames,
STSS represents each local region as similarities to its neighbors in space and
time. By converting appearance features into relational values, it enables the
learner to better recognize structural patterns in space and time. We leverage
the whole volume of STSS and let our model learn to extract an effective motion
representation from it. The proposed neural block, dubbed SELFY, can be easily
inserted into neural architectures and trained end-to-end without additional
supervision. With a sufficient volume of the neighborhood in space and time, it
effectively captures long-term interaction and fast motion in the video,
leading to robust action recognition. Our experimental analysis demonstrates
its superiority over previous methods for motion modeling as well as its
complementarity to spatio-temporal features from direct convolution. On the
standard action recognition benchmarks, Something-Something-V1 & V2, Diving-48,
and FineGym, the proposed method achieves state-of-the-art results.
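The core idea of the abstract can be made concrete with a small sketch: given per-frame feature maps, STSS assigns each position the cosine similarities between its feature vector and those of its neighbors within a spatio-temporal window. The function below is a hedged illustration of that computation in NumPy, not the paper's SELFY block; the tensor layout, the zero-padding at boundaries, and the neighborhood parameterization are assumptions made for clarity.

```python
import numpy as np

def stss_volume(feats, temporal_radius=1, spatial_radius=1):
    """Illustrative spatio-temporal self-similarity (STSS) volume.

    feats: array of shape (T, H, W, C) -- per-frame feature maps.
    Returns an array of shape (T, H, W, L, U, U), where L = 2*temporal_radius+1
    and U = 2*spatial_radius+1, holding the cosine similarity between each
    position and its neighbor at every (dt, dy, dx) offset in the window.
    Out-of-bounds neighbors contribute similarity 0 (zero-padding).
    """
    T, H, W, C = feats.shape
    # L2-normalize features so that dot products become cosine similarities.
    norm = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    L = 2 * temporal_radius + 1
    U = 2 * spatial_radius + 1
    out = np.zeros((T, H, W, L, U, U), dtype=feats.dtype)
    for dt in range(-temporal_radius, temporal_radius + 1):
        for dy in range(-spatial_radius, spatial_radius + 1):
            for dx in range(-spatial_radius, spatial_radius + 1):
                # Build shifted[t, y, x] = norm[t+dt, y+dy, x+dx],
                # zero where the neighbor falls outside the volume.
                shifted = np.zeros_like(norm)
                t0, t1 = max(-dt, 0), min(T - dt, T)
                y0, y1 = max(-dy, 0), min(H - dy, H)
                x0, x1 = max(-dx, 0), min(W - dx, W)
                shifted[t0:t1, y0:y1, x0:x1] = norm[t0 + dt:t1 + dt,
                                                    y0 + dy:y1 + dy,
                                                    x0 + dx:x1 + dx]
                # Per-position cosine similarity with that neighbor.
                sim = (norm * shifted).sum(-1)
                out[:, :, :, dt + temporal_radius,
                    dy + spatial_radius, dx + spatial_radius] = sim
    return out
```

Note that this converts appearance features into relational values, as the abstract describes: the output no longer depends on the absolute feature content, only on how each region relates to its spatio-temporal neighborhood. SELFY then learns to extract motion features from this volume end-to-end.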
Related papers
- Spatio-Temporal Branching for Motion Prediction using Motion Increments [55.68088298632865]
Human motion prediction (HMP) has emerged as a popular research topic due to its diverse applications.
Traditional methods rely on hand-crafted features and machine learning techniques.
We propose a novel spatio-temporal branching network using incremental information for HMP.
arXiv Detail & Related papers (2023-08-02T12:04:28Z) - Deeply-Coupled Convolution-Transformer with Spatial-temporal
Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework could attain better performances than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - Spatio-temporal Tendency Reasoning for Human Body Pose and Shape
Estimation from Videos [10.50306784245168]
We present a spatio-temporal tendency reasoning (STR) network for recovering human body pose and shape from videos.
Our STR aims to learn accurate and natural motion sequences in an unconstrained environment.
Our STR remains competitive with the state-of-the-art on three datasets.
arXiv Detail & Related papers (2022-10-07T16:09:07Z) - Behavior Recognition Based on the Integration of Multigranular Motion
Features [17.052997301790693]
We propose a novel behavior recognition method based on the integration of multigranular (IMG) motion features.
We evaluate our model on several action recognition benchmarks such as HMDB51, Something-Something and UCF101.
arXiv Detail & Related papers (2022-03-07T02:05:26Z) - Spatiotemporal Inconsistency Learning for DeepFake Video Detection [51.747219106855624]
We present a novel temporal modeling paradigm in TIM by exploiting the temporal difference over adjacent frames along with both horizontal and vertical directions.
The ISM simultaneously utilizes the spatial information from SIM and temporal information from TIM to establish a more comprehensive spatial-temporal representation.
arXiv Detail & Related papers (2021-09-04T13:05:37Z) - TSI: Temporal Saliency Integration for Video Action Recognition [32.18535820790586]
We propose a Temporal Saliency Integration (TSI) block, which mainly contains a Salient Motion Excitation (SME) module and a Cross-scale Temporal Integration (CTI) module.
SME aims to highlight the motion-sensitive area through local-global motion modeling.
CTI is designed to perform multi-scale temporal modeling through a group of separate 1D convolutions.
arXiv Detail & Related papers (2021-06-02T11:43:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.