Behavior Recognition Based on the Integration of Multigranular Motion Features
- URL: http://arxiv.org/abs/2203.03097v1
- Date: Mon, 7 Mar 2022 02:05:26 GMT
- Title: Behavior Recognition Based on the Integration of Multigranular Motion Features
- Authors: Lizong Zhang, Yiming Wang, Bei Hui, Xiujian Zhang, Sijuan Liu and Shuxin Feng
- Abstract summary: We propose a novel behavior recognition method based on the integration of multigranular (IMG) motion features.
We evaluate our model on several action recognition benchmarks such as HMDB51, Something-Something and UCF101.
- Score: 17.052997301790693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recognition of behaviors in videos usually requires a combinatorial
analysis of the spatial information about objects and their dynamic action
information in the temporal dimension. Specifically, behavior recognition may
even rely more on the modeling of temporal information containing short-range
and long-range motions; this contrasts with computer vision tasks involving
images that focus on the understanding of spatial information. However, current
solutions fail to jointly and comprehensively analyze short-range motion
between adjacent frames and long-range temporal aggregations at large scales in
videos. In this paper, we propose a novel behavior recognition method based on
the integration of multigranular (IMG) motion features. In particular, we
achieve reliable motion information modeling through the synergy of a channel
attention-based short-term motion feature enhancement module (CMEM) and a
cascaded long-term motion feature integration module (CLIM). We evaluate our
model on several action recognition benchmarks such as HMDB51,
Something-Something and UCF101. The experimental results demonstrate that our
approach outperforms the previous state-of-the-art methods, which confirms its
effectiveness and efficiency.
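The abstract gives no implementation details, but the CMEM description suggests channel attention driven by adjacent-frame feature differences. Below is a minimal PyTorch sketch of that idea; the class name, tensor layout, reduction ratio, and residual reweighting are all assumptions, not the authors' code.

```python
# Hypothetical sketch of a channel attention-based short-term motion
# enhancement module in the spirit of CMEM; not the authors' code.
import torch
import torch.nn as nn

class ShortTermMotionEnhancement(nn.Module):
    """Enhances per-frame features with channel attention derived from
    the difference between adjacent frames (assumed design)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        # Short-range motion cue: difference between adjacent frames,
        # zero-padded at the last time step.
        diff = torch.zeros_like(x)
        diff[:, :-1] = x[:, 1:] - x[:, :-1]
        # Channel attention weights from pooled motion features.
        pooled = self.pool(diff.reshape(b * t, c, h, w)).reshape(b * t, c)
        weights = self.gate(pooled).reshape(b, t, c, 1, 1)
        # Residually reweight appearance features by motion-sensitive channels.
        return x + x * weights

x = torch.randn(2, 8, 64, 14, 14)
module = ShortTermMotionEnhancement(64)
print(module(x).shape)  # torch.Size([2, 8, 64, 14, 14])
```

In this reading, the frame difference acts as a cheap short-range motion signal, and the sigmoid gate decides which channels it should amplify.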
Related papers
- Spatial-Temporal Multi-level Association for Video Object Segmentation [89.32226483171047]
This paper proposes spatial-temporal multi-level association, which jointly associates reference frame, test frame, and object features.
Specifically, we construct a spatial-temporal multi-level feature association module to learn better target-aware features.
arXiv Detail & Related papers (2024-04-09T12:44:34Z)
- Spatio-Temporal Branching for Motion Prediction using Motion Increments [55.68088298632865]
Human motion prediction (HMP) has emerged as a popular research topic due to its diverse applications.
Traditional methods rely on hand-crafted features and machine learning techniques.
We propose a novel spatio-temporal branching network using incremental information for HMP.
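As a concrete illustration of the incremental-information idea (a sketch, not the paper's model): motion increments can be defined as frame-to-frame joint displacements, which convert losslessly back to absolute poses.

```python
# Hypothetical illustration of the "motion increment" idea: predicting
# frame-to-frame displacements instead of absolute joint positions.
import torch

def to_increments(poses: torch.Tensor) -> torch.Tensor:
    """poses: (time, joints, 3) -> increments: (time - 1, joints, 3)."""
    return poses[1:] - poses[:-1]

def from_increments(first_pose: torch.Tensor,
                    increments: torch.Tensor) -> torch.Tensor:
    """Reconstruct absolute poses by cumulatively summing increments."""
    return first_pose + torch.cumsum(increments, dim=0)

poses = torch.randn(10, 22, 3)          # 10 frames, 22 joints, xyz
inc = to_increments(poses)
rebuilt = from_increments(poses[0], inc)
print(torch.allclose(rebuilt, poses[1:], atol=1e-6))  # True
```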
arXiv Detail & Related papers (2023-08-02T12:04:28Z)
- Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z)
- Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) that extracts action visual tempo from low-level backbone features at a single layer.
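A hedged sketch of how temporal correlation might expose visual tempo: comparing features at several temporal strides, where slowly evolving actions stay highly correlated even at large strides. The strides and pooling below are assumptions, not TCM's published design.

```python
# Hypothetical sketch of extracting a visual-tempo cue via temporal
# correlation at multiple strides.
import torch
import torch.nn.functional as F

def temporal_correlation(feats: torch.Tensor, strides=(1, 2, 4)) -> torch.Tensor:
    """feats: (batch, time, channels). Returns per-stride mean cosine
    similarity between frames `s` steps apart, shape (batch, len(strides))."""
    out = []
    for s in strides:
        a = F.normalize(feats[:, :-s], dim=-1)
        b = F.normalize(feats[:, s:], dim=-1)
        out.append((a * b).sum(-1).mean(dim=1))   # (batch,)
    return torch.stack(out, dim=1)

feats = torch.randn(2, 16, 256)
print(temporal_correlation(feats).shape)  # torch.Size([2, 3])
```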
arXiv Detail & Related papers (2022-02-24T14:20:04Z)
- Long-Short Temporal Modeling for Efficient Action Recognition [32.159784061961886]
We propose a new two-stream action recognition network, termed MENet, consisting of a Motion Enhancement (ME) module and a Video-level Aggregation (VLA) module.
For short-term motions, we design an efficient ME module to enhance the short-term motions by mingling the motion saliency among neighboring segments.
As for long-term aggregations, VLA is adopted at the top of the appearance branch to integrate the long-term dependencies across all segments.
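As a rough sketch of the video-level aggregation step, segment features can be pooled with learned attention weights; the attention-weighted pooling here is an assumption, not MENet's published design.

```python
# Hypothetical sketch of video-level aggregation over segment features
# with learned attention weights.
import torch
import torch.nn as nn

class SegmentAggregation(nn.Module):
    """Aggregates per-segment features into one video-level descriptor."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one attention logit per segment

    def forward(self, segs: torch.Tensor) -> torch.Tensor:
        # segs: (batch, segments, dim)
        weights = torch.softmax(self.score(segs), dim=1)  # (batch, segments, 1)
        return (weights * segs).sum(dim=1)                # (batch, dim)

segs = torch.randn(4, 8, 512)
print(SegmentAggregation(512)(segs).shape)  # torch.Size([4, 512])
```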
arXiv Detail & Related papers (2021-06-30T02:54:13Z)
- TSI: Temporal Saliency Integration for Video Action Recognition [32.18535820790586]
We propose a Temporal Saliency Integration (TSI) block, which mainly contains a Salient Motion Excitation (SME) module and a Cross-scale Temporal Integration (CTI) module.
SME aims to highlight the motion-sensitive area through local-global motion modeling.
CTI performs multi-scale temporal modeling through a group of separate 1D convolutions, one per temporal scale.
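A minimal sketch of such a group of separate 1D temporal convolutions; the depthwise convolutions and simple averaging across scales are both assumptions.

```python
# Hypothetical sketch of cross-scale temporal integration with a group of
# separate 1D convolutions over the time axis; kernel sizes are assumed.
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); average the per-scale responses.
        return torch.stack([b(x) for b in self.branches]).mean(dim=0)

x = torch.randn(2, 64, 16)
print(MultiScaleTemporalConv(64)(x).shape)  # torch.Size([2, 64, 16])
```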
arXiv Detail & Related papers (2021-06-02T11:43:49Z)
- Modeling long-term interactions to enhance action recognition [81.09859029964323]
We propose a new approach to understanding actions in egocentric videos that exploits the semantics of object interactions at both frame and temporal levels.
We use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects.
The proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks.
arXiv Detail & Related papers (2021-04-23T10:08:15Z)
- Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections while still suffering the limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
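The SME description suggests deriving a motion map from point-to-point similarity between adjacent feature maps. A hedged sketch follows, where low cosine similarity marks likely motion; the exact weighting scheme is assumed.

```python
# Hypothetical sketch of spatial motion emphasis from point-to-point
# similarity between adjacent frames' feature maps.
import torch
import torch.nn.functional as F

def motion_saliency(prev: torch.Tensor, curr: torch.Tensor) -> torch.Tensor:
    """prev, curr: (batch, channels, h, w). Low similarity between the
    same spatial location in adjacent frames suggests motion there."""
    sim = F.cosine_similarity(prev, curr, dim=1)      # (batch, h, w)
    return (1.0 - sim).unsqueeze(1)                   # motion map in [0, 2]

prev = torch.randn(2, 64, 14, 14)
curr = torch.randn(2, 64, 14, 14)
enhanced = curr * (1.0 + motion_saliency(prev, curr))  # emphasize moving regions
print(enhanced.shape)  # torch.Size([2, 64, 14, 14])
```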
arXiv Detail & Related papers (2021-03-23T03:06:26Z)
- Learning Self-Similarity in Space and Time as Generalized Motion for Action Recognition [42.175450800733785]
We propose a rich motion representation based on spatio-temporal self-similarity (STSS).
We leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it.
The proposed neural block, dubbed SELFY, can be easily inserted into neural architectures and trained end-to-end without additional supervision.
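A minimal sketch of building an STSS-style volume over adjacent frames, assuming cosine similarity within a small spatial neighborhood; the radius and normalization are assumptions, not SELFY's exact construction.

```python
# Hypothetical sketch of a spatio-temporal self-similarity (STSS) volume:
# similarity of each position to a local neighborhood in the next frame.
import torch
import torch.nn.functional as F

def stss_volume(feats: torch.Tensor, radius: int = 2) -> torch.Tensor:
    """feats: (time, channels, h, w). For each adjacent frame pair, compute
    cosine similarity between a position and the (2r+1)**2 spatial offsets
    around it in the next frame. Returns (time-1, (2r+1)**2, h, w)."""
    t, c, h, w = feats.shape
    a = F.normalize(feats[:-1], dim=1)
    b = F.normalize(feats[1:], dim=1)
    b = F.pad(b, (radius, radius, radius, radius))
    vols = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = b[:, :, dy:dy + h, dx:dx + w]
            vols.append((a * shifted).sum(dim=1))     # (time-1, h, w)
    return torch.stack(vols, dim=1)

feats = torch.randn(8, 64, 14, 14)
print(stss_volume(feats).shape)  # torch.Size([7, 25, 14, 14])
```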
arXiv Detail & Related papers (2021-02-14T07:32:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.