EAN: Event Adaptive Network for Enhanced Action Recognition
- URL: http://arxiv.org/abs/2107.10771v1
- Date: Thu, 22 Jul 2021 15:57:18 GMT
- Title: EAN: Event Adaptive Network for Enhanced Action Recognition
- Authors: Yuan Tian, Yichao Yan, Xiongkuo Min, Guo Lu, Guangtao Zhai, Guodong
Guo, and Zhiyong Gao
- Abstract summary: We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatial-temporal kernels to adaptively fit the diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine interactions among only a few selected foreground objects with a Transformer.
- Score: 66.81780707955852
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficiently modeling spatial-temporal information in videos is crucial for
action recognition. To achieve this goal, state-of-the-art methods typically
employ the convolution operator and dense interaction modules such as
non-local blocks. However, these methods cannot accurately fit the diverse
events in videos. On the one hand, the adopted convolutions have fixed
scales and thus struggle with events of various scales. On the other hand, the
dense interaction modeling paradigm achieves only sub-optimal performance because
action-irrelevant parts introduce additional noise into the final prediction. In
this paper, we propose a unified action recognition framework to investigate
the dynamic nature of video content by introducing the following designs.
First, when extracting local cues, we generate dynamic-scale spatial-temporal
kernels to adaptively fit the diverse events. Second, to accurately
aggregate these cues into a global video representation, we propose to mine
interactions among only a few selected foreground objects with a Transformer,
which yields a sparse interaction paradigm. We call the proposed framework the
Event Adaptive Network (EAN) because both key designs are adaptive to the input
video content. To exploit the short-term motions within local segments, we
propose a novel and efficient Latent Motion Code (LMC) module, further improving
the performance of the framework. Extensive experiments on several large-scale
video datasets, e.g., Something-Something V1 & V2, Kinetics, and Diving48,
verify that our models achieve state-of-the-art or competitive performance at
low FLOPs. Code is available at:
https://github.com/tianyuan168326/EAN-Pytorch.
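The two adaptive designs described in the abstract can be illustrated with a minimal PyTorch-style sketch. It is written from the abstract alone, not from the released EAN-Pytorch code: the module names (DynamicScaleConv, SparseForegroundTransformer), the scale-mixing gate, and the top-k token selection used as a stand-in for foreground-object selection are all illustrative assumptions, and the Latent Motion Code (LMC) module is omitted.

```python
# Illustrative sketch only; see the linked repository for the actual EAN code.
import torch
import torch.nn as nn


class DynamicScaleConv(nn.Module):
    """Assumed form of dynamic-scale spatial-temporal kernels: parallel 3D
    convolutions at several scales, mixed with weights predicted from the
    input clip so the effective kernel scale adapts to each event."""

    def __init__(self, channels, scales=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv3d(channels, channels, kernel_size=k, padding=k // 2)
             for k in scales]
        )
        # Predict one mixing weight per scale from globally pooled features.
        self.gate = nn.Linear(channels, len(scales))

    def forward(self, x):                              # x: (B, C, T, H, W)
        context = x.mean(dim=(2, 3, 4))                # (B, C)
        weights = self.gate(context).softmax(dim=-1)   # (B, num_scales)
        outs = torch.stack([b(x) for b in self.branches], dim=1)  # (B, S, C, T, H, W)
        return (weights[:, :, None, None, None, None] * outs).sum(dim=1)


class SparseForegroundTransformer(nn.Module):
    """Assumed sparse interaction step: keep only the top-k most salient
    spatial-temporal tokens (a stand-in for 'a few selected foreground
    objects') and let a Transformer model interactions among them alone."""

    def __init__(self, channels, k=8, heads=4):
        super().__init__()
        self.score = nn.Linear(channels, 1)            # per-token saliency
        self.k = k
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x):                              # x: (B, C, T, H, W)
        b, c = x.shape[:2]
        tokens = x.flatten(2).transpose(1, 2)          # (B, T*H*W, C)
        saliency = self.score(tokens).squeeze(-1)      # (B, T*H*W)
        idx = saliency.topk(self.k, dim=1).indices     # (B, k) selected tokens
        picked = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, c))
        return self.encoder(picked).mean(dim=1)        # (B, C) clip descriptor


feat = torch.randn(2, 64, 8, 14, 14)                   # dummy clip features
local = DynamicScaleConv(64)(feat)                     # adaptive local cues
clip_vec = SparseForegroundTransformer(64)(local)      # sparse global aggregation
```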
Related papers
- Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z)
- An end-to-end multi-scale network for action prediction in videos [31.967024536359908]
We develop an efficient multi-scale network to predict action classes in partial videos in an end-to-end manner.
Our E2EMSNet is evaluated on three challenging datasets: BIT, HMDB51, and UCF101.
arXiv Detail & Related papers (2022-12-31T06:58:41Z)
- Efficient Unsupervised Video Object Segmentation Network Based on Motion Guidance [1.5736899098702974]
This paper proposes a video object segmentation network based on motion guidance.
The model comprises a dual-stream network, motion guidance module, and multi-scale progressive fusion module.
The experimental results demonstrate the superior performance of the proposed method.
arXiv Detail & Related papers (2022-11-10T06:13:23Z)
- Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features because the same 2D convolution kernel is applied to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, while still suffering from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector (a generic version of this gating idea is sketched after this list).
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions containing the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
arXiv Detail & Related papers (2021-03-23T03:06:26Z)
- Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
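The channel-wise gating idea from the Learning Comprehensive Motion Representation entry above can be illustrated with a short sketch. This is a generic motion-gated channel attention block reconstructed from the one-sentence summary, not the CME module from that paper: the frame-difference motion cue, the bottleneck ratio, and all tensor shapes are assumptions.

```python
import torch
import torch.nn as nn


class MotionChannelGate(nn.Module):
    """Generic sketch of a channel-wise motion gate: frame differences are
    pooled into a per-channel descriptor, squeezed through a small bottleneck,
    and turned into a sigmoid gate that re-weights motion-relevant channels."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, T, C, H, W)
        diff = x[:, 1:] - x[:, :-1]              # temporal differences as a motion cue
        desc = diff.abs().mean(dim=(1, 3, 4))    # (B, C) per-channel motion energy
        gate = self.fc(desc)                     # (B, C) channel-wise gate vector
        return x * gate[:, None, :, None, None]  # emphasize dynamic channels


frames = torch.randn(2, 8, 64, 14, 14)           # dummy frame features
gated = MotionChannelGate(64)(frames)            # same shape as the input
```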