Slow Motion Matters: A Slow Motion Enhanced Network for Weakly
Supervised Temporal Action Localization
- URL: http://arxiv.org/abs/2211.11324v1
- Date: Mon, 21 Nov 2022 10:15:19 GMT
- Title: Slow Motion Matters: A Slow Motion Enhanced Network for Weakly
Supervised Temporal Action Localization
- Authors: Weiqi Sun, Rui Su, Qian Yu and Dong Xu
- Abstract summary: Weakly supervised temporal action localization aims to localize actions in untrimmed videos with only weak supervision information.
It is hard to explore salient slow-motion information from videos at "normal" speed.
We propose a novel framework termed the Slow Motion Enhanced Network (SMEN) to improve a WTAL network by compensating for its insensitivity to slow-motion action segments.
- Score: 31.54214885700785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised temporal action localization (WTAL) aims to localize
actions in untrimmed videos with only weak supervision information (e.g.
video-level labels). Most existing models process all input videos at a fixed
temporal scale. However, such models are not sensitive to actions whose
movements proceed at a pace different from the "normal" speed, especially
slow-motion action instances, which complete their movements much more slowly
than their normal-speed counterparts. This gives rise to the "slow-motion
blurred" issue: it is hard to extract salient slow-motion information from
videos sampled at the "normal" speed. In this paper, we propose a novel
framework termed the Slow Motion Enhanced Network (SMEN) to improve a WTAL
network by compensating for its insensitivity to slow-motion action segments.
The proposed SMEN comprises a Mining module and a Localization module. The
Mining module generates masks to mine slow-motion-related features by
exploiting the relationship between normal motion and slow motion, while the
Localization module leverages the mined slow-motion features as complementary
information to improve the temporal action localization results. Our proposed
framework can be easily adopted by existing WTAL networks, making them more
sensitive to slow-motion actions. Extensive experiments on three benchmarks
demonstrate the effectiveness of our proposed framework.
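The abstract describes the two modules only at a high level. As a minimal, hypothetical PyTorch sketch of the idea, assuming interpolation-based slow-motion simulation, a shared class-activation head, and simple mask-gated fusion (none of which are confirmed by the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiningModule(nn.Module):
    """Hypothetical: mask snippets where a slowed-down view of the video
    responds more strongly than the normal-speed view."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.cas_head = nn.Conv1d(dim, num_classes, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, D) snippet features at normal speed
        x = feats.transpose(1, 2)                      # (B, D, T)
        # Simulate slow motion by 2x temporal interpolation, then
        # resample back to T so the two views are comparable.
        slow = F.interpolate(x, scale_factor=2, mode="linear")
        slow = F.interpolate(slow, size=x.shape[-1], mode="linear")
        cas_normal = self.cas_head(x).amax(dim=1)      # (B, T) max class score
        cas_slow = self.cas_head(slow).amax(dim=1)
        # Mask snippets where the slow view is more confident.
        return (cas_slow > cas_normal).float()         # (B, T)

class LocalizationModule(nn.Module):
    """Hypothetical: fuse masked (slow-motion-related) features as a
    complementary stream before the final CAS head."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.fuse = nn.Conv1d(2 * dim, dim, kernel_size=1)
        self.cls = nn.Conv1d(dim, num_classes, kernel_size=1)

    def forward(self, feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        x = feats.transpose(1, 2)                      # (B, D, T)
        mined = x * mask.unsqueeze(1)                  # keep slow-motion snippets
        fused = self.fuse(torch.cat([x, mined], dim=1))
        return self.cls(fused).transpose(1, 2)         # (B, T, C) CAS

feats = torch.randn(2, 64, 256)                        # batch of snippet features
mask = MiningModule(256, 20)(feats)
cas = LocalizationModule(256, 20)(feats, mask)
print(cas.shape)                                       # torch.Size([2, 64, 20])
```

Here the mask simply marks snippets where the slowed-down view scores higher than the normal-speed view; the real SMEN mining and fusion designs may differ substantially.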
Related papers
- Motion meets Attention: Video Motion Prompts [34.429192862783054]
We propose a modified Sigmoid function with learnable slope and shift parameters as an attention mechanism to modulate motion signals from frame-differencing maps.
This approach generates a sequence of attention maps that enhance the processing of motion-related video content.
We show that our lightweight, plug-and-play motion prompt layer integrates seamlessly into models such as SlowFast, X3D, and TimeSformer.
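As a rough sketch of the described mechanism, assuming a single global slope `a` and shift `b` (the paper's exact parameterization may differ):

```python
import torch
import torch.nn as nn

class MotionPromptLayer(nn.Module):
    """Sketch: modulate frame-difference maps with a modified sigmoid
    whose slope (a) and shift (b) are learned."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(1.0))  # learnable slope
        self.b = nn.Parameter(torch.tensor(0.0))  # learnable shift

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, C, H, W)
        diff = (video[:, 1:] - video[:, :-1]).abs().mean(dim=2, keepdim=True)
        attn = torch.sigmoid(self.a * (diff - self.b))   # (B, T-1, 1, H, W)
        return video[:, 1:] * attn                       # motion-modulated frames

x = torch.rand(2, 8, 3, 32, 32)
out = MotionPromptLayer()(x)
print(out.shape)  # torch.Size([2, 7, 3, 32, 32])
```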
arXiv Detail & Related papers (2024-07-03T14:59:46Z)
- MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion [94.66090422753126]
MotionFollower is a lightweight score-guided diffusion model for video motion editing.
It delivers superior motion editing performance while uniquely supporting large camera movements and actions.
Compared with MotionEditor, the most advanced motion editing model, MotionFollower achieves an approximately 80% reduction in GPU memory.
arXiv Detail & Related papers (2024-05-30T17:57:30Z)
- Spectral Motion Alignment for Video Motion Transfer using Diffusion Models [54.32923808964701]
Spectral Motion Alignment (SMA) is a framework that refines and aligns motion vectors using Fourier and wavelet transforms.
SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics.
Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.
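A minimal sketch of what a frequency-domain motion regularizer of this kind could look like, assuming per-frame 2-D motion vectors and a simple low-frequency penalty (the `spectral_motion_loss` helper and the cutoff are illustrative, not SMA's actual formulation):

```python
import torch

def spectral_motion_loss(pred_motion: torch.Tensor,
                         ref_motion: torch.Tensor,
                         low_freq_bins: int = 8) -> torch.Tensor:
    """Sketch: penalize mismatch in the low-frequency temporal spectrum of
    motion vectors, where global (whole-frame) dynamics live.
    pred_motion / ref_motion: (B, T, 2) frame-to-frame displacements."""
    # Real FFT over the temporal axis.
    pred_spec = torch.fft.rfft(pred_motion, dim=1)
    ref_spec = torch.fft.rfft(ref_motion, dim=1)
    # Keep only the lowest-frequency bins: coarse, global motion.
    k = min(low_freq_bins, pred_spec.shape[1])
    return (pred_spec[:, :k] - ref_spec[:, :k]).abs().pow(2).mean()

pred = torch.randn(2, 16, 2, requires_grad=True)
ref = torch.randn(2, 16, 2)
loss = spectral_motion_loss(pred, ref)
loss.backward()  # differentiable, so usable as a training regularizer
print(loss.item())
```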
arXiv Detail & Related papers (2024-03-22T14:47:18Z)
- Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts [67.5094490054134]
We propose a practical framework, named Follow-Your-Click, to achieve image animation with a simple user click.
Our framework has simpler yet precise user control and better generation performance than previous methods.
arXiv Detail & Related papers (2024-03-13T05:44:37Z)
- MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot Action Recognition [50.345327516891615]
We develop a Motion-augmented Long-short Contrastive Learning (MoLo) method that contains two crucial components: a long-short contrastive objective and a motion autodecoder.
MoLo can simultaneously learn long-range temporal context and motion cues for comprehensive few-shot matching.
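A minimal sketch of a long-short contrastive objective in this spirit, assuming a global (long) video token contrasted against averaged frame-level (short) features in an InfoNCE loss; the motion autodecoder is omitted:

```python
import torch
import torch.nn.functional as F

def long_short_contrastive(global_feat: torch.Tensor,
                           local_feats: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Sketch: InfoNCE between each video's global (long) token and the
    mean of its frame-level (short) features; other videos in the batch
    act as negatives. global_feat: (B, D), local_feats: (B, T, D)."""
    anchor = F.normalize(global_feat, dim=-1)               # (B, D)
    positive = F.normalize(local_feats.mean(dim=1), dim=-1) # (B, D)
    logits = anchor @ positive.t() / temperature            # (B, B)
    targets = torch.arange(anchor.shape[0])                 # diagonal = positives
    return F.cross_entropy(logits, targets)

g = torch.randn(4, 128)
l = torch.randn(4, 10, 128)
print(long_short_contrastive(g, l).item())
```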
arXiv Detail & Related papers (2023-04-03T13:09:39Z)
- Treating Motion as Option to Reduce Motion Dependency in Unsupervised Video Object Segmentation [5.231219025536678]
Unsupervised video object segmentation (VOS) aims to detect the most salient object in a video sequence at the pixel level.
Most state-of-the-art methods leverage motion cues obtained from optical flow maps in addition to appearance cues to exploit the property that salient objects usually have distinctive movements compared to the background.
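A minimal sketch of treating motion as an option, assuming the flow stream is randomly dropped during training so the model cannot become dependent on it (module names and the drop rule are illustrative):

```python
import torch
import torch.nn as nn

class MotionAsOption(nn.Module):
    """Sketch: appearance + optional motion streams; randomly dropping
    the flow input during training reduces motion dependency."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.app_enc = nn.Conv2d(3, dim, 3, padding=1)   # RGB frame
        self.mot_enc = nn.Conv2d(2, dim, 3, padding=1)   # optical flow (u, v)
        self.head = nn.Conv2d(dim, 1, 1)                 # saliency logits

    def forward(self, frame, flow=None, p_drop: float = 0.5):
        x = self.app_enc(frame)
        if flow is not None and not (self.training and torch.rand(()) < p_drop):
            x = x + self.mot_enc(flow)   # motion is an option, not a requirement
        return self.head(x)

net = MotionAsOption()
frame = torch.rand(1, 3, 64, 64)
flow = torch.rand(1, 2, 64, 64)
print(net(frame, flow).shape)  # torch.Size([1, 1, 64, 64])
```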
arXiv Detail & Related papers (2022-09-04T18:05:52Z)
- Deep Motion Prior for Weakly-Supervised Temporal Action Localization [35.25323276744999]
Weakly-Supervised Temporal Action localization (WSTAL) aims to localize actions in untrimmed videos with only video-level labels.
Currently, most state-of-the-art WSTAL methods follow a Multi-Instance Learning (MIL) pipeline.
We argue that existing methods have overlooked two important drawbacks: 1) inadequate use of motion information and 2) the incompatibility of the prevailing cross-entropy training loss.
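For context, a minimal sketch of the standard MIL pipeline referred to here: snippet-level class activation scores are pooled over the top-k snippets into video-level logits, which is all that video-level labels can supervise (shapes and `k` are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MILHead(nn.Module):
    """Sketch of the common WSTAL MIL pipeline: snippet scores (CAS),
    top-k temporal pooling, then video-level cross-entropy."""
    def __init__(self, dim: int, num_classes: int, k: int = 8):
        super().__init__()
        self.cas = nn.Conv1d(dim, num_classes, kernel_size=1)
        self.k = k

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, D) -> CAS: (B, T, C)
        cas = self.cas(feats.transpose(1, 2)).transpose(1, 2)
        # Video-level logits: mean of the top-k snippet scores per class.
        topk = cas.topk(min(self.k, cas.shape[1]), dim=1).values
        return topk.mean(dim=1)                          # (B, C)

head = MILHead(256, 20)
video_logits = head(torch.randn(2, 64, 256))
labels = torch.tensor([3, 7])
loss = F.cross_entropy(video_logits, labels)             # video-level labels only
print(loss.item())
```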
arXiv Detail & Related papers (2021-08-12T08:51:36Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine interactions among only a few selected foreground objects using a Transformer.
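A minimal sketch of interaction mining over a few selected tokens, assuming a learned foreground score and top-k selection feeding a standard transformer layer (not EAN's actual architecture):

```python
import torch
import torch.nn as nn

class SelectedTokenInteraction(nn.Module):
    """Sketch: keep only the k highest-scoring (foreground) tokens and
    let a transformer layer model their interactions."""
    def __init__(self, dim: int = 128, k: int = 4):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                batch_first=True)
        self.k = k

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) local cues; select top-k as foreground.
        scores = self.score(tokens).squeeze(-1)            # (B, N)
        idx = scores.topk(self.k, dim=1).indices           # (B, k)
        picked = torch.gather(
            tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
        return self.block(picked).mean(dim=1)              # (B, D) global summary

x = torch.randn(2, 49, 128)
print(SelectedTokenInteraction()(x).shape)  # torch.Size([2, 128])
```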
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- Self-supervised Motion Learning from Static Images [36.85209332144106]
Motion from Static Images (MoSI) learns to encode motion information.
MoSI can discover regions with large motion even without fine-tuning on the downstream datasets.
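A minimal sketch of how pseudo-motion can be synthesized from a single static image, assuming a sliding-crop scheme where the sliding direction serves as a free pretext label (the helper and its parameters are illustrative, not MoSI's exact recipe):

```python
import torch

def pseudo_motion_clip(image: torch.Tensor, direction: str,
                       num_frames: int = 4, step: int = 8,
                       crop: int = 64) -> torch.Tensor:
    """Sketch: turn one static image (C, H, W) into a (T, C, crop, crop)
    clip by sliding a crop window; the direction label is free supervision."""
    dx, dy = {"right": (step, 0), "left": (-step, 0),
              "down": (0, step), "up": (0, -step)}[direction]
    _, h, w = image.shape
    # Center the sliding trajectory inside the image.
    x0 = (w - crop) // 2 - dx * (num_frames - 1) // 2
    y0 = (h - crop) // 2 - dy * (num_frames - 1) // 2
    frames = [image[:, y0 + t * dy: y0 + t * dy + crop,
                       x0 + t * dx: x0 + t * dx + crop]
              for t in range(num_frames)]
    return torch.stack(frames)

img = torch.rand(3, 128, 128)
clip = pseudo_motion_clip(img, "right")
print(clip.shape)  # torch.Size([4, 3, 64, 64])
```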
arXiv Detail & Related papers (2021-04-01T03:55:50Z)
- Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, while still suffering from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
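A minimal sketch of channel-wise motion gating in the spirit of CME, assuming the gate vector is derived from per-channel statistics of adjacent-frame feature differences (the module is illustrative, not the paper's implementation):

```python
import torch
import torch.nn as nn

class ChannelMotionGate(nn.Module):
    """Sketch: emphasize motion-related channels with a gate vector
    derived from adjacent-frame feature differences."""
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C, H, W)
        diff = feats[:, 1:] - feats[:, :-1]                # temporal differences
        stat = diff.abs().mean(dim=(1, 3, 4))              # (B, C) motion statistic
        gate = torch.sigmoid(self.fc(stat))                # channel-wise gate vector
        return feats * gate[:, None, :, None, None]        # re-weight every frame

x = torch.rand(2, 8, 32, 14, 14)
print(ChannelMotionGate(32)(x).shape)  # torch.Size([2, 8, 32, 14, 14])
```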
arXiv Detail & Related papers (2021-03-23T03:06:26Z)