Deep Motion Prior for Weakly-Supervised Temporal Action Localization
- URL: http://arxiv.org/abs/2108.05607v1
- Date: Thu, 12 Aug 2021 08:51:36 GMT
- Title: Deep Motion Prior for Weakly-Supervised Temporal Action Localization
- Authors: Meng Cao, Can Zhang, Long Chen, Mike Zheng Shou, Yuexian Zou
- Abstract summary: Weakly-Supervised Temporal Action localization (WSTAL) aims to localize actions in untrimmed videos with only video-level labels.
Currently, most state-of-the-art WSTAL methods follow a Multi-Instance Learning (MIL) pipeline.
We argue that existing methods have overlooked two important drawbacks: 1) inadequate use of motion information and 2) the incompatibility of prevailing cross-entropy training loss.
- Score: 35.25323276744999
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Weakly-Supervised Temporal Action Localization (WSTAL) aims to localize
actions in untrimmed videos with only video-level labels. Currently, most
state-of-the-art WSTAL methods follow a Multi-Instance Learning (MIL) pipeline:
producing snippet-level predictions first and then aggregating to the
video-level prediction. However, we argue that existing methods have overlooked
two important drawbacks: 1) inadequate use of motion information and 2) the
incompatibility of prevailing cross-entropy training loss. In this paper, we
analyze the motion cues behind the optical flow features and find that they are
complementary and informative. Inspired by this, we propose to build a context-dependent
motion prior, termed motionness. Specifically, a motion graph is introduced to
model motionness based on the local motion carrier (e.g., optical flow). In
addition, to highlight more informative video snippets, a motion-guided loss is
proposed to modulate the network training conditioned on motionness scores.
Extensive ablation studies confirm that motionness effectively models actions of
interest, and the motion-guided loss leads to more accurate results.
Moreover, our motion-guided loss is a plug-and-play loss function and can be combined
with existing WSTAL methods. Without loss of generality, based on
the standard MIL pipeline, our method achieves new state-of-the-art performance
on three challenging benchmarks, including THUMOS'14, ActivityNet v1.2 and
v1.3.
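To make the two ideas above concrete, here is a minimal PyTorch sketch of how a motion graph could turn optical-flow snippet features into context-dependent motionness scores, and how those scores could modulate a standard top-k MIL training objective. The graph construction, the exact modulation form, and all names (motionness_scores, motion_guided_mil_loss, k_ratio) are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of a motionness prior + motion-guided MIL loss, based only on the
# abstract's description. All design details below are assumptions.
import torch
import torch.nn.functional as F


def motionness_scores(flow_feats: torch.Tensor) -> torch.Tensor:
    """Context-dependent motion prior ("motionness") from a snippet-level graph.

    flow_feats: (T, D) optical-flow features for the T snippets of one video.
    returns:    (T,)   motionness score per snippet in (0, 1).
    """
    # Build a fully connected motion graph: edge weights are row-normalized
    # cosine similarities between flow features of snippet pairs.
    feats = F.normalize(flow_feats, dim=-1)           # (T, D)
    adj = torch.softmax(feats @ feats.t(), dim=-1)    # (T, T)
    # Propagate a local flow-magnitude proxy over the graph so each score
    # depends on its temporal context, not only on the local motion carrier.
    local_motion = flow_feats.norm(dim=-1)            # (T,)
    motionness = adj @ local_motion                   # (T,)
    return torch.sigmoid(motionness - motionness.mean())


def motion_guided_mil_loss(snippet_logits, motionness, video_label, k_ratio=0.125):
    """Top-k MIL cross-entropy whose snippet contributions are re-weighted by
    motionness, so more informative (high-motion) snippets dominate training.

    snippet_logits: (T, C) class logits per snippet.
    motionness:     (T,)   scores from motionness_scores().
    video_label:    (C,)   multi-hot video-level label.
    """
    weighted = snippet_logits * motionness.unsqueeze(-1)   # (T, C)
    k = max(1, int(weighted.shape[0] * k_ratio))
    video_logits = weighted.topk(k, dim=0).values.mean(dim=0)  # (C,)
    return F.binary_cross_entropy_with_logits(video_logits, video_label)


# Usage with random tensors: T snippets, C classes, D-dim flow features.
T, C, D = 64, 20, 1024
flow_feats = torch.randn(T, D)
snippet_logits = torch.randn(T, C)
video_label = torch.zeros(C)
video_label[3] = 1.0
loss = motion_guided_mil_loss(snippet_logits, motionness_scores(flow_feats), video_label)
```

Because the modulation only re-weights snippet contributions before aggregation, a loss of this form can be dropped into other MIL-style WSTAL pipelines, which is consistent with the plug-and-play claim in the abstract.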
Related papers
- Generalizable Implicit Motion Modeling for Video Frame Interpolation [51.966062283735596]
Motion is critical in flow-based Video Frame Interpolation (VFI).
Generalizable Implicit Motion Modeling (GIMM) is a novel and effective approach to motion modeling for VFI.
Our GIMM can be smoothly integrated with existing flow-based VFI works without further modifications.
arXiv Detail & Related papers (2024-07-11T17:13:15Z)
- MotionTrack: Learning Motion Predictor for Multiple Object Tracking [68.68339102749358]
We introduce a novel motion-based tracker, MotionTrack, centered around a learnable motion predictor.
Our experimental results demonstrate that MotionTrack yields state-of-the-art performance on datasets such as DanceTrack and SportsMOT.
arXiv Detail & Related papers (2023-06-05T04:24:11Z)
- Improving Unsupervised Video Object Segmentation with Motion-Appearance Synergy [52.03068246508119]
We present IMAS, a method that segments the primary objects in videos without manual annotation in training or inference.
The name IMAS stands for Improved UVOS with Motion-Appearance Synergy.
We demonstrate its effectiveness in tuning critical hyperparameters that were previously tuned with human annotation or hand-crafted, hyperparameter-specific metrics.
arXiv Detail & Related papers (2022-12-17T06:47:30Z)
- Treating Motion as Option to Reduce Motion Dependency in Unsupervised Video Object Segmentation [5.231219025536678]
Unsupervised video object segmentation (VOS) aims to detect the most salient object in a video sequence at the pixel level.
Most state-of-the-art methods leverage motion cues obtained from optical flow maps in addition to appearance cues to exploit the property that salient objects usually have distinctive movements compared to the background.
arXiv Detail & Related papers (2022-09-04T18:05:52Z)
- Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, while still suffering from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
arXiv Detail & Related papers (2021-03-23T03:06:26Z)
- Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field.
It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations.
Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
arXiv Detail & Related papers (2021-01-11T04:20:30Z)
- Motion Guided 3D Pose Estimation from Videos [81.14443206968444]
We propose a new loss function, called motion loss, for the problem of monocular 3D human pose estimation from 2D poses.
In computing motion loss, a simple yet effective representation for keypoint motion, called pairwise motion encoding, is introduced.
We design a new graph convolutional network architecture, U-shaped GCN (UGCN), which captures both short-term and long-term motion information.
arXiv Detail & Related papers (2020-04-29T06:59:30Z)