PAN: Towards Fast Action Recognition via Learning Persistence of
Appearance
- URL: http://arxiv.org/abs/2008.03462v1
- Date: Sat, 8 Aug 2020 07:09:54 GMT
- Title: PAN: Towards Fast Action Recognition via Learning Persistence of
Appearance
- Authors: Can Zhang, Yuexian Zou, Guang Chen, Lei Gan
- Abstract summary: Most state-of-the-art methods heavily rely on dense optical flow as motion representation.
In this paper, we shed light on fast action recognition by lifting the reliance on optical flow.
We design a novel motion cue called Persistence of Appearance (PA).
In contrast to optical flow, our PA focuses more on distilling the motion information at boundaries.
- Score: 60.75488333935592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficiently modeling dynamic motion information in videos is crucial for the
action recognition task. Most state-of-the-art methods heavily rely on dense
optical flow as motion representation. Although combining optical flow with RGB
frames as input can achieve excellent recognition performance, the optical flow
extraction is very time-consuming. This undoubtedly works against
real-time action recognition. In this paper, we shed light on fast action
recognition by lifting the reliance on optical flow. Our motivation lies in the
observation that small displacements of motion boundaries are the most critical
ingredients for distinguishing actions, so we design a novel motion cue called
Persistence of Appearance (PA). In contrast to optical flow, our PA focuses
more on distilling the motion information at boundaries. Also, it is more
efficient by only accumulating pixel-wise differences in feature space, instead
of using exhaustive patch-wise search of all the possible motion vectors. Our
PA is over 1000x faster (8196fps vs. 8fps) than conventional optical flow in
terms of motion modeling speed. To further aggregate the short-term dynamics in
PA to long-term dynamics, we also devise a global temporal fusion strategy
called Various-timescale Aggregation Pooling (VAP) that can adaptively model
long-range temporal relationships across various timescales. We finally
incorporate the proposed PA and VAP to form a unified framework called
Persistent Appearance Network (PAN) with strong temporal modeling ability.
Extensive experiments on six challenging action recognition benchmarks verify
that our PAN outperforms recent state-of-the-art methods at low FLOPs. Codes
and models are available at: https://github.com/zhang-can/PAN-PyTorch.
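
The PA cue described above is simple enough to sketch. The following PyTorch-style snippet illustrates the idea of lifting each frame into a low-level feature space and accumulating pixel-wise differences between adjacent frames, with no patch-wise search over candidate motion vectors. The single shared 3x3 convolution, channel count, and L2 accumulation are illustrative assumptions for this sketch, not the released PAN implementation (see the repository linked above for the authors' code).

```python
import torch
import torch.nn as nn


class PersistenceOfAppearance(nn.Module):
    """PA-style motion cue: pixel-wise differences of low-level features
    between adjacent frames, collapsed into one motion map per frame pair."""

    def __init__(self, in_channels: int = 3, feat_channels: int = 8):
        super().__init__()
        # A single shared low-level feature extractor applied to every frame
        # (an assumption for this sketch; PAN lifts RGB frames into a learned
        # feature space before differencing).
        self.lowlevel = nn.Conv2d(in_channels, feat_channels, kernel_size=3, padding=1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W) -> PA maps: (B, T-1, 1, H, W)
        b, t, c, h, w = frames.shape
        feats = self.lowlevel(frames.reshape(b * t, c, h, w)).reshape(b, t, -1, h, w)
        # Accumulate pixel-wise differences over channels (L2 norm); no
        # patch-wise search over possible motion vectors is needed.
        diff = feats[:, 1:] - feats[:, :-1]
        return torch.sqrt((diff ** 2).sum(dim=2, keepdim=True) + 1e-8)


if __name__ == "__main__":
    clip = torch.randn(2, 8, 3, 224, 224)   # two 8-frame RGB clips
    pa = PersistenceOfAppearance()(clip)
    print(pa.shape)                          # torch.Size([2, 7, 1, 224, 224])
```

Because the motion map comes from a single pass of elementwise operations rather than an iterative flow solver, this style of cue is what gives PA its large speed advantage over optical flow.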
Related papers
- PASTA: Towards Flexible and Efficient HDR Imaging Via Progressively Aggregated Spatio-Temporal Alignment [91.38256332633544]
PASTA is a Progressively Aggregated Spatio-Temporal Alignment framework for HDR deghosting.
Our approach achieves effectiveness and efficiency by harnessing hierarchical representation during feature disentanglement.
Experimental results showcase PASTA's superiority over current SOTA methods in both visual quality and performance metrics.
arXiv Detail & Related papers (2024-03-15T15:05:29Z) - Flow Dynamics Correction for Action Recognition [43.95003560364798]
We show that existing action recognition models that rely on optical flow can be boosted by our corrected optical flow.
We integrate our corrected flow dynamics into popular models through a simple step that selects only the best-performing optical flow features.
arXiv Detail & Related papers (2023-10-16T04:49:06Z) - Implicit Temporal Modeling with Learnable Alignment for Video
Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z) - MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot
Action Recognition [50.345327516891615]
We develop a Motion-augmented Long-short Contrastive Learning (MoLo) method that contains two crucial components: a long-short contrastive objective and a motion autodecoder.
MoLo can simultaneously learn long-range temporal context and motion cues for comprehensive few-shot matching.
arXiv Detail & Related papers (2023-04-03T13:09:39Z) - StreamYOLO: Real-time Object Detection for Streaming Perception [84.2559631820007]
We endow the models with the capacity of predicting the future, significantly improving the results for streaming perception.
We consider driving scenes at multiple velocities and propose Velocity-awared streaming AP (VsAP) to jointly evaluate the accuracy.
Our simple method achieves the state-of-the-art performance on Argoverse-HD dataset and improves the sAP and VsAP by 4.7% and 8.2% respectively.
arXiv Detail & Related papers (2022-07-21T12:03:02Z) - Long-Short Temporal Modeling for Efficient Action Recognition [32.159784061961886]
We propose a new two-stream action recognition network, termed MENet, consisting of a Motion Enhancement (ME) module and a Video-level Aggregation (VLA) module.
For short-term motions, we design an efficient ME module to enhance the short-term motions by mingling the motion saliency among neighboring segments.
As for long-term aggregations, VLA is adopted at the top of the appearance branch to integrate the long-term dependencies across all segments.
arXiv Detail & Related papers (2021-06-30T02:54:13Z) - TSI: Temporal Saliency Integration for Video Action Recognition [32.18535820790586]
We propose a Temporal Saliency Integration (TSI) block, which mainly contains a Salient Motion Excitation (SME) module and a Cross-scale Temporal Integration (CTI) module.
SME aims to highlight the motion-sensitive area through local-global motion modeling.
CTI is designed to perform multi-scale temporal modeling through a group of separate 1D convolutions.
arXiv Detail & Related papers (2021-06-02T11:43:49Z) - Unsupervised Motion Representation Enhanced Network for Action
Recognition [4.42249337449125]
Motion representation between consecutive frames has proven to greatly benefit video understanding.
However, the TV-L1 method, an effective optical flow solver, is time-consuming, and caching the extracted optical flow is expensive in storage.
We propose UF-TSN, a novel end-to-end action recognition approach enhanced with an embedded lightweight unsupervised optical flow estimator.
arXiv Detail & Related papers (2021-03-05T04:14:32Z) - Learning Self-Similarity in Space and Time as Generalized Motion for
Action Recognition [42.175450800733785]
We propose a rich motion representation based on spatio-temporal self-similarity (STSS).
We leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it.
The proposed neural block, dubbed SELFY, can be easily inserted into neural architectures and trained end-to-end without additional supervision.
arXiv Detail & Related papers (2021-02-14T07:32:55Z) - FLAVR: Flow-Agnostic Video Representations for Fast Frame Interpolation [97.99012124785177]
FLAVR is a flexible and efficient architecture that uses 3D space-time convolutions to enable end-to-end learning and inference for video frame interpolation.
We demonstrate that FLAVR can serve as a useful self-supervised pretext task for action recognition, optical flow estimation, and motion magnification.
arXiv Detail & Related papers (2020-12-15T18:59:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.