Perceptron Synthesis Network: Rethinking the Action Scale Variances in
Videos
- URL: http://arxiv.org/abs/2007.11460v3
- Date: Tue, 19 Apr 2022 13:32:32 GMT
- Title: Perceptron Synthesis Network: Rethinking the Action Scale Variances in
Videos
- Authors: Yuan Tian, Guangtao Zhai, Zhiyong Gao
- Abstract summary: Video action recognition has been partially addressed by CNNs that stack fixed-size 3D kernels.
We propose to learn optimal-scale kernels from the data.
An \textit{action perceptron synthesizer} is proposed to generate the kernels from a bag of fixed-size kernels.
- Score: 48.57686258913474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video action recognition has been partially addressed by CNNs that stack
fixed-size 3D kernels. However, these methods may under-perform, as they capture
only rigid spatial-temporal patterns at a single scale while
neglecting the scale variances across different action primitives. To overcome
this limitation, we propose to learn the optimal-scale kernels from the data.
More specifically, an \textit{action perceptron synthesizer} is proposed to
generate the kernels from a bag of fixed-size kernels that interact through
dense routing paths. To guarantee the interaction richness and the information
capacity of the paths, we design the novel \textit{optimized feature fusion
layer}. This layer establishes a principled universal paradigm that suffices to
cover most current feature fusion techniques (e.g., channel shuffling and
channel dropout) for the first time. By inserting the \textit{synthesizer},
our method can easily adapt the traditional 2D CNNs to the video understanding
tasks such as action recognition with marginal additional computation cost. The
proposed method is thoroughly evaluated on several challenging datasets
(i.e., Something-Something, Kinetics, and Diving48) that demand strong
temporal reasoning or appearance discrimination, achieving new state-of-the-art
results. In particular, our low-resolution model outperforms recent strong
baseline methods, i.e., TSM and GST, with less than 30\% of their computation
cost.
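To make the kernel-synthesis idea concrete, here is a minimal, hypothetical PyTorch sketch of a layer that builds a 3D convolution kernel as an input-conditioned mixture over a bag of fixed-size kernels. All names and the batch-averaged routing are illustrative assumptions, not the authors' implementation; the dense routing paths and the optimized feature fusion layer are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelSynthesizer(nn.Module):
    """Toy sketch: synthesize one 3D conv kernel as a learned,
    input-conditioned mixture over a bag of fixed-size kernels.
    (Illustrative only; not the paper's implementation.)"""
    def __init__(self, in_ch, out_ch, bag_size=4, k=3):
        super().__init__()
        # Bag of fixed-size kernels: (bag, out_ch, in_ch, kT, kH, kW)
        self.bag = nn.Parameter(torch.randn(bag_size, out_ch, in_ch, k, k, k) * 0.01)
        # Routing head: global context -> mixture weights over the bag
        self.router = nn.Linear(in_ch, bag_size)

    def forward(self, x):  # x: (B, C, T, H, W)
        ctx = x.mean(dim=(2, 3, 4))            # (B, C) global context
        w = self.router(ctx).softmax(dim=-1)   # (B, bag) routing weights
        # Batch-averaged routing to get one shared kernel (a toy choice)
        kernel = torch.einsum('n,noithw->oithw', w.mean(0), self.bag)
        return F.conv3d(x, kernel, padding=1)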
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
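An even distribution of features across clusters is typically enforced with Sinkhorn-Knopp normalization; below is a generic sketch of that step (in the style of SwAV-like clustering), not the SIGMA reference code.

```python
import torch

def sinkhorn(scores, n_iters=3, eps=0.05):
    """Sinkhorn-Knopp normalization: turn feature-to-cluster similarity
    scores into a soft assignment whose clusters each receive an even
    share of the features. scores: (N, K) for N tube features and
    K cluster centers. Illustrative sketch only."""
    Q = torch.exp(scores / eps)  # (N, K) positive assignment matrix
    Q /= Q.sum()
    N, K = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=0, keepdim=True); Q /= K   # even cluster marginals
        Q /= Q.sum(dim=1, keepdim=True); Q /= N   # one unit per feature
    return Q * N  # rows sum to 1: soft assignment per feature
```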
- Leveraging Neural Radiance Field in Descriptor Synthesis for Keypoints Scene Coordinate Regression [1.2974519529978974]
This paper introduces a pipeline for keypoint descriptor synthesis using Neural Radiance Fields (NeRF).
By generating novel poses and feeding them into a trained NeRF model to create new views, the approach enhances KSCR's capabilities in data-scarce environments.
The proposed system can improve localization accuracy by up to 50% while requiring only a fraction of the time for data synthesis.
arXiv Detail & Related papers (2024-03-15T13:40:37Z)
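A hypothetical helper for the pose-synthesis step: jitter a known camera-to-world pose to obtain novel views to render through the trained NeRF. The function name and noise parameters are assumptions for illustration, not the paper's pipeline.

```python
import numpy as np

def perturb_pose(c2w, trans_std=0.05, rot_std_deg=2.0, rng=None):
    """Sample a novel camera pose near a training pose (4x4 camera-to-world).
    Rendering such poses through a trained NeRF yields extra views for
    descriptor training. Hypothetical helper."""
    rng = rng or np.random.default_rng()
    # Small random rotation about a random axis (Rodrigues' formula)
    axis = rng.normal(size=3); axis /= np.linalg.norm(axis)
    ang = np.deg2rad(rng.normal(0.0, rot_std_deg))
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    R = np.eye(3) + np.sin(ang) * K + (1 - np.cos(ang)) * (K @ K)
    out = c2w.copy()
    out[:3, :3] = R @ c2w[:3, :3]
    out[:3, 3] += rng.normal(0.0, trans_std, size=3)  # jitter position
    return out
```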
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256 \times 256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
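For context, a generic tri-plane lookup for video coordinates might look like the sketch below; the plane names and the summation rule are common conventions for tri-plane representations, not RAVEN's exact parameterization.

```python
import torch
import torch.nn.functional as F

def triplane_features(planes, coords):
    """Query a (video) tri-plane representation: project each (x, y, t)
    point onto the xy, xt and yt feature planes, bilinearly sample, and
    sum the results. Sketch of the general tri-plane idea only.
    planes: dict of (1, C, H, W) tensors; coords: (N, 3) in [-1, 1]."""
    x, y, t = coords[:, 0], coords[:, 1], coords[:, 2]
    feats = 0
    for name, (u, v) in {'xy': (x, y), 'xt': (x, t), 'yt': (y, t)}.items():
        grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)  # (1, N, 1, 2)
        sampled = F.grid_sample(planes[name], grid, align_corners=True)
        feats = feats + sampled.view(planes[name].shape[1], -1).t()  # (N, C)
    return feats
```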
- Heterogenous Memory Augmented Neural Networks [84.29338268789684]
We introduce a novel heterogeneous memory augmentation approach for neural networks.
By introducing learnable memory tokens with an attention mechanism, we can effectively boost performance without huge computational overhead.
We evaluate our approach on various image- and graph-based tasks under both in-distribution (ID) and out-of-distribution (OOD) conditions.
arXiv Detail & Related papers (2023-10-17T01:05:28Z)
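A minimal sketch of the learnable-memory-token idea, assuming a standard cross-attention read from a trainable token bank; this is illustrative and not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MemoryAugmentedBlock(nn.Module):
    """Sketch of memory augmentation via learnable tokens: input tokens
    cross-attend to a small bank of trainable memory tokens, adding
    capacity at low cost. Illustrative only."""
    def __init__(self, dim=64, n_mem=16, n_heads=4):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(n_mem, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x):  # x: (B, N, dim)
        mem = self.memory.unsqueeze(0).expand(x.size(0), -1, -1)
        out, _ = self.attn(query=x, key=mem, value=mem)  # read from memory
        return x + out  # residual update
```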
- Multi-encoder Network for Parameter Reduction of a Kernel-based Interpolation Architecture [10.08097582267397]
Convolutional neural networks (CNNs) have been at the forefront of recent advances in video frame interpolation.
Many of these networks require a large number of parameters, and more parameters mean a heavier computational burden.
This paper presents a method for parameter reduction for a popular flow-less kernel-based network.
arXiv Detail & Related papers (2022-05-13T16:02:55Z)
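For background, the core operator of a flow-less kernel-based interpolation network applies a predicted per-pixel kernel to local patches of an input frame; the sketch below shows that operator only (the cited paper's contribution is shrinking the encoder that predicts the kernels, which is not shown).

```python
import torch
import torch.nn.functional as F

def apply_predicted_kernels(frame, kernels):
    """Kernel-based interpolation operator: each output pixel is its
    predicted k*k kernel applied to the local patch of the input frame.
    frame:   (B, C, H, W)
    kernels: (B, k*k, H, W), e.g. softmaxed over the k*k dimension.
    Illustrative sketch only."""
    b, c, h, w = frame.shape
    k = int(kernels.shape[1] ** 0.5)
    patches = F.unfold(frame, k, padding=k // 2)        # (B, C*k*k, H*W)
    patches = patches.view(b, c, k * k, h * w)
    weights = kernels.view(b, 1, k * k, h * w)
    return (patches * weights).sum(dim=2).view(b, c, h, w)
```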
- Gate-Shift-Fuse for Video Action Recognition [43.8525418821458]
Gate-Shift-Fuse (GSF) is a novel spatio-temporal feature extraction module which controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data-dependent manner.
GSF can be inserted into existing 2D CNNs to convert them into efficient and high-performing spatio-temporal feature extractors, with negligible parameter and compute overhead.
We perform an extensive analysis of GSF using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks.
arXiv Detail & Related papers (2022-03-16T19:19:04Z)
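A rough sketch of the gating-plus-temporal-routing idea, assuming a gated temporal channel shift; this is a simplification for illustration, not the official GSF module.

```python
import torch
import torch.nn as nn

class GatedTemporalShift(nn.Module):
    """Sketch in the spirit of GSF: shift a fraction of channels
    forward/backward in time and blend shifted and unshifted features
    with a learned, data-dependent gate. Illustrative only."""
    def __init__(self, channels, shift_frac=0.25):
        super().__init__()
        self.n_shift = int(channels * shift_frac)
        self.gate = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x):  # x: (B, C, T, H, W)
        shifted = x.clone()
        n = self.n_shift
        shifted[:, :n, 1:] = x[:, :n, :-1]        # shift forward in time
        shifted[:, n:2*n, :-1] = x[:, n:2*n, 1:]  # shift backward in time
        g = torch.sigmoid(self.gate(x))           # data-dependent gate
        return g * shifted + (1 - g) * x          # adaptive routing
```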
- Scene Synthesis via Uncertainty-Driven Attribute Synchronization [52.31834816911887]
This paper introduces a novel neural scene synthesis approach that can capture diverse feature patterns of 3D scenes.
Our method combines the strengths of both neural network-based and conventional scene synthesis approaches.
arXiv Detail & Related papers (2021-08-30T19:45:07Z)
- Leveraging Third-Order Features in Skeleton-Based Action Recognition [26.349722372701482]
Skeleton sequences are light-weight and compact, and thus ideal candidates for action recognition on edge devices.
Recent action recognition methods extract features from 3D joint coordinates as spatial-temporal cues, using these representations in a graph neural network for feature fusion.
We propose fusing third-order features in the form of angles into modern architectures, to robustly capture the relationships between joints and body parts.
arXiv Detail & Related papers (2021-05-04T15:23:29Z)
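A minimal example of such a third-order feature: the angle at a joint between its two incident bones, computed from 3D joint coordinates (the function name and skeleton conventions are assumptions for illustration).

```python
import numpy as np

def joint_angle(parent, joint, child, eps=1e-8):
    """Angle at `joint` between the bones joint->parent and joint->child,
    given 3D coordinates as length-3 arrays. Minimal sketch of the kind
    of angular encoding the paper fuses into modern architectures."""
    u = parent - joint
    v = child - joint
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)
    return np.arccos(np.clip(cos, -1.0, 1.0))  # radians
```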
- DyCo3D: Robust Instance Segmentation of 3D Point Clouds through Dynamic Convolution [136.7261709896713]
We propose a data-driven approach that generates the appropriate convolution kernels to apply in response to the nature of the instances.
The proposed method achieves promising results on both ScanNetV2 and S3DIS.
It also improves inference speed by more than 25% over the current state-of-the-art.
arXiv Detail & Related papers (2020-11-26T14:56:57Z)
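To illustrate the dynamic-convolution mechanism in general terms: a controller predicts per-instance convolution weights that are then applied to shared features. The sketch below uses a 1x1 (linear) dynamic filter and is not DyCo3D's exact parameterization.

```python
import torch
import torch.nn as nn

class DynamicConvHead(nn.Module):
    """Sketch of instance-conditioned dynamic convolution: a small head
    predicts per-instance 1x1 conv weights, which are applied to shared
    point features to produce an instance mask. Illustrative only."""
    def __init__(self, feat_dim=32):
        super().__init__()
        # Predict (feat_dim weights + 1 bias) for a 1-channel output
        self.controller = nn.Linear(feat_dim, feat_dim + 1)

    def forward(self, inst_embed, features):
        # inst_embed: (feat_dim,) per-instance descriptor
        # features:   (feat_dim, N) shared features of N points
        params = self.controller(inst_embed)
        w, b = params[:-1], params[-1]
        logits = w @ features + b     # (N,) per-point instance logits
        return torch.sigmoid(logits)  # soft instance mask
```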