Frame Flexible Network
- URL: http://arxiv.org/abs/2303.14817v1
- Date: Sun, 26 Mar 2023 20:51:35 GMT
- Title: Frame Flexible Network
- Authors: Yitian Zhang, Yue Bai, Chang Liu, Huan Wang, Sheng Li, Yun Fu
- Abstract summary: Existing video recognition algorithms conduct separate training pipelines for inputs with different frame numbers.
Evaluating the model at frame numbers not used in training causes a significant performance drop.
We propose a general framework, named Frame Flexible Network (FFN), which enables the model to be evaluated at different frame numbers to adjust its computation.
- Score: 52.623337134518835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing video recognition algorithms always conduct different training
pipelines for inputs with different frame numbers, which requires repetitive
training operations and multiplied storage costs. If we evaluate the model
at frame numbers not used in training, we observe that performance drops
significantly (see Fig.1), which we summarize as the Temporal Frequency
Deviation phenomenon. To fix this issue, we propose a general framework, named
Frame Flexible Network (FFN), which not only enables the model to be evaluated
at different frame numbers to adjust its computation, but also significantly
reduces the memory cost of storing multiple models. Concretely, FFN integrates
several sets of training sequences, involves Multi-Frequency Alignment (MFAL)
to learn temporal-frequency-invariant representations, and leverages
Multi-Frequency Adaptation (MFAD) to further strengthen the representation
abilities. Comprehensive empirical validation using various architectures and
popular benchmarks solidly demonstrates the effectiveness and generalization of
FFN (e.g., 7.08/5.15/2.17% performance gains at Frame 4/8/16 on the
Something-Something V1 dataset over Uniformer). Code is available at
https://github.com/BeSpontaneous/FFN.
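The abstract names MFAL and MFAD without detailing them, so the following is only a minimal sketch of the shared-weight, multi-frequency training idea, assuming uniform frame sampling and a KL-based alignment toward the highest-frame-count predictions; `ffn_style_step`, `align_weight`, and the teacher choice are illustrative assumptions, not the authors' implementation.
```python
# A minimal sketch of multi-frequency training with shared weights,
# NOT the paper's code: clip sampling, the alignment weight, and the
# KL-based loss are assumptions made for illustration only.
import torch
import torch.nn.functional as F

def ffn_style_step(model, video, label, frame_counts=(4, 8, 16), align_weight=1.0):
    """One training step over several temporal frequencies of one video.

    model: maps a (1, T, C, H, W) clip to (1, num_classes) logits.
    video: (T_full, C, H, W) tensor of decoded frames.
    label: (1,) long tensor holding the class index.
    """
    logits = []
    loss = video.new_zeros(())
    for t in frame_counts:
        # Uniformly sample t frames (a stand-in for the paper's sequence sets).
        idx = torch.linspace(0, video.size(0) - 1, t).long()
        out = model(video[idx].unsqueeze(0))       # (1, num_classes)
        loss = loss + F.cross_entropy(out, label)  # supervise every frequency
        logits.append(out)
    # Alignment in the spirit of MFAL: pull low-frame predictions toward the
    # highest-frame ("teacher") predictions so representations stay
    # frequency invariant across frame counts.
    teacher = logits[-1].detach().log_softmax(dim=-1)
    for out in logits[:-1]:
        loss = loss + align_weight * F.kl_div(
            out.log_softmax(dim=-1), teacher, log_target=True, reduction="batchmean"
        )
    return loss
```
At inference, the same weights can then be queried at whichever of the trained frame counts fits the compute budget.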
Related papers
- TIDE: Temporally Incremental Disparity Estimation via Pattern Flow in Structured Light System [17.53719804060679]
TIDE-Net is a learning-based technique for disparity computation in mono-camera structured light systems.
We exploit the deformation of projected patterns (named pattern flow) on captured image sequences to model temporal information.
For each incoming frame, our model fuses the correlation volume (from the current frame) with the disparity (from the former frame) warped by pattern flow; see the sketch after this entry.
arXiv Detail & Related papers (2023-10-13T07:55:33Z)
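A hedged sketch of the warp-then-fuse step described above; `warp_by_flow` and the pixel-displacement flow convention are assumptions for illustration, not TIDE-Net's actual code:
```python
# Warp the previous frame's disparity with the estimated pattern flow
# before fusing it with the current frame's correlation volume.
import torch
import torch.nn.functional as F

def warp_by_flow(disparity, flow):
    """disparity: (B, 1, H, W); flow: (B, 2, H, W) in pixels."""
    b, _, h, w = disparity.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=flow.device, dtype=flow.dtype),
        torch.arange(w, device=flow.device, dtype=flow.dtype),
        indexing="ij",
    )
    # Target sampling locations = base grid + flow, normalized to [-1, 1].
    x = (xs + flow[:, 0]) / (w - 1) * 2 - 1
    y = (ys + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack((x, y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(disparity, grid, align_corners=True)
```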
- Dynamic Frame Interpolation in Wavelet Domain [57.25341639095404]
Video frame interpolation is an important low-level computer vision task that can increase the frame rate for a more fluent visual experience.
Existing methods have achieved great success by employing advanced motion models and synthesis networks.
WaveletVFI can reduce computation by up to 40% while maintaining similar accuracy, making it more efficient than other state-of-the-art methods.
arXiv Detail & Related papers (2023-09-07T06:41:15Z)
- UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery Localization [16.963092523737593]
We propose a novel framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation.
Our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd.
arXiv Detail & Related papers (2023-08-28T08:20:30Z)
- Towards Frame Rate Agnostic Multi-Object Tracking [76.82407173177138]
We propose a Frame Rate Agnostic MOT framework with a Periodic training Scheme (FAPS) to tackle the FraMOT problem for the first time.
Specifically, we propose a Frame Rate Agnostic Association Module (FAAM) that infers and encodes the frame rate information.
FAPS reflects all post-processing steps in training via tracking pattern matching and fusion.
arXiv Detail & Related papers (2022-09-23T04:25:19Z)
- HyperTime: Implicit Neural Representation for Time Series [131.57172578210256]
Implicit neural representations (INRs) have recently emerged as a powerful tool that provides an accurate and resolution-independent encoding of data.
In this paper, we analyze the representation of time series using INRs, comparing different activation functions in terms of reconstruction accuracy and training convergence speed.
We propose a hypernetwork architecture that leverages INRs to learn a compressed latent representation of an entire time series dataset (a minimal INR sketch follows this entry).
arXiv Detail & Related papers (2022-08-11T14:05:51Z)
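As a hedged illustration of the INR encoding discussed above, here is a minimal sine-activated MLP fit to a single series; the width, `omega`, and training settings are assumptions, not HyperTime's architecture:
```python
# A tiny implicit neural representation for a 1-D time series: the
# network maps a timestamp to a value, giving a resolution-independent
# encoding that can be queried at arbitrary times.
import torch
import torch.nn as nn

class SineINR(nn.Module):
    def __init__(self, hidden=64, omega=30.0):
        super().__init__()
        self.l1 = nn.Linear(1, hidden)
        self.l2 = nn.Linear(hidden, hidden)
        self.l3 = nn.Linear(hidden, 1)
        self.omega = omega

    def forward(self, t):                  # t: (N, 1) timestamps in [0, 1]
        h = torch.sin(self.omega * self.l1(t))
        h = torch.sin(self.omega * self.l2(h))
        return self.l3(h)                  # (N, 1) reconstructed values

# Fit the INR to one toy series by plain reconstruction loss.
series = torch.sin(torch.linspace(0, 6.28, 200)).unsqueeze(1)
t = torch.linspace(0, 1, 200).unsqueeze(1)
inr = SineINR()
opt = torch.optim.Adam(inr.parameters(), lr=1e-4)
for _ in range(1000):
    opt.zero_grad()
    loss = ((inr(t) - series) ** 2).mean()
    loss.backward()
    opt.step()
```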
- Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving temporal consistency without introducing inference overhead.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods (a generic distillation sketch follows this entry).
arXiv Detail & Related papers (2022-02-24T23:51:36Z)
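A generic soft-target distillation loss of the kind such methods use; the temperature and `t**2` scaling follow standard KD practice and are assumptions here, not the paper's exact online/offline scheme:
```python
# Soft-target knowledge distillation: a compact student matches the
# softened predictions of a large teacher that is not updated.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Both inputs: (B, C, H, W) segmentation logits."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=1)
    p_teacher = F.softmax(teacher_logits.detach() / t, dim=1)
    # The t**2 factor rescales gradients to be comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t * t
```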
- Few Shot Activity Recognition Using Variational Inference [9.371378627575883]
We propose a novel variational-inference-based architectural framework (HF-AR) for few-shot activity recognition.
Our framework leverages volume-preserving Householder Flow to learn a flexible posterior distribution of the novel classes.
This results in better performance compared to state-of-the-art few-shot approaches for human activity recognition.
arXiv Detail & Related papers (2021-08-20T03:57:58Z)
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform efficient semantic video segmentation in a per-frame fashion during inference.
We employ compact models for real-time execution, and design new knowledge distillation methods to narrow the performance gap between compact and large models.
arXiv Detail & Related papers (2020-02-26T12:24:32Z)