Dynamic Temporal Filtering in Video Models
- URL: http://arxiv.org/abs/2211.08252v1
- Date: Tue, 15 Nov 2022 15:59:28 GMT
- Title: Dynamic Temporal Filtering in Video Models
- Authors: Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Chong-Wah Ngo, Tao Mei
- Abstract summary: We present a new recipe for temporal feature learning, namely the Dynamic Temporal Filter (DTF).
DTF learns a specialized frequency filter for every spatial location to model its long-range temporal dynamics.
The DTF block can be plugged into both ConvNets and Transformers, yielding DTF-Net and DTF-Transformer.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video temporal dynamics are conventionally modeled with a 3D spatio-temporal kernel or its factorized version comprising a 2D spatial kernel and a 1D temporal kernel. The modeling power, nevertheless, is limited by the fixed window size and static weights of the kernel along the temporal dimension. The pre-determined kernel size severely limits the temporal receptive field, and the fixed weights treat every spatial location across frames equally, resulting in sub-optimal solutions for long-range temporal modeling in natural scenes. In this paper, we present a new recipe for temporal feature learning, namely the Dynamic Temporal Filter (DTF), which performs spatially-aware temporal modeling in the frequency domain with a large temporal receptive field. Specifically, DTF dynamically learns a specialized frequency filter for every spatial location to model its long-range temporal dynamics. Meanwhile, the temporal feature of each spatial location is transformed into a frequency spectrum via a 1D Fast Fourier Transform (FFT). The spectrum is modulated by the learnt frequency filter and then transformed back to the temporal domain with an inverse FFT. In addition, to facilitate the learning of the frequency filter in DTF, we perform frame-wise aggregation to enhance the primary temporal feature with its temporal neighbors via inter-frame correlation. The DTF block can be plugged into both ConvNets and Transformers, yielding DTF-Net and DTF-Transformer. Extensive experiments conducted on three datasets demonstrate the superiority of our proposals. More remarkably, DTF-Transformer achieves an accuracy of 83.5% on the Kinetics-400 dataset. Source code is available at https://github.com/FuchenUSTC/DTF.
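The FFT-modulate-iFFT recipe above translates into a few lines of PyTorch. The following is a minimal sketch, not the authors' implementation (see the linked repository for that); the tensor layout and the way the per-location filter is generated from temporally pooled features are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class DynamicTemporalFilterSketch(nn.Module):
    """Minimal sketch of the DTF recipe: 1D FFT along time, per-location
    spectrum modulation, inverse FFT. Not the official implementation."""

    def __init__(self, channels, num_frames):
        super().__init__()
        # rfft of a length-T signal yields T//2 + 1 frequency bins
        self.num_bins = num_frames // 2 + 1
        # Hypothetical filter generator: a real-valued gain per frequency
        # bin for every spatial location, from temporally pooled features.
        self.filter_gen = nn.Linear(channels, self.num_bins)

    def forward(self, x):
        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        # Specialized frequency filter per spatial location: (B, H, W, bins)
        filt = self.filter_gen(x.mean(dim=2).permute(0, 2, 3, 1))
        spec = torch.fft.rfft(x, dim=2)            # (B, C, T//2+1, H, W)
        # Modulate the spectrum by the learnt per-location filter
        spec = spec * filt.permute(0, 3, 1, 2).unsqueeze(1)
        return torch.fft.irfft(spec, n=t, dim=2)   # back to the temporal domain

x = torch.randn(2, 64, 8, 14, 14)                  # 8-frame clip
y = DynamicTemporalFilterSketch(64, 8)(x)          # same shape as x
```

A faithful block would also include the frame-wise aggregation step described above; the sketch keeps only the core frequency modulation.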
Related papers
- Neural Fourier Modelling: A Highly Compact Approach to Time-Series Analysis
We introduce Neural Fourier Modelling (NFM), a compact yet powerful solution for time-series analysis.
NFM is grounded in two key properties of the Fourier transform (FT): (i) the ability to model finite-length time series as functions in the Fourier domain, and (ii) the capacity for data manipulation within the Fourier domain.
NFM achieves state-of-the-art performance on a wide range of tasks, including challenging time-series scenarios with previously unseen sampling rates at test time.
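The "unseen sampling rates" claim follows from property (i): once a finite series is represented by its Fourier coefficients, it can be resynthesized on any grid. A minimal illustration of that principle via generic spectral resampling (not the NFM model itself):

```python
import torch

def fourier_resample(x, new_len):
    """Resample a real signal (last dim = time) to a new length by
    zero-padding or truncating its spectrum: the series is treated as a
    function in the Fourier domain. Generic illustration, not NFM itself."""
    old_len = x.shape[-1]
    spec = torch.fft.rfft(x)                       # old_len//2 + 1 bins
    out = spec.new_zeros(*spec.shape[:-1], new_len // 2 + 1)
    keep = min(spec.shape[-1], out.shape[-1])
    out[..., :keep] = spec[..., :keep]
    # Rescale so amplitudes survive irfft's 1/n normalization
    return torch.fft.irfft(out, n=new_len) * (new_len / old_len)

t = torch.linspace(0, 1, 64)
coarse = torch.sin(2 * torch.pi * 3 * t)           # 3-cycle tone, 64 samples
fine = fourier_resample(coarse, 256)               # same tone, 256 samples
```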
arXiv Detail & Related papers (2024-10-07T02:39:55Z)
- Dynamic Diffusion Transformer
Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs.
We propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation.
With 3% additional fine-tuning, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73x, and achieves a competitive FID score of 2.07 on ImageNet.
arXiv Detail & Related papers (2024-10-04T14:14:28Z)
- GLFNET: Global-Local (frequency) Filter Networks for efficient medical image segmentation
We propose a transformer-style architecture called Global-Local Filter Network (GLFNet) for medical image segmentation.
We replace the self-attention mechanism with a combination of global-local filter blocks to optimize model efficiency.
We test GLFNet on three benchmark datasets, achieving state-of-the-art performance on all of them while being almost twice as efficient in terms of GFLOPs.
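A "global filter" of this kind is typically a learnable element-wise multiplication in the 2D Fourier domain, replacing quadratic-cost token mixing with O(N log N) FFTs. Below is a hedged sketch of that generic idea (as popularized by GFNet); the actual GLFNet block pairs it with a local branch and may differ in detail.

```python
import torch
import torch.nn as nn

class GlobalFilter2D(nn.Module):
    """Token mixing via a learnable filter in the 2D frequency domain:
    an O(N log N) stand-in for self-attention (GFNet-style sketch)."""

    def __init__(self, channels, height, width):
        super().__init__()
        # One learnable complex weight per channel and frequency bin,
        # stored as (real, imag) pairs in the last dimension.
        self.weight = nn.Parameter(
            torch.randn(channels, height, width // 2 + 1, 2) * 0.02)

    def forward(self, x):
        # x: (B, C, H, W) feature map
        spec = torch.fft.rfft2(x, norm="ortho")            # (B, C, H, W//2+1)
        spec = spec * torch.view_as_complex(self.weight)   # global mixing
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
```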
arXiv Detail & Related papers (2024-03-01T09:35:03Z)
- EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision
EmerNeRF is a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes.
It simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping.
Our method achieves state-of-the-art performance in sensor simulation.
arXiv Detail & Related papers (2023-11-03T17:59:55Z)
- Transform Once: Efficient Operator Learning in Frequency Domain
We study deep neural networks designed to harness the structure in frequency domain for efficient learning of long-range correlations in space or time.
This work introduces a blueprint for frequency-domain learning through a single transform: transform once (T1).
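The contrast is with Fourier-style operators that transform into and out of the frequency domain at every layer; a T1-style model pays the transform cost once. A minimal sketch under that reading, using a plain MLP over rfft coefficients (the paper's actual operator parameterization is not reproduced here):

```python
import torch
import torch.nn as nn

class TransformOnceSketch(nn.Module):
    """One forward FFT, several layers acting on the spectrum, one inverse
    FFT at the very end -- an illustration of 'transform once', not T1."""

    def __init__(self, length, hidden=128, depth=3):
        super().__init__()
        bins = length // 2 + 1
        # Real and imaginary parts are stacked into 2*bins features
        dims = [2 * bins] + [hidden] * (depth - 1) + [2 * bins]
        self.layers = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(depth))
        self.length = length

    def forward(self, x):
        # x: (B, length) -- into the frequency domain, once
        spec = torch.fft.rfft(x)
        z = torch.cat([spec.real, spec.imag], dim=-1)
        for i, layer in enumerate(self.layers):
            z = layer(z)
            if i < len(self.layers) - 1:
                z = torch.relu(z)
        real, imag = z.chunk(2, dim=-1)
        # ...and back to the signal domain, once
        return torch.fft.irfft(torch.complex(real, imag), n=self.length)
```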
arXiv Detail & Related papers (2022-11-26T01:56:05Z)
- FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization
We propose a novel frequency-aware architecture, in which the domain-specific features are filtered out in the transformed frequency domain.
Experiments on three benchmarks demonstrate strong performance, outperforming state-of-the-art methods by margins of 3%, 4%, and 9%, respectively.
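A common generic way to suppress domain-specific content in the frequency domain is to keep the phase spectrum (which tends to carry structure) while normalizing the amplitude spectrum (which tends to carry style-like, domain-specific variation). The sketch below shows only that generic trick; FAMLP learns its filter rather than applying a fixed rule.

```python
import torch

def normalize_amplitude(feats, eps=1e-8):
    """Keep each feature map's phase spectrum (structure) but replace its
    amplitude spectrum (style-like, domain-specific variation) with the
    batch average. Generic trick; FAMLP learns its filter instead."""
    spec = torch.fft.rfft2(feats)                  # feats: (B, C, H, W)
    phase = spec / (spec.abs() + eps)              # unit-magnitude spectrum
    mean_amp = spec.abs().mean(dim=0, keepdim=True)
    return torch.fft.irfft2(phase * mean_amp, s=feats.shape[-2:])
```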
arXiv Detail & Related papers (2022-03-24T07:26:29Z)
- Fourier PlenOctrees for Dynamic Radiance Field Rendering in Real-time
Implicit neural representations such as Neural Radiance Field (NeRF) have focused mainly on modeling static objects captured under multi-view settings.
We present a novel Fourier PlenOctree (FPO) technique to tackle efficient neural modeling and real-time rendering of dynamic scenes captured under the free-view video (FVV) setting.
We show that the proposed method is 3000 times faster than the original NeRF and achieves over an order of magnitude acceleration over the state of the art.
arXiv Detail & Related papers (2022-02-17T11:57:01Z)
- Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification
We propose an efficient temporal modeling 3D architecture, called VoV3D, that consists of a temporal one-shot aggregation (T-OSA) module and depthwise factorized component, D(2+1)D.
Thanks to its efficient and effective temporal modeling, VoV3D-L has 6x fewer model parameters and 16x less computation, surpassing a state-of-the-art temporal modeling method on both Something-Something and Kinetics.
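The savings come from factorizing a dense 3D convolution into a depthwise 2D spatial convolution followed by a depthwise 1D temporal convolution. A minimal sketch of such a depthwise (2+1)D component (the layer ordering and normalization choices are illustrative assumptions, not VoV3D's exact D(2+1)D block):

```python
import torch.nn as nn

def depthwise_2plus1d(channels, kernel=3):
    """Depthwise spatiotemporal factorization: a (1, k, k) spatial conv
    followed by a (k, 1, 1) temporal conv, each with groups=channels,
    in place of a dense 3D kernel. Illustrative D(2+1)D-style sketch."""
    pad = kernel // 2
    return nn.Sequential(
        nn.Conv3d(channels, channels, (1, kernel, kernel),
                  padding=(0, pad, pad), groups=channels, bias=False),
        nn.BatchNorm3d(channels),
        nn.ReLU(inplace=True),
        nn.Conv3d(channels, channels, (kernel, 1, 1),
                  padding=(pad, 0, 0), groups=channels, bias=False),
        nn.BatchNorm3d(channels),
        nn.ReLU(inplace=True),
    )
```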
arXiv Detail & Related papers (2020-12-01T07:40:06Z)
- TAM: Temporal Adaptive Module for Video Recognition
The temporal adaptive module (TAM) generates video-specific temporal kernels based on its own feature map.
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
arXiv Detail & Related papers (2020-05-14T08:22:45Z)
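Unlike DTF, TAM stays in the temporal domain: it predicts a small temporal kernel from the feature map itself and applies it as a depthwise temporal convolution. A rough sketch of that mechanism follows; the generator architecture and kernel size are assumptions, and the full TAM also includes a local attention branch that this sketch omits.

```python
import torch
import torch.nn as nn

class AdaptiveTemporalKernelSketch(nn.Module):
    """Predict a video-specific temporal kernel per channel from the
    feature map itself, then apply it as a depthwise temporal convolution.
    Rough sketch of the TAM idea, not the official code."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.kernel_gen = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels * kernel_size))

    def forward(self, x):
        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        desc = x.mean(dim=(2, 3, 4))                       # clip descriptor
        kernels = self.kernel_gen(desc).view(b, c, self.k)
        kernels = torch.softmax(kernels, dim=-1)           # normalized taps
        pad = self.k // 2
        xp = nn.functional.pad(x, (0, 0, 0, 0, pad, pad))  # pad time only
        windows = xp.unfold(2, self.k, 1)                  # (B,C,T,H,W,k)
        # Weighted sum of each temporal window with the predicted kernel
        return (windows * kernels.view(b, c, 1, 1, 1, self.k)).sum(-1)
```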