Dynamic Temporal Filtering in Video Models
- URL: http://arxiv.org/abs/2211.08252v1
- Date: Tue, 15 Nov 2022 15:59:28 GMT
- Title: Dynamic Temporal Filtering in Video Models
- Authors: Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Chong-Wah Ngo, Tao Mei
- Abstract summary: We present a new recipe for temporal feature learning, namely the Dynamic Temporal Filter (DTF).
DTF learns a specialized frequency filter for every spatial location to model its long-range temporal dynamics.
The DTF block can be plugged into both ConvNets and Transformers, yielding DTF-Net and DTF-Transformer.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video temporal dynamics are conventionally modeled with a 3D spatio-temporal kernel or its factorized version comprising a 2D spatial kernel and a 1D temporal kernel. The modeling power, nevertheless, is limited by the fixed window size and static weights of the kernel along the temporal dimension. The pre-determined kernel size severely limits the temporal receptive field, and the fixed weights treat every spatial location across frames equally, resulting in sub-optimal solutions for long-range temporal modeling in natural scenes. In this paper, we present a new recipe for temporal feature learning, namely the Dynamic Temporal Filter (DTF), which performs spatially-aware temporal modeling in the frequency domain with a large temporal receptive field. Specifically, DTF dynamically learns a specialized frequency filter for every spatial location to model its long-range temporal dynamics. Meanwhile, the temporal feature of each spatial location is transformed into a frequency spectrum via a 1D Fast Fourier Transform (FFT). The spectrum is modulated by the learnt frequency filter and then transformed back to the temporal domain with an inverse FFT. In addition, to facilitate the learning of the frequency filter in DTF, we perform frame-wise aggregation to enhance the primary temporal feature with its temporal neighbors via inter-frame correlation. The DTF block can be plugged into both ConvNets and Transformers, yielding DTF-Net and DTF-Transformer. Extensive experiments conducted on three datasets demonstrate the superiority of our proposals. More remarkably, DTF-Transformer achieves an accuracy of 83.5% on the Kinetics-400 dataset. Source code is available at https://github.com/FuchenUSTC/DTF.
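The FFT-modulate-iFFT recipe above translates into a few lines of PyTorch. The following is a minimal sketch, not the authors' implementation (see the linked repository for that); the tensor layout and the way the per-location filter is generated from temporally pooled features are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class DynamicTemporalFilterSketch(nn.Module):
    """Minimal sketch of the DTF recipe: 1D FFT along time, per-location
    spectrum modulation, inverse FFT. Not the official implementation."""

    def __init__(self, channels, num_frames):
        super().__init__()
        # rfft of a length-T signal yields T//2 + 1 frequency bins
        self.num_bins = num_frames // 2 + 1
        # Hypothetical filter generator: a real-valued gain per frequency
        # bin for every spatial location, from temporally pooled features.
        self.filter_gen = nn.Linear(channels, self.num_bins)

    def forward(self, x):
        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        # Specialized frequency filter per spatial location: (B, H, W, bins)
        filt = self.filter_gen(x.mean(dim=2).permute(0, 2, 3, 1))
        spec = torch.fft.rfft(x, dim=2)            # (B, C, T//2+1, H, W)
        # Modulate the spectrum by the learnt per-location filter
        spec = spec * filt.permute(0, 3, 1, 2).unsqueeze(1)
        return torch.fft.irfft(spec, n=t, dim=2)   # back to the temporal domain

x = torch.randn(2, 64, 8, 14, 14)                  # 8-frame clip
y = DynamicTemporalFilterSketch(64, 8)(x)          # same shape as x
```

A faithful block would also include the frame-wise aggregation step described above; the sketch keeps only the core frequency modulation.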
Related papers
- Neural Fourier Modelling: A Highly Compact Approach to Time-Series Analysis
We introduce Neural Fourier Modelling (NFM), a compact yet powerful solution for time-series analysis.
NFM is grounded in two key properties of the Fourier transform (FT): (i) the ability to model finite-length time series as functions in the Fourier domain, and (ii) the capacity for data manipulation within the Fourier domain.
NFM achieves state-of-the-art performance on a wide range of tasks, including challenging time-series scenarios with previously unseen sampling rates at test time.
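The "unseen sampling rates" claim follows from property (i): once a finite series is represented by its Fourier coefficients, it can be resynthesized on any grid. A minimal illustration of that principle via generic spectral resampling (not the NFM model itself):

```python
import torch

def fourier_resample(x, new_len):
    """Resample a real signal (last dim = time) to a new length by
    zero-padding or truncating its spectrum: the series is treated as a
    function in the Fourier domain. Generic illustration, not NFM itself."""
    old_len = x.shape[-1]
    spec = torch.fft.rfft(x)                       # old_len//2 + 1 bins
    out = spec.new_zeros(*spec.shape[:-1], new_len // 2 + 1)
    keep = min(spec.shape[-1], out.shape[-1])
    out[..., :keep] = spec[..., :keep]
    # Rescale so amplitudes survive irfft's 1/n normalization
    return torch.fft.irfft(out, n=new_len) * (new_len / old_len)

t = torch.linspace(0, 1, 64)
coarse = torch.sin(2 * torch.pi * 3 * t)           # 3-cycle tone, 64 samples
fine = fourier_resample(coarse, 256)               # same tone, 256 samples
```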
arXiv Detail & Related papers (2024-10-07T02:39:55Z)
- Dynamic Diffusion Transformer
Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs.
We propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation.
With 3% additional fine-tuning, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73x, and achieves a competitive FID score of 2.07 on ImageNet.
arXiv Detail & Related papers (2024-10-04T14:14:28Z)
- GLFNET: Global-Local (frequency) Filter Networks for efficient medical image segmentation
We propose a transformer-style architecture called Global-Local Filter Network (GLFNet) for medical image segmentation.
We replace the self-attention mechanism with a combination of global-local filter blocks to optimize model efficiency.
We test GLFNet on three benchmark datasets, achieving state-of-the-art performance on all of them while being almost twice as efficient in terms of GFLOPs.
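A "global filter" of this kind is typically a learnable element-wise multiplication in the 2D Fourier domain, replacing quadratic-cost token mixing with O(N log N) FFTs. Below is a hedged sketch of that generic idea (as popularized by GFNet); the actual GLFNet block pairs it with a local branch and may differ in detail.

```python
import torch
import torch.nn as nn

class GlobalFilter2D(nn.Module):
    """Token mixing via a learnable filter in the 2D frequency domain:
    an O(N log N) stand-in for self-attention (GFNet-style sketch)."""

    def __init__(self, channels, height, width):
        super().__init__()
        # One learnable complex weight per channel and frequency bin,
        # stored as (real, imag) pairs in the last dimension.
        self.weight = nn.Parameter(
            torch.randn(channels, height, width // 2 + 1, 2) * 0.02)

    def forward(self, x):
        # x: (B, C, H, W) feature map
        spec = torch.fft.rfft2(x, norm="ortho")            # (B, C, H, W//2+1)
        spec = spec * torch.view_as_complex(self.weight)   # global mixing
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
```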
arXiv Detail & Related papers (2024-03-01T09:35:03Z)
- EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision
EmerNeRF is a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes.
It simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping.
Our method achieves state-of-the-art performance in sensor simulation.
arXiv Detail & Related papers (2023-11-03T17:59:55Z)
- Transform Once: Efficient Operator Learning in Frequency Domain
We study deep neural networks designed to harness the structure in frequency domain for efficient learning of long-range correlations in space or time.
This work introduces a blueprint for frequency-domain learning through a single transform: transform once (T1).
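The contrast is with Fourier-style operators that transform into and out of the frequency domain at every layer; a T1-style model pays the transform cost once. A minimal sketch under that reading, using a plain MLP over rfft coefficients (the paper's actual operator parameterization is not reproduced here):

```python
import torch
import torch.nn as nn

class TransformOnceSketch(nn.Module):
    """One forward FFT, several layers acting on the spectrum, one inverse
    FFT at the very end -- an illustration of 'transform once', not T1."""

    def __init__(self, length, hidden=128, depth=3):
        super().__init__()
        bins = length // 2 + 1
        # Real and imaginary parts are stacked into 2*bins features
        dims = [2 * bins] + [hidden] * (depth - 1) + [2 * bins]
        self.layers = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(depth))
        self.length = length

    def forward(self, x):
        # x: (B, length) -- into the frequency domain, once
        spec = torch.fft.rfft(x)
        z = torch.cat([spec.real, spec.imag], dim=-1)
        for i, layer in enumerate(self.layers):
            z = layer(z)
            if i < len(self.layers) - 1:
                z = torch.relu(z)
        real, imag = z.chunk(2, dim=-1)
        # ...and back to the signal domain, once
        return torch.fft.irfft(torch.complex(real, imag), n=self.length)
```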
arXiv Detail & Related papers (2022-11-26T01:56:05Z)
- FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization
We propose a novel frequency-aware architecture, in which the domain-specific features are filtered out in the transformed frequency domain.
Experiments on three benchmarks demonstrate strong performance, outperforming state-of-the-art methods by margins of 3%, 4%, and 9%, respectively.
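A common generic way to suppress domain-specific content in the frequency domain is to keep the phase spectrum (which tends to carry structure) while normalizing the amplitude spectrum (which tends to carry style-like, domain-specific variation). The sketch below shows only that generic trick; FAMLP learns its filter rather than applying a fixed rule.

```python
import torch

def normalize_amplitude(feats, eps=1e-8):
    """Keep each feature map's phase spectrum (structure) but replace its
    amplitude spectrum (style-like, domain-specific variation) with the
    batch average. Generic trick; FAMLP learns its filter instead."""
    spec = torch.fft.rfft2(feats)                  # feats: (B, C, H, W)
    phase = spec / (spec.abs() + eps)              # unit-magnitude spectrum
    mean_amp = spec.abs().mean(dim=0, keepdim=True)
    return torch.fft.irfft2(phase * mean_amp, s=feats.shape[-2:])
```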
arXiv Detail & Related papers (2022-03-24T07:26:29Z)
- Fourier PlenOctrees for Dynamic Radiance Field Rendering in Real-time
Implicit neural representations such as Neural Radiance Field (NeRF) have focused mainly on modeling static objects captured under multi-view settings.
We present a novel Fourier PlenOctree (FPO) technique to tackle efficient neural modeling and real-time rendering of dynamic scenes captured under the free-view video (FVV) setting.
We show that the proposed method is 3000 times faster than the original NeRF and achieves over an order of magnitude acceleration over the state of the art.
arXiv Detail & Related papers (2022-02-17T11:57:01Z)
- Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification
We propose an efficient temporal modeling 3D architecture, called VoV3D, that consists of a temporal one-shot aggregation (T-OSA) module and depthwise factorized component, D(2+1)D.
Thanks to its efficient and effective temporal modeling, VoV3D-L has 6x fewer model parameters and 16x less computation, surpassing a state-of-the-art temporal modeling method on both Something-Something and Kinetics.
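The savings come from factorizing a dense 3D convolution into a depthwise 2D spatial convolution followed by a depthwise 1D temporal convolution. A minimal sketch of such a depthwise (2+1)D component (the layer ordering and normalization choices are illustrative assumptions, not VoV3D's exact D(2+1)D block):

```python
import torch.nn as nn

def depthwise_2plus1d(channels, kernel=3):
    """Depthwise spatiotemporal factorization: a (1, k, k) spatial conv
    followed by a (k, 1, 1) temporal conv, each with groups=channels,
    in place of a dense 3D kernel. Illustrative D(2+1)D-style sketch."""
    pad = kernel // 2
    return nn.Sequential(
        nn.Conv3d(channels, channels, (1, kernel, kernel),
                  padding=(0, pad, pad), groups=channels, bias=False),
        nn.BatchNorm3d(channels),
        nn.ReLU(inplace=True),
        nn.Conv3d(channels, channels, (kernel, 1, 1),
                  padding=(pad, 0, 0), groups=channels, bias=False),
        nn.BatchNorm3d(channels),
        nn.ReLU(inplace=True),
    )
```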
arXiv Detail & Related papers (2020-12-01T07:40:06Z)
- TAM: Temporal Adaptive Module for Video Recognition
The temporal adaptive module (TAM) generates video-specific temporal kernels based on its own feature map.
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
arXiv Detail & Related papers (2020-05-14T08:22:45Z)
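Unlike DTF, TAM stays in the temporal domain: it predicts a small temporal kernel from the feature map itself and applies it as a depthwise temporal convolution. A rough sketch of that mechanism follows; the generator architecture and kernel size are assumptions, and the full TAM also includes a local attention branch that this sketch omits.

```python
import torch
import torch.nn as nn

class AdaptiveTemporalKernelSketch(nn.Module):
    """Predict a video-specific temporal kernel per channel from the
    feature map itself, then apply it as a depthwise temporal convolution.
    Rough sketch of the TAM idea, not the official code."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.kernel_gen = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels * kernel_size))

    def forward(self, x):
        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        desc = x.mean(dim=(2, 3, 4))                       # clip descriptor
        kernels = self.kernel_gen(desc).view(b, c, self.k)
        kernels = torch.softmax(kernels, dim=-1)           # normalized taps
        pad = self.k // 2
        xp = nn.functional.pad(x, (0, 0, 0, 0, pad, pad))  # pad time only
        windows = xp.unfold(2, self.k, 1)                  # (B,C,T,H,W,k)
        # Weighted sum of each temporal window with the predicted kernel
        return (windows * kernels.view(b, c, 1, 1, 1, self.k)).sum(-1)
```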