Related papers: DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition

URL: http://arxiv.org/abs/2507.12426v2
Date: Fri, 18 Jul 2025 14:10:43 GMT
Title: DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition
Authors: Hayat Ullah, Muhammad Ali Shafique, Abbas Khan, Arslan Munir
Abstract summary: A lightweight Video Focal Modulation Network, DVFL-Net, distills knowledge from a large pre-trained teacher into a compact nano student model. We evaluate DVFL-Net on UCF50, UCF101, HMDB51, SSV2, and Kinetics-400, benchmarking it against state-of-the-art methods in Human Action Recognition (HAR). The results confirm the superiority of DVFL-Net in achieving an optimal balance between performance and efficiency, demonstrating lower memory usage, reduced GFLOPs, and strong accuracy, making it a practical solution for real-time HAR applications.
Score: 1.4785166764137887
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The landscape of video recognition has evolved significantly, shifting from traditional Convolutional Neural Networks (CNNs) to Transformer-based architectures for improved accuracy. While 3D CNNs have been effective at capturing spatiotemporal dynamics, recent Transformer models leverage self-attention to model long-range spatial and temporal dependencies. Despite achieving state-of-the-art performance on major benchmarks, Transformers remain computationally expensive, particularly with dense video data. To address this, we propose a lightweight Video Focal Modulation Network, DVFL-Net, which distills spatiotemporal knowledge from a large pre-trained teacher into a compact nano student model, enabling efficient on-device deployment. DVFL-Net utilizes knowledge distillation and spatial-temporal feature modulation to significantly reduce computation while preserving high recognition performance. We employ forward Kullback-Leibler (KL) divergence alongside spatio-temporal focal modulation to effectively transfer both local and global context from the Video-FocalNet Base (teacher) to the proposed VFL-Net (student). We evaluate DVFL-Net on UCF50, UCF101, HMDB51, SSV2, and Kinetics-400, benchmarking it against recent state-of-the-art methods in Human Action Recognition (HAR). Additionally, we conduct a detailed ablation study analyzing the impact of forward KL divergence. The results confirm the superiority of DVFL-Net in achieving an optimal balance between performance and efficiency, demonstrating lower memory usage, reduced GFLOPs, and strong accuracy, making it a practical solution for real-time HAR applications.
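The abstract describes transferring knowledge from teacher to student with forward Kullback-Leibler divergence. A minimal sketch of that idea on raw class logits is given here in plain Python. It is not the authors' implementation: the function names, the temperature, the weight alpha, the temperature-squared scaling, and the added cross-entropy term are all assumptions borrowed from common knowledge-distillation practice.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def forward_kl(teacher_logits, student_logits, temperature=1.0):
    """Forward KL divergence KL(teacher || student) over class probabilities."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher distribution
    log_q = log_softmax(student_logits, temperature)  # student distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def cross_entropy(student_logits, label):
    """Cross-entropy of the student prediction against a hard label."""
    return -log_softmax(student_logits)[label]


def distillation_loss(teacher_logits, student_logits, label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target KL term and a hard-label term.

    temperature and alpha are hypothetical hyperparameters; the
    temperature**2 factor is the usual gradient-scale correction,
    assumed here rather than taken from the paper.
    """
    kd = forward_kl(teacher_logits, student_logits, temperature) * temperature ** 2
    ce = cross_entropy(student_logits, label)
    return alpha * kd + (1.0 - alpha) * ce


if __name__ == "__main__":
    teacher = [4.0, 1.0, -2.0]  # illustrative logits, not real model outputs
    student = [2.5, 0.5, -1.0]
    print(distillation_loss(teacher, student, label=0))
```

Forward KL weights the log-ratio by the teacher's probabilities, so the student is penalized most where the teacher places mass and the student does not. Swapping the two arguments of forward_kl gives the reverse direction, which generally yields a different value.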

Related papers

VT-LVLM-AR: A Video-Temporal Large Vision-Language Model Adapter for Fine-Grained Action Recognition in Long-Term Videos [8.711160469571942]
This paper introduces VT-LVLM-AR (Video-Temporal Large Vision-Language Model Adapter for Action Recognition), a novel framework designed to bridge this gap. VTEM transforms raw video into semantically rich and temporally coherent "visual event sequences". The framework consistently achieves state-of-the-art performance, surpassing existing methods.
arXiv Detail & Related papers (2025-08-21T18:03:16Z)
TDS-CLIP: Temporal Difference Side Network for Efficient Video Action Recognition [22.84073695186728]
We propose a memory-efficient Temporal Difference Side Network (TDS-CLIP) to balance knowledge transferring and temporal modeling. Specifically, we introduce a Temporal Difference Adapter (TD-Adapter), which can effectively capture local temporal differences in motion features. We also designed a Side Motion Enhancement Adapter (SME-Adapter) to guide the proposed side network in efficiently learning the rich motion information in videos.
arXiv Detail & Related papers (2024-08-20T09:40:08Z)
Promoting CNNs with Cross-Architecture Knowledge Distillation for Efficient Monocular Depth Estimation [4.242540533823568]
Transformer models are usually computationally expensive, and their effectiveness in lightweight models is limited compared to convolutions. We propose a cross-architecture knowledge distillation method for MDE, dubbed DisDepth, to enhance efficient CNN models with the supervision of state-of-the-art transformer models. Our method achieves significant improvements on various efficient backbones, showcasing its potential for efficient monocular depth estimation.
arXiv Detail & Related papers (2024-04-25T07:55:47Z)
Streaming Anchor Loss: Augmenting Supervision with Temporal Significance [5.7654216719335105]
Streaming neural network models for fast frame-wise responses to various speech and sensory signals are widely adopted on resource-constrained platforms. We propose a new loss, Streaming Anchor Loss (SAL), to better utilize the given learning capacity by encouraging the model to learn more from essential frames.
arXiv Detail & Related papers (2023-10-09T17:28:35Z)
Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts. Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention. We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z)
Spatiotemporal Attention-based Semantic Compression for Real-time Video Recognition [117.98023585449808]
We propose a spatiotemporal attention-based autoencoder (STAE) architecture to evaluate the importance of frames and pixels in each frame. We develop a lightweight decoder that leverages a combined 3D-2D CNN to reconstruct missing information. Experimental results show that ViT_STAE can compress the video dataset HMDB51 by 104x with only 5% accuracy loss.
arXiv Detail & Related papers (2023-05-22T07:47:27Z)
Video Action Recognition Collaborative Learning with Dynamics via PSO-ConvNet Transformer [1.876462046907555]
We propose a novel PSO-ConvNet model for learning actions in videos. Our experimental results on the UCF-101 dataset demonstrate substantial improvements of up to 9% in accuracy. Overall, our dynamic PSO-ConvNet model provides a promising direction for improving Human Action Recognition.
arXiv Detail & Related papers (2023-02-17T23:39:34Z)
STSM: Spatio-Temporal Shift Module for Efficient Action Recognition [4.096670184726871]
We propose a plug-and-play Spatio-Temporal Shift Module (STSM) that is both effective and high-performance. In particular, when the network is a 2D CNN, our STSM module allows the network to learn efficient spatio-temporal features.
arXiv Detail & Related papers (2021-12-05T09:40:49Z)
Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets. Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video complexity using separable convolution for efficiency. MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z)
Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations [67.18006078950337]
We use the global context information to enhance important channels and recalibrate salient time-frequency locations. The proposed modules, together with a popular ResNet based model, are evaluated on the VoxCeleb1 dataset.
arXiv Detail & Related papers (2020-09-02T01:07:29Z)
A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation. T-C3D learns video action representations in a hierarchical multi-granularity manner while achieving high processing speed. Our method improves on state-of-the-art real-time methods on the UCF101 action recognition benchmark by 5.4% in accuracy and is 2 times faster at inference, with a model that requires less than 5MB of storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.