Spiking Variational Graph Representation Inference for Video Summarization
- URL: http://arxiv.org/abs/2508.15389v1
- Date: Thu, 21 Aug 2025 09:25:42 GMT
- Title: Spiking Variational Graph Representation Inference for Video Summarization
- Authors: Wenrui Li, Wei Han, Liang-Jian Deng, Ruiqin Xiong, Xiaopeng Fan
- Abstract summary: We propose a Spiking Variational Graph (SpiVG) Network, which enhances information density and reduces computational complexity. First, we design a keyframe extractor based on Spiking Neural Networks (SNN), leveraging the event-driven mechanism of SNNs to learn keyframe features autonomously. We present a Variational Inference Reconstruction Module to address uncertainty and noise arising during multi-channel feature fusion.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rise of short video content, efficient video summarization techniques for extracting key information have become crucial. However, existing methods struggle to capture the global temporal dependencies and maintain the semantic coherence of video content. Additionally, these methods are also influenced by noise during multi-channel feature fusion. We propose a Spiking Variational Graph (SpiVG) Network, which enhances information density and reduces computational complexity. First, we design a keyframe extractor based on Spiking Neural Networks (SNN), leveraging the event-driven computation mechanism of SNNs to learn keyframe features autonomously. To enable fine-grained and adaptable reasoning across video frames, we introduce a Dynamic Aggregation Graph Reasoner, which decouples contextual object consistency from semantic perspective coherence. We present a Variational Inference Reconstruction Module to address uncertainty and noise arising during multi-channel feature fusion. In this module, we employ Evidence Lower Bound Optimization (ELBO) to capture the latent structure of multi-channel feature distributions, using posterior distribution regularization to reduce overfitting. Experimental results show that SpiVG surpasses existing methods across multiple datasets such as SumMe, TVSum, VideoXum, and QFVS. Our codes and pre-trained models are available at https://github.com/liwrui/SpiVG.
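The ELBO objective described above, a reconstruction term minus a KL regularizer on the posterior, can be sketched with a toy Gaussian latent model (all names, shapes, and the identity decoder are illustrative, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo(x, mu, log_var, recon):
    """ELBO = reconstruction log-likelihood - KL(q(z|x) || N(0, I))."""
    recon_ll = -0.5 * np.sum((x - recon) ** 2)  # Gaussian log-likelihood (up to a constant)
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
    return recon_ll - kl

# Toy "fused" feature vector and a hypothetical encoder pass.
x = rng.normal(size=8)
mu, log_var = 0.9 * x, np.full(8, -2.0)              # encoder outputs (illustrative)
z = mu + np.exp(0.5 * log_var) * rng.normal(size=8)  # reparameterization trick
recon = z                                            # identity decoder, for the sketch only
print(elbo(x, mu, log_var, recon))
```

Maximizing this quantity pulls the posterior toward the prior (the regularization the abstract credits with reducing overfitting) while keeping reconstructions faithful.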
Related papers
- Language-Guided Graph Representation Learning for Video Summarization [96.2763459348758]
We propose a novel Language-guided Graph Representation Learning Network (LGRLN) for video summarization. Specifically, we introduce a video graph generator that converts video frames into a structured graph to preserve temporal order and contextual dependencies. Our method outperforms existing approaches across multiple benchmarks.
arXiv Detail & Related papers (2025-11-14T04:35:48Z) - MSNeRV: Neural Video Representation with Multi-Scale Feature Fusion [27.621656985302973]
Implicit neural representations (INRs) have emerged as a promising approach for video compression. Existing INR-based methods struggle to effectively represent detail-intensive and fast-changing video content. We propose a multi-scale feature fusion framework, MSNeRV, for neural video representation.
arXiv Detail & Related papers (2025-06-18T08:57:12Z) - DiffVQA: Video Quality Assessment Using Diffusion Feature Extractor [22.35724335601674]
Video Quality Assessment (VQA) aims to evaluate video quality based on perceptual distortions and human preferences. We introduce a novel VQA framework, DiffVQA, which harnesses the robust generalization capabilities of diffusion models pre-trained on extensive datasets.
arXiv Detail & Related papers (2025-05-06T07:42:24Z) - SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
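Distributing features evenly across a limited number of clusters, as SIGMA does, can be illustrated with a minimal Sinkhorn-Knopp normalization (a generic sketch of the technique, not the paper's code; rows stand for space-time tube features, columns for clusters):

```python
import numpy as np

def sinkhorn(scores, n_iters=3, eps=0.05):
    """Sinkhorn-Knopp: turn feature-to-cluster similarity scores into a soft,
    balanced assignment with (near-)uniform marginals."""
    Q = np.exp(scores / eps)
    Q /= Q.sum()
    r = np.ones(Q.shape[0]) / Q.shape[0]   # uniform feature marginal
    c = np.ones(Q.shape[1]) / Q.shape[1]   # uniform cluster marginal -> even cluster usage
    for _ in range(n_iters):
        Q *= (r / Q.sum(axis=1))[:, None]  # normalize rows
        Q *= (c / Q.sum(axis=0))[None, :]  # normalize columns
    return Q

scores = np.random.default_rng(1).normal(size=(6, 3))
Q = sinkhorn(scores)
print(Q.sum(axis=0))  # each column sums to 1/3: clusters are used evenly
```

The balanced marginals are what prevent the collapse where all features land in one cluster.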
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Progressive Fourier Neural Representation for Sequential Video Compilation [75.43041679717376]
Motivated by continual learning, this work investigates how to accumulate and transfer neural implicit representations for multiple complex video data over sequential encoding sessions.
We propose a novel method, Progressive Fourier Neural Representation (PFNR), that aims to find an adaptive and compact sub-module in Fourier space to encode videos in each training session.
We validate our PFNR method on the UVG8/17 and DAVIS50 video sequence benchmarks and achieve impressive performance gains over strong continual learning baselines.
arXiv Detail & Related papers (2023-06-20T06:02:19Z) - Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network, that significantly reduces the computational complexity.
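The idea of replacing a fully connected graph with a sparser, dynamically constructed one can be sketched as a k-nearest-neighbour message-passing step (the paper learns its node sampling; this toy version uses plain Euclidean k-NN and mean aggregation, so all choices here are illustrative):

```python
import numpy as np

def sparse_message_pass(feats, k=3):
    """One message-passing step over a k-NN graph: each node aggregates
    features from its k nearest neighbours instead of all N nodes."""
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    nbrs = np.argsort(d2, axis=1)[:, 1:k + 1]  # k nearest, excluding self
    return feats[nbrs].mean(axis=1)            # mean aggregation (illustrative)

feats = np.random.default_rng(2).normal(size=(10, 4))  # 10 nodes, 4-dim features
out = sparse_message_pass(feats)
print(out.shape)  # (10, 4)
```

Aggregating over k neighbours rather than all N nodes is what removes the quadratic message cost of the fully connected graph.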
arXiv Detail & Related papers (2022-09-20T14:41:37Z) - Bayesian Nonparametric Submodular Video Partition for Robust Anomaly Detection [9.145168943972067]
Multiple-instance learning (MIL) provides an effective way to tackle the video anomaly detection problem.
We propose to conduct novel Bayesian non-parametric submodular video partition (BN-SVP) to significantly improve MIL model training.
Our theoretical analysis ensures a strong performance guarantee of the proposed algorithm.
arXiv Detail & Related papers (2022-03-24T04:00:49Z) - Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video frame interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with the self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video frame interpolation.
In addition, we develop a multi-scale frame scheme to fully realize the potential of Transformers.
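Restricting self-attention to a local temporal window, as the entry describes, can be sketched as follows (a generic windowed-attention toy over per-frame feature vectors, not the paper's implementation):

```python
import numpy as np

def local_attention(q, k, v, window=2):
    """Attention restricted to a local window: each query attends only to
    keys within +/-window frames, avoiding global O(T^2) attention."""
    T, d = q.shape
    out = np.zeros_like(v)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        w = q[t] @ k[lo:hi].T / np.sqrt(d)
        w = np.exp(w - w.max())
        w /= w.sum()                 # softmax over the local window
        out[t] = w @ v[lo:hi]
    return out

rng = np.random.default_rng(3)
q = k = v = rng.normal(size=(8, 4))  # 8 frames, 4-dim features
print(local_attention(q, k, v).shape)  # (8, 4)
```

Each frame's cost is now proportional to the window size rather than the sequence length.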
arXiv Detail & Related papers (2021-11-27T05:35:10Z) - Perceptron Synthesis Network: Rethinking the Action Scale Variances in Videos [48.57686258913474]
Video action recognition has been partially addressed by CNNs stacking fixed-size 3D kernels.
We propose to learn the optimal-scale kernels from the data.
An action perceptron synthesizer is proposed to generate the kernels from a bag of fixed-size kernels.
arXiv Detail & Related papers (2020-07-22T14:22:29Z) - Disentangling Multiple Features in Video Sequences using Gaussian Processes in Variational Autoencoders [6.461473289206789]
We introduce MGP-VAE, a variational autoencoder which uses Gaussian processes (GP) to model the latent space for the unsupervised learning of disentangled representations in video sequences.
We use fractional Brownian motions (fBM) and Brownian bridges (BB) to enforce an inter-frame correlation structure in each independent channel, and show that varying this structure enables one to capture different factors of variation in the data.
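The fBM prior used here has a closed-form covariance kernel, K(s, t) = 0.5 * (s^{2H} + t^{2H} - |s - t|^{2H}), where the Hurst parameter H controls the inter-frame correlation structure. A minimal sketch (generic fBM kernel, not the MGP-VAE code; timestamps are illustrative):

```python
import numpy as np

def fbm_cov(times, hurst=0.7):
    """Covariance of fractional Brownian motion:
    K(s, t) = 0.5 * (s^{2H} + t^{2H} - |s - t|^{2H})."""
    s, t = np.meshgrid(times, times, indexing="ij")
    h2 = 2 * hurst
    return 0.5 * (s**h2 + t**h2 - np.abs(s - t)**h2)

times = np.linspace(0.1, 1.0, 5)  # frame timestamps
K = fbm_cov(times)
# H > 0.5 gives positively correlated increments across frames;
# H = 0.5 recovers standard Brownian motion, K(s, t) = min(s, t).
print(np.allclose(K, K.T))  # covariance matrix is symmetric
```

Varying H (and swapping in a Brownian-bridge kernel) changes the correlation structure of each latent channel, which is how different factors of variation get separated.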
arXiv Detail & Related papers (2020-01-08T08:08:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.