A Novel Trustworthy Video Summarization Algorithm Through a Mixture of LoRA Experts
- URL: http://arxiv.org/abs/2503.06064v1
- Date: Sat, 08 Mar 2025 05:20:52 GMT
- Title: A Novel Trustworthy Video Summarization Algorithm Through a Mixture of LoRA Experts
- Authors: Wenzhuo Du, Gerun Wang, Guancheng Chen, Hang Zhao, Xin Li, Jian Gao
- Abstract summary: Video-LLaMA is an effective tool for generating video summaries, but it cannot effectively unify and optimize the modeling of temporal and spatial features. We propose MiLoRA-ViSum to more efficiently capture the complex temporal dynamics and spatial relationships inherent in video data. MiLoRA-ViSum achieves the best summarization performance among state-of-the-art models while maintaining significantly lower computational costs.
- Score: 29.05750068740863
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the exponential growth of user-generated content on video-sharing platforms, the challenge of facilitating efficient searching and browsing of videos has garnered significant attention. To enhance users' ability to swiftly locate and review pertinent videos, the creation of concise and informative video summaries has become increasingly important. Video-LLaMA is an effective tool for generating video summaries, but it cannot effectively unify and optimize the modeling of temporal and spatial features, and it requires substantial computational resources and time. Therefore, we propose MiLoRA-ViSum to more efficiently capture the complex temporal dynamics and spatial relationships inherent in video data and to control the number of trainable parameters. By extending traditional Low-Rank Adaptation (LoRA) into a mixture-of-experts paradigm, MiLoRA-ViSum incorporates a dual temporal-spatial adaptation mechanism tailored specifically to video summarization tasks. This approach dynamically integrates specialized LoRA experts, each fine-tuned to address distinct temporal or spatial dimensions. Extensive evaluations on the VideoXum and ActivityNet datasets demonstrate that MiLoRA-ViSum outperforms state-of-the-art models in summarization quality while maintaining significantly lower computational costs. The proposed mixture-of-experts strategy, combined with the dual adaptation mechanism, highlights the model's potential to enhance video summarization capabilities, particularly in large-scale applications requiring both efficiency and precision.
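The paper's code is not reproduced here, but the core idea the abstract describes, a frozen backbone augmented with several low-rank (LoRA) adapters, some nominally temporal and some spatial, combined by a learned gate, can be sketched in a few lines of PyTorch. Everything below (module names, ranks, the per-token softmax gate) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    """One low-rank adapter: delta_W = B @ A with rank r << d."""
    def __init__(self, d_in, d_out, rank=8, alpha=16.0):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)
        self.B = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.B.weight)  # standard LoRA init: adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.B(self.A(x)) * self.scale

class MiLoRALayer(nn.Module):
    """Frozen base projection plus a gated mixture of LoRA experts."""
    def __init__(self, d_model, n_temporal=2, n_spatial=2, rank=8):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)
        self.base.requires_grad_(False)  # only the adapters and the gate are trained
        n_experts = n_temporal + n_spatial
        self.experts = nn.ModuleList(
            LoRAExpert(d_model, d_model, rank) for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x):  # x: (batch, tokens, d_model), tokens = frames x patches
        weights = F.softmax(self.gate(x), dim=-1)              # per-token expert weights
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)
        mixed = (expert_out * weights.unsqueeze(-2)).sum(dim=-1)
        return self.base(x) + mixed

layer = MiLoRALayer(d_model=256)
video_tokens = torch.randn(2, 64, 256)
print(layer(video_tokens).shape)  # torch.Size([2, 64, 256])
```

Because the base projection is frozen and each expert has rank far below the model dimension, the trainable parameter count stays small, which matches the efficiency claim in the abstract.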
Related papers
- A Lightweight Moment Retrieval System with Global Re-Ranking and Robust Adaptive Bidirectional Temporal Search [3.4271696759611068]
The exponential growth of digital video content has posed critical challenges in moment-level video retrieval.
Current retrieval systems are constrained by computational inefficiencies, temporal context limitations, and the intrinsic complexity of navigating video content.
arXiv Detail & Related papers (2025-04-12T17:49:46Z) - VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning [33.37714717781103]
VideoMind is a novel video-language agent designed for temporal-grounded video understanding.
We identify essential capabilities for video temporal reasoning and develop a role-based agentic workflow.
We propose a novel Chain-of-LoRA strategy, enabling seamless role-switching via lightweight LoRA adaptors.
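As a rough, assumption-laden illustration of what "role-switching via lightweight LoRA adaptors" can look like in practice: a frozen linear layer carries one low-rank delta per role, and switching roles simply changes which delta is applied. The role names and rank below are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class RoleSwitchedLinear(nn.Module):
    def __init__(self, d_model, roles=("planner", "grounder", "verifier"), rank=4):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)
        self.base.requires_grad_(False)  # backbone stays frozen across roles
        self.A = nn.ParameterDict(
            {r: nn.Parameter(torch.randn(rank, d_model) * 0.01) for r in roles})
        self.B = nn.ParameterDict(
            {r: nn.Parameter(torch.zeros(d_model, rank)) for r in roles})
        self.role = roles[0]

    def switch(self, role):  # cheap: just select which adapter is active
        self.role = role

    def forward(self, x):
        delta = x @ self.A[self.role].T @ self.B[self.role].T
        return self.base(x) + delta

layer = RoleSwitchedLinear(d_model=64)
x = torch.randn(1, 10, 64)
for role in ("planner", "grounder", "verifier"):
    layer.switch(role)
    y = layer(x)  # same frozen weights, a different low-rank delta per role
```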
arXiv Detail & Related papers (2025-03-17T17:59:33Z) - Temporal Preference Optimization for Long-Form Video Understanding [28.623353303256653]
Temporal Preference Optimization (TPO) is a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs.
TPO significantly enhances temporal understanding while reducing reliance on manually annotated data.
LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark.
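TPO's exact objective is not given in this summary; as a point of reference, preference-based post-training typically optimizes a pairwise DPO-style loss such as the sketch below, where the "chosen" responses would be the more temporally grounded ones. All tensors and the beta value are illustrative.

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Push the policy to prefer chosen over rejected responses,
    measured relative to a frozen reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up sequence log-probabilities for a batch of 4 pairs.
lp_c, lp_r = torch.randn(4) - 1.0, torch.randn(4) - 2.0
ref_c, ref_r = torch.randn(4) - 1.5, torch.randn(4) - 1.5
print(float(preference_loss(lp_c, lp_r, ref_c, ref_r)))
```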
arXiv Detail & Related papers (2025-01-23T18:58:03Z) - SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content.
We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context.
Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z) - VideoMamba: Spatio-Temporal Selective State Space Model [18.310796559944347]
VideoMamba is a novel adaptation of the pure Mamba architecture, specifically designed for video recognition.
VideoMamba is not only resource-efficient but also effective in capturing long-range dependency in videos.
Our work highlights the potential of VideoMamba as a powerful tool for video understanding, offering a simple yet effective baseline for future research in video analysis.
arXiv Detail & Related papers (2024-07-11T13:11:21Z) - Video-based Person Re-identification with Long Short-Term Representation Learning [101.62570747820541]
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapping cameras.
We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
arXiv Detail & Related papers (2023-08-07T16:22:47Z) - Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves the performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z) - Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures trained on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - A Gated Fusion Network for Dynamic Saliency Prediction [16.701214795454536]
We propose the Gated Fusion Network for dynamic saliency (GFSalNet).
GFSalNet is the first deep saliency model capable of making predictions in a dynamic way via a gated fusion mechanism.
We show that it generalizes well and, moreover, exploits temporal information more effectively via its adaptive fusion scheme.
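A gated fusion mechanism of the kind described can be sketched as a learned per-pixel sigmoid gate that mixes a static (appearance) stream with a dynamic (motion) stream; the structure below is an assumption in the spirit of the summary, not GFSalNet's actual architecture.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # The gate sees both streams and emits per-pixel mixing weights in [0, 1].
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, static_feat, dynamic_feat):
        g = self.gate(torch.cat([static_feat, dynamic_feat], dim=1))
        return g * dynamic_feat + (1 - g) * static_feat

fuse = GatedFusion(channels=32)
appearance = torch.randn(1, 32, 28, 28)  # per-frame appearance features
motion = torch.randn(1, 32, 28, 28)      # temporal (motion) features
fused = fuse(appearance, motion)         # (1, 32, 28, 28)
```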
arXiv Detail & Related papers (2021-02-15T17:18:37Z) - An Efficient Recurrent Adversarial Framework for Unsupervised Real-Time Video Enhancement [132.60976158877608]
We propose an efficient adversarial video enhancement framework that learns directly from unpaired video examples.
In particular, our framework introduces new recurrent cells that consist of interleaved local and global modules for implicit integration of spatial and temporal information.
The proposed design allows our recurrent cells to efficiently propagate spatio-temporal information across frames and reduces the need for high-complexity networks.
arXiv Detail & Related papers (2020-12-24T00:03:29Z) - Exploring global diverse attention via pairwise temporal relation for video summarization [84.28263235895798]
We propose an efficient convolutional neural network architecture for video SUMmarization via Global Diverse Attention.
The proposed models can be run in parallel with significantly less computational costs.
arXiv Detail & Related papers (2020-09-23T06:29:09Z)