EdgeVidSum: Real-Time Personalized Video Summarization at the Edge
- URL: http://arxiv.org/abs/2506.03171v1
- Date: Wed, 28 May 2025 18:59:41 GMT
- Title: EdgeVidSum: Real-Time Personalized Video Summarization at the Edge
- Authors: Ghulam Mujtaba, Eun-Seok Ryu
- Abstract summary: EdgeVidSum is a method that generates personalized, fast-forward summaries of long-form videos directly on edge devices. The framework employs a hierarchical analysis approach, where a lightweight 2D CNN model identifies user-preferred content from thumbnails. Our interactive demo highlights the system's ability to create tailored video summaries for long-form videos, such as movies, sports events, and TV shows, based on individual user preferences.
- Score: 3.102586911584193
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: EdgeVidSum is a lightweight method that generates personalized, fast-forward summaries of long-form videos directly on edge devices. The proposed approach enables real-time video summarization while safeguarding user privacy through local data processing using innovative thumbnail-based techniques and efficient neural architectures. Unlike conventional methods that process entire videos frame by frame, the proposed method uses thumbnail containers to significantly reduce computational complexity without sacrificing semantic relevance. The framework employs a hierarchical analysis approach, where a lightweight 2D CNN model identifies user-preferred content from thumbnails and generates timestamps to create fast-forward summaries. Our interactive demo highlights the system's ability to create tailored video summaries for long-form videos, such as movies, sports events, and TV shows, based on individual user preferences. The entire computation occurs seamlessly on resource-constrained devices like Jetson Nano, demonstrating how EdgeVidSum addresses the critical challenges of computational efficiency, personalization, and privacy in modern video consumption environments.
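For readers who want the gist of the pipeline, here is a minimal Python sketch of the thumbnail-based flow the abstract describes. All names (Thumbnail, DummyCNN, extract_thumbnails, summarize) and the 2-second sampling interval are illustrative assumptions, not EdgeVidSum's actual code.

```python
"""Minimal sketch of a thumbnail-driven fast-forward summarizer.
Everything here is an illustrative stand-in, not EdgeVidSum's code."""
import random
from dataclasses import dataclass

@dataclass
class Thumbnail:
    timestamp: float  # seconds into the source video
    image: list       # stand-in for decoded low-res thumbnail pixels

def extract_thumbnails(duration_s: float, interval_s: float = 2.0):
    """Decode only sparse, low-resolution thumbnails instead of every
    frame, which is where the computational savings come from."""
    t = 0.0
    while t < duration_s:
        yield Thumbnail(timestamp=t, image=[])  # real code: ffmpeg/OpenCV decode
        t += interval_s

class DummyCNN:
    """Stand-in for the lightweight 2D CNN that labels thumbnails."""
    CLASSES = ["action", "dialogue", "crowd", "other"]

    def predict(self, image):
        return random.choice(self.CLASSES)  # replace with real inference

def summarize(duration_s, model, preferred, interval_s=2.0):
    """Return timestamps whose thumbnails match user-preferred classes;
    these drive the fast-forward summary (kept segments play normally,
    everything else is skipped or sped up)."""
    return [th.timestamp
            for th in extract_thumbnails(duration_s, interval_s)
            if model.predict(th.image) in preferred]

stamps = summarize(7200.0, DummyCNN(), preferred={"action"})
print(f"{len(stamps)} candidate timestamps for a 2-hour video")
```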
Related papers
- FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding [17.71123451197036]
The complexity of video data and contextual processing limitations still hinder long-video comprehension. We propose FiLA-Video, a novel framework that integrates multiple frames into a single representation. FiLA-Video achieves superior efficiency and accuracy in long-video comprehension compared to existing methods.
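As a rough illustration of "multiple frames into a single representation", the sketch below pools per-frame features into one vector; fuse_frames is a hypothetical name, and simple weighted pooling stands in for whatever fusion FiLA-Video actually uses.

```python
# Toy weighted pooling of per-frame features into one representation;
# a stand-in for FiLA-Video's actual fusion, which is not shown here.
import numpy as np

def fuse_frames(frame_feats, weights=None):
    """frame_feats: (T, D) per-frame features -> (D,) fused vector."""
    T = frame_feats.shape[0]
    w = np.full(T, 1.0 / T) if weights is None else weights / weights.sum()
    return w @ frame_feats  # convex combination of the frames

fused = fuse_frames(np.random.randn(32, 128))
print(fused.shape)  # (128,)
```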
arXiv Detail & Related papers (2025-04-29T03:09:46Z)
- Multi-subject Open-set Personalization in Video Generation [110.02124633005516]
We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt. Our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
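A hedged sketch of the fusion idea: per-subject image tokens and text tokens are projected into one conditioning sequence that diffusion-transformer blocks can cross-attend to. SubjectFusion and all dimensions below are hypothetical, not the Video Alchemist architecture.

```python
# Hypothetical fusion of per-subject (reference image, text) token pairs
# into a single conditioning sequence for a diffusion transformer.
import torch
import torch.nn as nn

class SubjectFusion(nn.Module):
    def __init__(self, img_dim=512, txt_dim=384, d_model=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d_model)  # align image tokens
        self.txt_proj = nn.Linear(txt_dim, d_model)  # align text tokens

    def forward(self, img_tokens, txt_tokens):
        """img_tokens: (S, Ni, img_dim), txt_tokens: (S, Nt, txt_dim) for
        S subjects -> (S * (Ni + Nt), d_model) conditioning sequence."""
        fused = torch.cat([self.img_proj(img_tokens),
                           self.txt_proj(txt_tokens)], dim=1)
        return fused.flatten(0, 1)  # cross-attended by the DiT blocks

cond = SubjectFusion()(torch.randn(2, 16, 512), torch.randn(2, 8, 384))
print(cond.shape)  # torch.Size([48, 256])
```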
arXiv Detail & Related papers (2025-01-10T18:59:54Z)
- What's in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning [71.94122309290537]
We propose an efficient, online approach to generate dense captions for videos.
Our model uses a novel autoregressive factorized decoding architecture.
Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute.
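To make "online, factorized decoding" concrete, here is a sketch that captions clips one at a time while carrying a bounded memory of earlier clips, instead of decoding against the whole video offline. The function name and memory size are illustrative assumptions, not the paper's architecture.

```python
# Illustrative online captioning loop: each clip is decoded against a
# bounded running memory rather than the full video (names hypothetical).
def online_dense_captions(clips, captioner, memory_size=8):
    memory, captions = [], []
    for clip in clips:                            # clips arrive as a stream
        text, state = captioner(clip, memory)     # caption_t | clip_t, memory
        captions.append(text)
        memory = (memory + [state])[-memory_size:]  # keep context bounded
    return captions

# Trivial stand-in captioner for demonstration purposes only.
demo = lambda clip, mem: (f"clip {clip}: saw {len(mem)} earlier clips", clip)
print(online_dense_captions(range(3), demo))
```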
arXiv Detail & Related papers (2024-11-22T02:46:44Z)
- CSTA: CNN-based Spatiotemporal Attention for Video Summarization [0.24578723416255752]
We propose a CNN-based SpatioTemporal Attention (CSTA) method that stacks the frame features of a single video to form image-like representations.
Our methodology relies on a CNN to comprehend inter- and intra-frame relations and to find crucial attributes in videos by exploiting its ability to learn absolute positions within images.
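The "image-like frame representation" idea can be sketched as follows: per-frame feature vectors become the rows of a 2D map that a small 2D CNN scores, yielding per-frame importance. The toy FrameStackAttention module below is an assumption-laden stand-in for CSTA itself.

```python
# Sketch of stacking frame features into an image-like map: per-frame
# feature vectors become rows of a 2D grid a CNN can attend over.
import torch
import torch.nn as nn

class FrameStackAttention(nn.Module):
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.conv = nn.Sequential(              # toy 2D CNN over the (T x D) map
            nn.Conv2d(1, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        """frame_feats: (T, D) -> per-frame importance scores (T,)."""
        x = frame_feats.unsqueeze(0).unsqueeze(0)   # (1, 1, T, D): image-like
        attn = self.conv(x).squeeze(0).squeeze(0)   # (T, D) attention map
        return attn.mean(dim=1)                     # pool over the feature axis

scores = FrameStackAttention()(torch.randn(120, 64))  # 120 frames, 64-d features
print(scores.shape)  # torch.Size([120])
```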
arXiv Detail & Related papers (2024-05-20T09:38:37Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
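As a loose illustration of decoupled tokenization, the sketch below splits a clip into a keyframe plus motion residuals and reconstructs it; frame differencing is a crude stand-in for the motion representation Video-LaVIT actually tokenizes.

```python
# Toy keyframe + motion decomposition (frame differences stand in for
# real motion vectors); illustrates the decoupling, not the actual model.
import numpy as np

def decompose(clip):
    """clip: (T, H, W) array -> (keyframe, per-step motion residuals)."""
    return clip[0], clip[1:] - clip[:-1]

def reconstruct(key, motion):
    """Invert decompose(): accumulate residuals back onto the keyframe."""
    return np.concatenate([key[None], key + np.cumsum(motion, axis=0)])

clip = np.random.rand(8, 4, 4)
key, motion = decompose(clip)
assert np.allclose(reconstruct(key, motion), clip)  # lossless round trip
```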
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion. We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding [57.917616284917756]
Real-world videos are often several minutes long with semantically consistent segments of variable length.
A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length.
This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative.
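A minimal sketch of the adaptive-segmentation idea: choose boundaries that minimize within-segment scatter via dynamic programming, so segment lengths adapt to the content rather than being fixed. This is a simplified cousin of kernel temporal segmentation, not the paper's exact formulation.

```python
# Simplified change-point segmentation in the spirit of kernel temporal
# segmentation (KTS): boundaries minimize within-segment scatter,
# yielding variable-length segments instead of uniform clips.
import numpy as np

def segment_cost(feats, a, b):
    """Scatter of frames a..b-1 around their mean (lower = more coherent)."""
    seg = feats[a:b]
    return float(((seg - seg.mean(axis=0)) ** 2).sum())

def kts_like(feats, n_segments):
    """Dynamic program over boundary positions, O(k * n^2) transitions."""
    n = len(feats)
    dp = np.full((n_segments + 1, n + 1), float("inf"))
    back = np.zeros((n_segments + 1, n + 1), dtype=int)
    dp[0, 0] = 0.0
    for k in range(1, n_segments + 1):
        for end in range(k, n + 1):
            for start in range(k - 1, end):
                c = dp[k - 1, start] + segment_cost(feats, start, end)
                if c < dp[k, end]:
                    dp[k, end], back[k, end] = c, start
    bounds, end = [], n          # backtrack the chosen boundaries
    for k in range(n_segments, 0, -1):
        bounds.append(end)
        end = back[k, end]
    return sorted(bounds)[:-1]   # interior change points only

feats = np.concatenate([np.zeros((10, 4)), np.ones((15, 4)), 2 * np.ones((5, 4))])
print(kts_like(feats, 3))  # -> [10, 25]
```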
arXiv Detail & Related papers (2023-09-20T18:13:32Z)
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
- LTC-SUM: Lightweight Client-driven Personalized Video Summarization Framework Using 2D CNN [5.95248889179516]
This paper proposes a novel lightweight thumbnail container-based summarization (LTC-SUM) framework for full feature-length videos.
It generates a personalized keyshot summary for concurrent users by using the computational resource of the end-user device.
arXiv Detail & Related papers (2022-01-22T13:54:13Z)
- Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
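One way to read "pinpointed progressively" is greedy coverage: each round adds the segment that best covers what the current summary misses. The facility-location-style sketch below is an interpretation under that assumption, not the paper's actual method.

```python
# Toy progressive summary selection: repeatedly add the segment that
# most improves coverage of the rest, instead of scoring all segments once.
import numpy as np

def progressive_select(seg_feats, budget):
    """seg_feats: (n, d) L2-normalized segment features -> chosen indices."""
    n = len(seg_feats)
    sims = seg_feats @ seg_feats.T            # pairwise similarity
    chosen, covered = [], np.zeros(n)         # covered[j]: best coverage of j so far
    for _ in range(budget):
        gains = np.maximum(sims, covered).sum(axis=1) - covered.sum()
        gains[chosen] = -np.inf               # never re-pick a segment
        best = int(np.argmax(gains))
        chosen.append(best)
        covered = np.maximum(covered, sims[best])
    return chosen

feats = np.random.randn(20, 8)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
print(progressive_select(feats, budget=3))
```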
arXiv Detail & Related papers (2022-01-07T15:21:46Z)
- A Sparse Sampling-based framework for Semantic Fast-Forward of First-Person Videos [2.362412515574206]
Most uploaded videos are doomed to be forgotten and unwatched, stashed away in some computer folder or website.
We present a new adaptive frame selection formulated as a weighted minimum reconstruction problem.
Our method is able to retain as much relevant information and smoothness as the state-of-the-art techniques, but in less processing time.
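To illustrate "weighted minimum reconstruction": pick a subset of frames whose span reconstructs all importance-weighted frames well. The greedy column-selection sketch below stands in for the paper's actual solver; recon_error, select_frames, and the weighting scheme are assumptions.

```python
# Hedged sketch of frame selection as weighted minimum reconstruction:
# greedily choose frames minimizing the error of reconstructing all
# (relevance-weighted) frames from the chosen frames' span.
import numpy as np

def recon_error(Xw, basis_rows):
    """Least-squares error of reconstructing Xw from the chosen rows."""
    A, *_ = np.linalg.lstsq(basis_rows.T, Xw.T, rcond=None)
    return float(np.linalg.norm(Xw - (basis_rows.T @ A).T))

def select_frames(X, w, k):
    """X: (n_frames, d) features; w: per-frame relevance weights."""
    Xw = X * w[:, None]                 # weight each frame by its relevance
    chosen = []
    for _ in range(k):                  # greedy: add the frame that helps most
        best_j, best_err = None, np.inf
        for j in range(len(X)):
            if j in chosen:
                continue
            err = recon_error(Xw, X[chosen + [j]])
            if err < best_err:
                best_j, best_err = j, err
        chosen.append(best_j)
    return sorted(chosen)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))            # 40 frames, 6-d features
w = np.linspace(0.2, 1.0, 40)           # later frames weighted higher
print(select_frames(X, w, k=5))
```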
arXiv Detail & Related papers (2020-09-21T18:36:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.