Personalized Video Summarization by Multimodal Video Understanding
- URL: http://arxiv.org/abs/2411.03531v1
- Date: Tue, 05 Nov 2024 22:14:35 GMT
- Title: Personalized Video Summarization by Multimodal Video Understanding
- Authors: Brian Chen, Xiangyuan Zhao, Yingnan Zhu,
- Abstract summary: We present a pipeline called Video Summarization with Language (VSL) for user-preferred video summarization.
VSL is based on pre-trained visual language models (VLMs) to avoid the need to train a video summarization system on a large training dataset.
We show that our method is more adaptable across different datasets compared to supervised query-based video summarization models.
- Score: 2.1372652192505703
- Abstract: Video summarization techniques have been proven to improve the overall user experience when it comes to accessing and comprehending video content. If the user's preference is known, video summarization can identify significant information or relevant content from an input video, aiding the user in obtaining the necessary information or deciding whether to watch the original video. Adapting video summarization to various types of video and user preferences requires significant training data and expensive human labeling. To facilitate such research, we propose a new benchmark for video summarization that captures various user preferences. We also present a pipeline called Video Summarization with Language (VSL) for user-preferred video summarization that is based on pre-trained visual language models (VLMs), avoiding the need to train a video summarization system on a large training dataset. The pipeline takes both video and closed captioning as input and performs semantic analysis at the scene level by converting video frames into text. Subsequently, the user's genre preference is used as the basis for selecting the pertinent textual scenes. The experimental results demonstrate that our proposed pipeline outperforms current state-of-the-art unsupervised video summarization models. We also show that our method is more adaptable across different datasets than supervised query-based video summarization models. Finally, a runtime analysis demonstrates that our pipeline is better suited for practical use when scaling up the number of user preferences and videos.
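The pipeline described above is training-free: a VLM turns each scene into text, and scenes are then ranked against the user's stated preference. The following sketch illustrates that flow under stated assumptions; `caption_scene` is a hypothetical stand-in for the paper's VLM captioning step, and the ranking uses generic sentence-embedding similarity rather than the authors' exact selection procedure.

```python
# Minimal sketch of a VSL-style, training-free summarization flow (assumptions noted above).
from sentence_transformers import SentenceTransformer, util


def caption_scene(frames, closed_captions):
    """Placeholder for the scene-level VLM step: describe one scene as text.

    A real implementation would prompt a pre-trained visual language model
    with the scene's frames; here we fall back to the closed captions only.
    """
    return " ".join(closed_captions)


def summarize(scenes, user_preference, top_k=5):
    """Rank scenes by how well their textual description matches the user's preference."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    scene_texts = [caption_scene(s["frames"], s["captions"]) for s in scenes]
    pref_emb = encoder.encode(user_preference, convert_to_tensor=True)
    scene_embs = encoder.encode(scene_texts, convert_to_tensor=True)
    scores = util.cos_sim(pref_emb, scene_embs)[0]   # one similarity score per scene
    keep = scores.argsort(descending=True)[:top_k]   # highest-scoring scenes
    return sorted(int(i) for i in keep)              # restore temporal order
```

Because nothing is trained, supporting a new user preference only changes the query text, which is consistent with the runtime argument made in the abstract.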
Related papers
- Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos, each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z)
- A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval [43.58794386905177]
Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime.
This neglects the richness and variety of possible valid descriptions of a video.
We propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos.
arXiv Detail & Related papers (2023-11-30T18:59:45Z)
- DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z)
- Causal Video Summarizer for Video Exploration [74.27487067877047]
Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and query.
Based on the evaluation of the existing multi-modal video summarization dataset, experimental results show that the proposed approach is effective.
arXiv Detail & Related papers (2023-07-04T22:52:16Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z)
- CLIP-It! Language-Guided Video Summarization [96.69415453447166]
This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization.
We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another.
Our model can be extended to the unsupervised setting by training without ground-truth supervision (a generic language-guided frame-scoring sketch in this spirit appears after this list).
arXiv Detail & Related papers (2021-07-01T17:59:27Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study of advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization [18.543372365239673]
The proposed model consists of a contextualized video summary controller, multi-modal attention mechanisms, an interactive attention network, and a video summary generator.
Results show that the proposed model is effective, with a +5.88% increase in accuracy and a +4.06% increase in F1-score compared with the state-of-the-art method.
arXiv Detail & Related papers (2021-04-26T10:50:37Z)
- Straight to the Point: Fast-forwarding Videos via Reinforcement Learning Using Textual Data [1.004766879203303]
We present a novel methodology based on a reinforcement learning formulation to accelerate instructional videos.
Our approach can adaptively select frames that are not relevant to conveying the information, without creating gaps in the final video.
We propose a novel network, called Visually-guided Document Attention Network (VDAN), able to generate a highly discriminative embedding space.
arXiv Detail & Related papers (2020-03-31T14:07:45Z)
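For the query-focused setting that several of the entries above touch on (e.g., CLIP-It), a common building block is scoring frames against a language query with a pretrained image-text model. The sketch below is a generic illustration of that idea using CLIP via the `transformers` library; it is not the CLIP-It architecture itself, which adds a language-guided multimodal transformer on top of such frame scores.

```python
# Generic language-guided frame scoring with CLIP (not the CLIP-It model itself).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def score_frames(frame_paths, query, top_k=10):
    """Return indices of the frames most similar to the text query."""
    images = [Image.open(p).convert("RGB") for p in frame_paths]
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)  # one score per frame
    return scores.topk(min(top_k, len(images))).indices.tolist()
```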
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.