GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization
- URL: http://arxiv.org/abs/2104.12465v1
- Date: Mon, 26 Apr 2021 10:50:37 GMT
- Title: GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization
- Authors: Jia-Hong Huang, Luka Murn, Marta Mrak, Marcel Worring
- Abstract summary: The proposed model consists of a contextualized video summary controller, multi-modal attention mechanisms, an interactive attention network, and a video summary generator.
Results show that the proposed model is effective, improving accuracy by +5.88% and F1-score by +4.06% compared with the state-of-the-art method.
- Score: 18.543372365239673
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional video summarization methods generate fixed video representations regardless of user interest, which limits their usefulness in content search and exploration scenarios. Multi-modal video summarization is one approach to addressing this problem. When multi-modal video summarization is used to support video exploration, a user-defined text-based query is treated as one of the main drivers of video summary generation. Effectively encoding both the text-based query and the video is therefore important for this task. In this work, a new method is proposed that uses a specialized attention network and contextualized word representations to tackle the task. The proposed model consists of a contextualized video summary controller, multi-modal attention mechanisms, an interactive attention network, and a video summary generator. Evaluated on the existing multi-modal video summarization benchmark, the proposed model is effective, with increases of +5.88% in accuracy and +4.06% in F1-score compared with the state-of-the-art method.
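To make the architecture concrete, below is a minimal PyTorch sketch of the cross-modal interactive attention idea: query tokens (assumed to be pre-encoded with GPT-2 contextualized representations, dimension 768) are attended to by video frame features, and each frame receives a relevance score. The class name, layer sizes, and overall wiring are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of query-video interactive attention, NOT the authors'
# code. Assumes query tokens are already embedded with GPT-2 contextualized
# representations (dim 768) and frames with a CNN feature extractor (dim 2048).
import torch
import torch.nn as nn

class InteractiveAttentionScorer(nn.Module):
    """Scores video frames by their relevance to a text query (hypothetical)."""
    def __init__(self, query_dim=768, frame_dim=2048, hidden=768, heads=8):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden)   # align modalities
        self.query_proj = nn.Linear(query_dim, hidden)
        # Frames attend to query tokens (cross-modal attention).
        self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.scorer = nn.Linear(hidden, 1)               # per-frame relevance

    def forward(self, frame_feats, query_embs):
        # frame_feats: (B, T, frame_dim); query_embs: (B, L, query_dim)
        f = self.frame_proj(frame_feats)
        q = self.query_proj(query_embs)
        attended, _ = self.cross_attn(query=f, key=q, value=q)
        return self.scorer(attended).squeeze(-1)         # (B, T) frame scores

# Toy usage: 60 frames, 12 query tokens; random tensors stand in for
# real GPT-2 and CNN features.
model = InteractiveAttentionScorer()
scores = model(torch.randn(2, 60, 2048), torch.randn(2, 12, 768))
print(scores.shape)  # torch.Size([2, 60])
```

In the full model, per-frame scores of this kind would feed a summary generator that selects the top-scoring frames or shots for the query-dependent summary.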
Related papers
- Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm.
Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z)
- UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos [52.161513027831646]
We focus on a more comprehensive video summarization task named Bimodal Semantic Summarization of Videos (BiSSV).
We propose a Unified framework UBiSS for the BiSSV task, which models the saliency information in the video and generates a TM-summary and VM-summary simultaneously.
Experiments show that our unified framework achieves better performance than multi-stage summarization pipelines.
arXiv Detail & Related papers (2024-06-24T03:55:25Z)
- V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning [76.26890864487933]
Video summarization aims to create short, accurate, and cohesive summaries of longer videos.
Most existing datasets are created for video-to-video summarization.
Recent efforts have been made to expand from unimodal to multimodal video summarization.
arXiv Detail & Related papers (2024-04-18T17:32:46Z)
- Conditional Modeling Based Automatic Video Summarization [70.96973928590958]
The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story.
Video summarization methods rely on visual factors, such as visual consecutiveness and diversity, which may not suffice to fully capture the content of the video.
A new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries.
arXiv Detail & Related papers (2023-11-20T20:24:45Z)
- Causal Video Summarizer for Video Exploration [74.27487067877047]
Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and query.
Based on the evaluation of the existing multi-modal video summarization dataset, experimental results show that the proposed approach is effective.
arXiv Detail & Related papers (2023-07-04T22:52:16Z)
- Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z)
- Joint Multimedia Event Extraction from Video and Article [51.159034070824056]
We propose the first approach to jointly extract events from video and text articles.
First, we propose the first self-supervised multimodal event coreference model.
Second, we introduce the first multimodal transformer which extracts structured event information jointly from both videos and text documents.
arXiv Detail & Related papers (2021-09-27T03:22:12Z)
- Query-controllable Video Summarization [16.54586273670312]
We introduce a method which takes a text-based query as input and generates a video summary corresponding to it.
Our proposed method consists of a video summary controller, video summary generator, and video summary output module.
arXiv Detail & Related papers (2020-04-07T19:35:04Z)
- Convolutional Hierarchical Attention Network for Query-Focused Video Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes a user's query and a long video as inputs.
We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: feature encoding network and query-relevance computing module.
In the encoding network, we employ a convolutional network with a local self-attention mechanism and a query-aware global attention mechanism to learn the visual information of each shot (a rough sketch follows this entry).
arXiv Detail & Related papers (2020-01-31T04:30:14Z)
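As a rough illustration of the two attention stages CHAN describes, the PyTorch sketch below applies local self-attention within each shot and then a query-aware global attention that weights shots by their relevance to a query vector. The convolutional encoder is omitted for brevity, and all names and dimensions are assumptions rather than the authors' code.

```python
# Hypothetical sketch of CHAN-style encoding, not the authors' implementation:
# local self-attention within shots, then query-aware global weighting.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryAwareEncoder(nn.Module):
    def __init__(self, dim=512, heads=4):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query_key = nn.Linear(dim, dim)

    def forward(self, shot_frames, query_vec):
        # shot_frames: (num_shots, frames_per_shot, dim); query_vec: (dim,)
        # Local self-attention: frames within a shot attend to each other.
        local, _ = self.local_attn(shot_frames, shot_frames, shot_frames)
        shot_repr = local.mean(dim=1)                    # (num_shots, dim)
        # Query-aware global attention: weight shots by query similarity.
        logits = self.query_key(shot_repr) @ query_vec   # (num_shots,)
        weights = F.softmax(logits, dim=0)
        return shot_repr * weights.unsqueeze(-1)         # relevance-weighted

# Toy usage: 10 shots of 16 frames each, with a random query vector.
enc = QueryAwareEncoder()
out = enc(torch.randn(10, 16, 512), torch.randn(512))
print(out.shape)  # torch.Size([10, 512])
```

The weighted shot representations would then feed whatever relevance-computing module produces the final query-focused summary.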