MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization
- URL: http://arxiv.org/abs/2204.08352v3
- Date: Tue, 30 May 2023 09:10:02 GMT
- Title: MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization
- Authors: Wujiang Xu, Runzhong Wang, Xiaobo Guo, Shaoshuai Li, Qiongxu Ma, Yunan Zhao, Sheng Guo, Zhenfeng Zhu, Junchi Yan
- Abstract summary: We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation.
Based on the learned shot-aware representations, MHSCNet can predict the frame-level importance score from both a local and a global view of the video.
- Score: 61.69587867308656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video summarization aims to produce a concise video summary by effectively capturing and combining the most informative parts of the whole content. Existing approaches regard the task as a frame-wise keyframe selection problem and generally construct the frame-wise representation by combining long-range temporal dependency with unimodal or bimodal information. However, an optimal video summary must reflect both the keyframes that are most valuable in their own right and those that carry the semantics of the whole content. It is therefore critical to construct a more powerful and robust frame-wise representation and to predict frame-level importance scores in a fair and comprehensive manner. To tackle these issues, we propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, which enhances the frame-wise representation by combining all available multimodal information. Specifically, we design a hierarchical ShotConv network that produces an adaptive shot-aware frame-level representation capturing both short-range and long-range temporal dependencies. Based on the learned shot-aware representations, MHSCNet predicts the frame-level importance score from both a local and a global view of the video. Extensive experiments on two standard video summarization datasets demonstrate that our proposed method consistently outperforms state-of-the-art baselines. Source code will be made publicly available.
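The abstract suggests two concrete ingredients: parallel convolutions over frame features with short-range and long-range receptive fields, and a head that turns the fused representation into per-frame importance scores. Below is a minimal PyTorch sketch of that idea; the kernel sizes, dilation, feature dimension, and fusion by concatenation are illustrative assumptions, not the published MHSCNet architecture (which additionally performs multimodal fusion and hierarchical shot modeling).

```python
# Hedged sketch of a shot-aware convolutional scorer. Assumes frame features
# were already extracted and fused across modalities upstream; all layer
# choices here are guesses for illustration, not the authors' configuration.
import torch
import torch.nn as nn

class ShotAwareScorer(nn.Module):
    def __init__(self, feat_dim: int = 1024, hidden: int = 256):
        super().__init__()
        # Short-range branch: small kernel models intra-shot context.
        self.short_conv = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)
        # Long-range branch: dilated kernel widens the receptive field
        # toward inter-shot (global) context.
        self.long_conv = nn.Conv1d(feat_dim, hidden, kernel_size=3,
                                   padding=4, dilation=4)
        # Head maps the fused local/global representation to a per-frame score.
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Conv1d(2 * hidden, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim)
        x = frame_feats.transpose(1, 2)            # -> (batch, feat_dim, T)
        fused = torch.cat([self.short_conv(x), self.long_conv(x)], dim=1)
        return self.head(fused).squeeze(1)         # -> (batch, T) scores in [0, 1]

scores = ShotAwareScorer()(torch.randn(2, 120, 1024))  # e.g., a 120-frame video
print(scores.shape)  # torch.Size([2, 120])
```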
Related papers
- UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos [52.161513027831646]
We focus on a more comprehensive video summarization task named Bimodal Semantic Summarization of Videos (BiSSV).
We propose a Unified framework UBiSS for the BiSSV task, which models the saliency information in the video and generates a TM-summary and VM-summary simultaneously.
Experiments show that our unified framework achieves better performance than multi-stage summarization pipelines.
arXiv Detail & Related papers (2024-06-24T03:55:25Z)
- A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video [20.579167394855197]
This paper proposes a practical multimodal video summarization task setting and a dataset to train and evaluate the task.
The target task involves summarizing a given video into a number of keyframe-caption pairs and displaying them in a listable format so that the video content can be grasped quickly.
This task is useful as a practical application and presents a highly challenging problem worthy of study.
arXiv Detail & Related papers (2023-12-04T02:17:14Z)
- Condensing a Sequence to One Informative Frame for Video Recognition [113.3056598548736]
This paper studies a two-step alternative that first condenses the video sequence to an informative "frame".
A valid question is how to define "useful information" and then distill from a sequence down to one synthetic frame.
IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks.
arXiv Detail & Related papers (2022-01-11T16:13:43Z)
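The sequence-to-one-frame idea above can be illustrated by the simplest possible condensation: score each frame, then take a weighted average over time. This toy sketch only demonstrates the interface; the actual IFS method is more sophisticated than a learned weighted mean.

```python
# Toy sketch: collapse a clip (B, T, C, H, W) into one synthetic frame.
# Scoring each frame from its globally pooled content is an assumption
# made for brevity, not the method from the paper.
import torch
import torch.nn as nn

class FrameCondenser(nn.Module):
    def __init__(self, channels: int = 3):
        super().__init__()
        self.score = nn.Linear(channels, 1)  # scores a frame from pooled content

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        pooled = clip.mean(dim=(3, 4))                 # (B, T, C)
        w = torch.softmax(self.score(pooled), dim=1)   # (B, T, 1) frame weights
        # Weighted average collapses the time axis into one synthetic frame.
        return (clip * w[..., None, None]).sum(dim=1)  # (B, C, H, W)

frame = FrameCondenser()(torch.randn(2, 16, 3, 32, 32))
print(frame.shape)  # torch.Size([2, 3, 32, 32])
```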
- DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization [127.16984421969529]
We introduce a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS.
DeepQAMVS is trained with reinforcement learning, incorporating rewards that capture representativeness, diversity, query-adaptability and temporal coherence.
We achieve state-of-the-art results on the MVS1K dataset, with inference time scaling linearly with the number of input video frames.
arXiv Detail & Related papers (2021-05-13T17:33:26Z)
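The reward list above (representativeness, diversity, query-adaptability, temporal coherence) implies a scalar training signal built from several terms. A hedged sketch of such a combination follows; the individual term definitions and the uniform weights are placeholders, not DeepQAMVS's actual formulas.

```python
# Sketch of combining several reward terms into one scalar, as used in
# REINFORCE-style summarization training. All term definitions below are
# illustrative stand-ins, not the paper's reward functions.
import torch

def summary_reward(feats: torch.Tensor, picks: torch.Tensor,
                   query_sim: torch.Tensor, weights=(0.25, 0.25, 0.25, 0.25)):
    """feats: (T, D) frame features; picks: (K,) sorted indices of selected
    frames; query_sim: (T,) per-frame similarity to the text query."""
    sel = feats[picks]                                   # (K, D)
    sim = torch.cosine_similarity(sel[:, None], feats[None], dim=-1)  # (K, T)
    representativeness = sim.max(dim=0).values.mean()    # every frame covered
    pair = torch.cosine_similarity(sel[:, None], sel[None], dim=-1)   # (K, K)
    k = len(picks)
    diversity = 1.0 - (pair.sum() - k) / max(k * k - k, 1)  # off-diagonal mean
    query_adapt = query_sim[picks].mean()                # summary matches query
    # Temporal coherence: penalize large jumps between consecutive picks.
    coherence = 1.0 / (1.0 + picks.diff().float().mean())
    terms = torch.stack([representativeness, diversity, query_adapt, coherence])
    return (torch.tensor(weights) * terms).sum()

r = summary_reward(torch.randn(100, 64), torch.tensor([3, 20, 55, 90]),
                   torch.rand(100))
```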
- Reconstructive Sequence-Graph Network for Video Summarization [107.0328985865372]
Exploiting the inner-shot and inter-shot dependencies is essential for key-shot based video summarization.
We propose a Reconstructive Sequence-Graph Network (RSGN) to encode the frames and shots as sequence and graph hierarchically.
A reconstructor is developed to reward the summary generator, so that the generator can be optimized in an unsupervised manner.
arXiv Detail & Related papers (2021-05-10T01:47:55Z)
- Exploring global diverse attention via pairwise temporal relation for video summarization [84.28263235895798]
We propose an efficient convolutional neural network architecture for video SUMmarization via Global Diverse Attention.
The proposed models can be run in parallel with significantly lower computational cost.
arXiv Detail & Related papers (2020-09-23T06:29:09Z)
- Transforming Multi-Concept Attention into Video Summarization [36.85535624026879]
We propose a novel attention-based framework for video summarization with complex video data.
Our model can be applied to both labeled and unlabeled data, making our method preferable for real-world applications.
arXiv Detail & Related papers (2020-06-02T06:23:50Z)
- SummaryNet: A Multi-Stage Deep Learning Model for Automatic Video Summarisation [0.0]
We introduce SummaryNet as a supervised learning framework for automated video summarisation.
It employs a two-stream convolutional network to learn spatial (appearance) and temporal (motion) representations.
arXiv Detail & Related papers (2020-02-19T18:24:35Z)
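As a point of reference for the two-stream design mentioned above, here is a minimal sketch: one small network over RGB frames (appearance) and one over optical-flow fields (motion), fused into a single per-frame score. The layer choices are illustrative, not SummaryNet's.

```python
# Minimal two-stream sketch: separate appearance (RGB) and motion (optical
# flow) encoders whose pooled features are concatenated and scored. All
# layer sizes are assumptions for illustration only.
import torch
import torch.nn as nn

def make_stream(in_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),          # -> (B, 32)
    )

class TwoStream(nn.Module):
    def __init__(self):
        super().__init__()
        self.spatial = make_stream(3)    # RGB frame: appearance
        self.temporal = make_stream(2)   # x/y optical-flow field: motion
        self.fuse = nn.Linear(64, 1)     # fused per-frame score

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.spatial(rgb), self.temporal(flow)], dim=1)
        return self.fuse(z).squeeze(1)

score = TwoStream()(torch.randn(4, 3, 64, 64), torch.randn(4, 2, 64, 64))
print(score.shape)  # torch.Size([4])
```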