Reconstructive Sequence-Graph Network for Video Summarization
- URL: http://arxiv.org/abs/2105.04066v1
- Date: Mon, 10 May 2021 01:47:55 GMT
- Title: Reconstructive Sequence-Graph Network for Video Summarization
- Authors: Bin Zhao, Haopeng Li, Xiaoqiang Lu, Xuelong Li
- Abstract summary: Exploiting the inner-shot and inter-shot dependencies is essential for key-shot based video summarization.
We propose a Reconstructive Sequence-Graph Network (RSGN) to encode the frames and shots as sequence and graph hierarchically.
A reconstructor is developed to reward the summary generator, so that the generator can be optimized in an unsupervised manner.
- Score: 107.0328985865372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Exploiting the inner-shot and inter-shot dependencies is essential for
key-shot based video summarization. Current approaches are mainly devoted to
modeling the video as a frame sequence with recurrent neural networks. However,
one potential limitation of such sequence models is that they capture local
neighborhood dependencies well while long-range, high-order dependencies are
not fully exploited. In general, the frames in each shot record a
certain activity and vary smoothly over time, but the multi-hop relationships
occur frequently among shots. In this case, both the local and global
dependencies are important for understanding the video content. Motivated by
this point, we propose a Reconstructive Sequence-Graph Network (RSGN) to encode
the frames and shots as sequence and graph hierarchically, where the
frame-level dependencies are encoded by a Long Short-Term Memory (LSTM) network,
and the shot-level dependencies are captured by a Graph Convolutional Network (GCN).
Then, the videos are summarized by exploiting both the local and global
dependencies among shots. Besides, a reconstructor is developed to reward the
summary generator, so that the generator can be optimized in an unsupervised
manner, which sidesteps the scarcity of annotated data in video summarization.
Furthermore, under the guidance of reconstruction loss, the predicted summary
can better preserve the main video content and shot-level dependencies.
In practice, experimental results on three popular datasets (i.e., SumMe,
TVSum and VTW) demonstrate the superiority of the proposed approach on the
summarization task.
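The paper's implementation is not reproduced here, but the pipeline the abstract describes (an LSTM over the frames of each shot, a graph convolution over shot embeddings, and a reconstructor whose loss rewards the summary generator) can be sketched compactly. Below is a minimal PyTorch illustration; the class name RSGNSketch, the dimensions, and the similarity-based shot adjacency are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RSGNSketch(nn.Module):
    """Minimal sketch of a sequence-graph summarizer: LSTM within shots,
    GCN across shots, and a reconstruction loss as unsupervised signal."""

    def __init__(self, frame_dim=1024, hidden_dim=256):
        super().__init__()
        # Frame-level sequence model, shared across shots.
        self.frame_lstm = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        # One graph-convolution layer over shot embeddings.
        self.gcn = nn.Linear(hidden_dim, hidden_dim)
        # Per-shot importance score.
        self.scorer = nn.Linear(hidden_dim, 1)
        # Reconstructor: rebuilds shot embeddings from the weighted summary.
        self.reconstructor = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, shots):
        # shots: list of (num_frames_i, frame_dim) tensors, one per shot.
        shot_embs = []
        for frames in shots:
            _, (h, _) = self.frame_lstm(frames.unsqueeze(0))
            shot_embs.append(h[-1].squeeze(0))   # final hidden state of the shot
        x = torch.stack(shot_embs)               # (num_shots, hidden_dim)

        # Dense shot graph from pairwise similarity, row-normalized; this is
        # where multi-hop, long-range shot dependencies enter the model.
        adj = F.softmax(x @ x.t() / x.size(1) ** 0.5, dim=-1)
        x = F.relu(self.gcn(adj @ x))            # one propagation step

        scores = torch.sigmoid(self.scorer(x)).squeeze(-1)   # (num_shots,)

        # The score-weighted summary must reconstruct all shot embeddings,
        # giving a training signal that needs no human annotations.
        summary = scores.unsqueeze(-1) * x
        recon, _ = self.reconstructor(summary.unsqueeze(0))
        recon_loss = F.mse_loss(recon.squeeze(0), x.detach())
        return scores, recon_loss

model = RSGNSketch()
shots = [torch.randn(30, 1024), torch.randn(24, 1024), torch.randn(41, 1024)]
scores, loss = model(shots)   # scores: one importance value per shot
```

In a full system the scores would be turned into a key-shot summary (commonly via knapsack selection under a length budget), and the reconstruction loss would be minimized directly or used as a reward for policy-gradient training, as the abstract suggests.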
Related papers
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video (a minimal consistency-loss sketch appears after this list).
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation.
Based on the learned shot-aware representations, MHSCNet can predict the frame-level importance score from both the local and global views of the video.
arXiv Detail & Related papers (2022-04-18T14:53:33Z)
- Recurrence-in-Recurrence Networks for Video Deblurring [58.49075799159015]
State-of-the-art video deblurring methods often adopt recurrent neural networks to model the temporal dependency between the frames.
In this paper, we propose a recurrence-in-recurrence network architecture to cope with the limitations of short-range memory.
arXiv Detail & Related papers (2022-03-12T11:58:13Z)
- Video Is Graph: Structured Graph Module for Video Action Recognition [34.918667614077805]
We transform a video sequence into a graph to obtain direct long-term dependencies among temporal frames.
In particular, SGM divides the neighbors of each node into several temporal regions so as to extract global structural information.
The reported performance and analysis demonstrate that SGM can achieve outstanding precision with less computational complexity.
arXiv Detail & Related papers (2021-10-12T11:27:29Z)
- Unsupervised Video Summarization with a Convolutional Attentive Adversarial Network [32.90753137435032]
We propose a convolutional attentive adversarial network (CAAN) to build a deep summarizer in an unsupervised way.
Specifically, the generator employs a fully convolutional sequence network to extract a global representation of the video, and an attention-based network to output normalized importance scores.
The results show the superiority of our proposed method against other state-of-the-art unsupervised approaches.
arXiv Detail & Related papers (2021-05-24T07:24:39Z)
- SumGraph: Video Summarization via Recursive Graph Modeling [59.01856443537622]
We propose recursive graph modeling networks for video summarization, termed SumGraph, to represent a relation graph.
We achieve state-of-the-art performance on several benchmarks for video summarization in both supervised and unsupervised manners.
arXiv Detail & Related papers (2020-07-17T08:11:30Z)
- Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks [150.5425122989146]
This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS).
AGNN builds a fully connected graph to efficiently represent frames as nodes, and relations between arbitrary frame pairs as edges (see the frame-graph sketch after this list).
Experimental results on three video segmentation datasets show that AGNN sets a new state-of-the-art in each case.
arXiv Detail & Related papers (2020-01-19T10:45:27Z)
- Cut-Based Graph Learning Networks to Discover Compositional Structure of Sequential Video Data [29.841574293529796]
We propose Cut-Based Graph Learning Networks (CB-GLNs) for learning video data by discovering complex structures of the video.
CB-GLNs represent video data as a graph, with nodes and edges corresponding to frames of the video and their dependencies respectively.
We evaluate the proposed method on two different video understanding tasks: video theme classification (YouTube-8M dataset) and video question answering (TVQA dataset).
arXiv Detail & Related papers (2020-01-17T10:09:24Z)
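Two mechanisms recurring in the list above are concrete enough to sketch. First, the self-supervised consistency loss referenced in the Transform-Equivariant Consistency Learning entry: the temporal boundary predicted on a transformed video should agree with the boundary predicted on the original once mapped back. The snippet below is a hypothetical illustration using a temporal crop as the transform; the function name, crop parameterization, and smooth-L1 loss are assumptions, not that paper's implementation.

```python
import torch
import torch.nn.functional as F

def boundary_consistency_loss(pred_orig, pred_aug, crop_start, crop_scale):
    """pred_*: (start, end) boundaries in [0, 1] from a grounding model.
    A boundary t predicted on the cropped clip corresponds to
    crop_start + t * crop_scale in the original video's timeline."""
    mapped = crop_start + pred_aug * crop_scale   # map back to original timeline
    return F.smooth_l1_loss(mapped, pred_orig)

# Example: (0.40, 0.60) predicted on the full video, (0.30, 0.70) on a clip
# covering its middle half (crop_start=0.25, crop_scale=0.5).
loss = boundary_consistency_loss(
    torch.tensor([0.40, 0.60]),
    torch.tensor([0.30, 0.70]),
    crop_start=0.25,
    crop_scale=0.5,
)
print(loss.item())  # 0.0: the two predictions are consistent under the crop
```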
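Second, the frame-as-node graph construction shared, with different edge schemes, by AGNN, SGM, and CB-GLNs above. In the fully connected variant, every frame pair gets an affinity-weighted edge, so arbitrary frames exchange information in one message-passing hop. The sketch below assumes fixed scaled dot-product affinities; the papers above learn their edge functions, so this shows only the structural idea.

```python
import torch
import torch.nn.functional as F

def frame_graph_message_pass(frame_feats):
    """frame_feats: (num_frames, dim) features, e.g. from a CNN backbone.
    Returns features after one round of affinity-weighted message passing
    over the fully connected frame graph."""
    d = frame_feats.size(1)
    # Edge weights: scaled dot-product affinity for every frame pair.
    affinity = frame_feats @ frame_feats.t() / d ** 0.5   # (N, N)
    edges = F.softmax(affinity, dim=-1)                   # row-normalized
    # One hop aggregates from all frames at once, which is what gives the
    # graph view direct long-range dependencies that an RNN lacks.
    return edges @ frame_feats

feats = torch.randn(8, 512)              # 8 frames, 512-d features
updated = frame_graph_message_pass(feats)
print(updated.shape)                      # torch.Size([8, 512])
```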