SumGraph: Video Summarization via Recursive Graph Modeling
- URL: http://arxiv.org/abs/2007.08809v1
- Date: Fri, 17 Jul 2020 08:11:30 GMT
- Title: SumGraph: Video Summarization via Recursive Graph Modeling
- Authors: Jungin Park, Jiyoung Lee, Ig-Jae Kim, and Kwanghoon Sohn
- Abstract summary: We propose graph modeling networks for video summarization, termed SumGraph, to represent a relation graph.
We achieve state-of-the-art performance on several benchmarks for video summarization in both supervised and unsupervised manners.
- Score: 59.01856443537622
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of video summarization is to select keyframes that are visually
diverse and can represent a whole story of an input video. State-of-the-art
approaches for video summarization have mostly regarded the task as a
frame-wise keyframe selection problem by aggregating all frames with equal
weight. However, to find informative parts of the video, it is necessary to
consider how all the frames of the video are related to each other. To this
end, we cast video summarization as a graph modeling problem. We propose
recursive graph modeling networks for video summarization, termed SumGraph, to
represent a relation graph, where frames are regarded as nodes and nodes are
connected by semantic relationships among frames. Our networks accomplish this
through a recursive approach that refines an initially estimated graph, reasoning
over the graph representation via graph convolutional networks to correctly
classify each node as a keyframe. To leverage SumGraph in a more
practical environment, we also present a way to adapt our graph modeling in an
unsupervised fashion. With SumGraph, we achieved state-of-the-art performance
on several benchmarks for video summarization in both supervised and
unsupervised manners.
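The recursive refinement described in the abstract can be sketched, purely as an illustration, as alternating between estimating a frame-affinity graph and propagating node features through a graph convolutional layer. All function names, the cosine-affinity choice, and the ReLU-GCN layer below are assumptions for this sketch, not the paper's actual implementation:

```python
import numpy as np

def cosine_affinity(features):
    """Nonnegative cosine-similarity graph over frames (assumed affinity)."""
    norm = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    return np.maximum(norm @ norm.T, 0.0)

def gcn_layer(adj, x, weight):
    """One GCN layer with symmetric normalization: ReLU(D^-1/2 (A+I) D^-1/2 X W)."""
    a = adj + np.eye(adj.shape[0])
    d = np.diag(1.0 / np.sqrt(a.sum(axis=1)))
    return np.maximum(d @ a @ d @ x @ weight, 0.0)

def recursive_refine(features, weights, steps=3):
    """Recursively re-estimate the frame graph from GCN-refined features."""
    x = features
    adj = cosine_affinity(x)
    for w in weights[:steps]:
        x = gcn_layer(adj, x, w)
        adj = cosine_affinity(x)  # refine the relation graph from new features
    return adj, x
```

In the paper's supervised setting, a classification head on the refined node features would score each frame as keyframe or not; this sketch stops at the refined graph and features.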
Related papers
- VideoSAGE: Video Summarization with Graph Representation Learning [9.21019970479227]
We propose a graph-based representation learning framework for video summarization.
A graph constructed this way aims to capture long-range interactions among video frames, and the sparsity ensures the model trains without hitting the memory and compute bottleneck.
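The memory-saving sparsity mentioned in the summary above can be illustrated with a simple top-k neighbor graph, which keeps only a fixed number of edges per frame. This is a hypothetical sketch of the sparsity idea; VideoSAGE's actual graph construction may differ:

```python
import numpy as np

def knn_sparse_graph(features, k=4):
    """Keep only the k most similar neighbors per frame (illustrative only)."""
    sim = features @ features.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-loops
    adj = np.zeros_like(sim)
    idx = np.argsort(sim, axis=1)[:, -k:]   # indices of the k nearest neighbors
    rows = np.arange(sim.shape[0])[:, None]
    adj[rows, idx] = 1.0                    # k edges per row instead of N
    return adj
```

With k fixed, the edge count grows linearly in the number of frames rather than quadratically, which is the bottleneck a fully connected frame graph would hit on long videos.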
arXiv Detail & Related papers (2024-04-14T15:49:02Z) - Multi-Granularity Graph Pooling for Video-based Person Re-Identification [14.943835935921296]
Graph neural networks (GNNs) are introduced to aggregate temporal and spatial features of video samples.
Existing graph-based models, like STGCN, perform mean/max pooling on node features to obtain the graph representation.
We propose the graph pooling network (GPNet) to learn a multi-granularity graph representation for video retrieval.
arXiv Detail & Related papers (2022-09-23T13:26:05Z) - Edge but not Least: Cross-View Graph Pooling [76.71497833616024]
This paper presents a cross-view graph pooling (Co-Pooling) method to better exploit crucial graph structure information.
Through cross-view interaction, edge-view pooling and node-view pooling seamlessly reinforce each other to learn more informative graph-level representations.
arXiv Detail & Related papers (2021-09-24T08:01:23Z) - Reconstructive Sequence-Graph Network for Video Summarization [107.0328985865372]
Exploiting the inner-shot and inter-shot dependencies is essential for key-shot based video summarization.
We propose a Reconstructive Sequence-Graph Network (RSGN) to encode the frames and shots as sequence and graph hierarchically.
A reconstructor is developed to reward the summary generator, so that the generator can be optimized in an unsupervised manner.
arXiv Detail & Related papers (2021-05-10T01:47:55Z) - Accurate Learning of Graph Representations with Graph Multiset Pooling [45.72542969364438]
We propose a Graph Multiset Transformer (GMT) that captures the interaction between nodes according to their structural dependencies.
Our experimental results show that GMT significantly outperforms state-of-the-art graph pooling methods on graph classification benchmarks.
arXiv Detail & Related papers (2021-02-23T07:45:58Z) - Multilevel Graph Matching Networks for Deep Graph Similarity Learning [79.3213351477689]
We propose a multi-level graph matching network (MGMN) framework for computing the graph similarity between any pair of graph-structured objects.
To compensate for the lack of standard benchmark datasets, we have created and collected a set of datasets for both the graph-graph classification and graph-graph regression tasks.
Comprehensive experiments demonstrate that MGMN consistently outperforms state-of-the-art baseline models on both the graph-graph classification and graph-graph regression tasks.
arXiv Detail & Related papers (2020-07-08T19:48:19Z) - Comprehensive Information Integration Modeling Framework for Video Titling [124.11296128308396]
We integrate comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end modeling framework.
To tackle this issue, the proposed method consists of two processes, i.e., granular-level interaction modeling and abstraction-level story-line summarization.
We collect a large-scale dataset accordingly from real-world data in Taobao, a world-leading e-commerce platform.
arXiv Detail & Related papers (2020-06-24T10:38:15Z) - Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks [150.5425122989146]
This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS).
AGNN builds a fully connected graph to efficiently represent frames as nodes, and relations between arbitrary frame pairs as edges.
Experimental results on three video segmentation datasets show that AGNN sets a new state-of-the-art in each case.
arXiv Detail & Related papers (2020-01-19T10:45:27Z) - Cut-Based Graph Learning Networks to Discover Compositional Structure of Sequential Video Data [29.841574293529796]
We propose Cut-Based Graph Learning Networks (CB-GLNs) for learning video data by discovering complex structures of the video.
CB-GLNs represent video data as a graph, with nodes and edges corresponding to frames of the video and their dependencies respectively.
We evaluate the proposed method on two different tasks for video understanding: video theme classification (YouTube-8M dataset) and video question answering (TVQA dataset).
arXiv Detail & Related papers (2020-01-17T10:09:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.