Unsupervised Video Summarization with a Convolutional Attentive Adversarial Network
- URL: http://arxiv.org/abs/2105.11131v1
- Date: Mon, 24 May 2021 07:24:39 GMT
- Title: Unsupervised Video Summarization with a Convolutional Attentive Adversarial Network
- Authors: Guoqiang Liang, Yanbing Lv, Shucheng Li, Shizhou Zhang, Yanning Zhang
- Abstract summary: We propose a convolutional attentive adversarial network (CAAN) to build a deep summarizer in an unsupervised way.
Specifically, the generator employs a fully convolutional sequence network to extract a global representation of the video, and an attention-based network to output normalized importance scores.
The results show the superiority of our method over other state-of-the-art unsupervised approaches.
- Score: 32.90753137435032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the explosive growth of video data, video summarization, which
seeks a minimal subset of frames that still conveys the main story, has become
one of the most active research topics. Substantial progress has been made with
supervised learning techniques, especially since the emergence of deep
learning. However, human annotations for large-scale video datasets are
extremely expensive and difficult to collect. To address this problem, we
propose a convolutional attentive adversarial network (CAAN), whose key idea is
to build a deep summarizer in an unsupervised way. Built upon the generative
adversarial network, our overall framework consists of a generator and a
discriminator. The former predicts importance scores for all frames of a video,
while the latter tries to distinguish the score-weighted frame features from
the original frame features. Specifically, the generator employs a fully
convolutional sequence network to extract a global representation of the video,
and an attention-based network to output normalized importance scores. The
objective function combines three loss functions, which jointly guide the
frame-level importance score prediction. To validate the proposed method, we
have conducted extensive experiments on two public benchmarks, SumMe and TVSum.
The results show that our method outperforms other state-of-the-art
unsupervised approaches, and even some published supervised approaches.
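The generator/discriminator setup described in the abstract can be illustrated with a minimal PyTorch sketch: the generator scores frames via a convolutional sequence network plus attention, and the discriminator tries to tell score-weighted frame features from the originals. All module sizes, layer choices, and the training fragment below are illustrative assumptions, not the paper's exact CAAN architecture, and only the adversarial term of the paper's three-part objective is shown.

```python
# Minimal sketch of the adversarial summarization idea (assumed details).
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Predicts a normalized importance score for every frame."""
    def __init__(self, feat_dim=1024, hidden=256, heads=4):
        super().__init__()
        # Fully convolutional sequence network over the temporal axis,
        # capturing a global representation of the video.
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
        )
        # Attention over frames, then a per-frame scoring head.
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.score = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, feats):  # feats: (B, T, feat_dim) CNN frame features
        h = self.conv(feats.transpose(1, 2)).transpose(1, 2)  # (B, T, hidden)
        h, _ = self.attn(h, h, h)
        return self.score(h).squeeze(-1)  # (B, T) scores in [0, 1]

class Discriminator(nn.Module):
    """Separates score-weighted frame features from original ones."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, 1)

    def forward(self, feats):  # feats: (B, T, feat_dim)
        _, h = self.rnn(feats)
        return self.cls(h[-1])  # (B, 1) logit: original vs. weighted

# One adversarial step on random stand-in features:
feats = torch.randn(2, 120, 1024)
G, D = Generator(), Discriminator()
scores = G(feats)                            # frame-level importance
weighted = feats * scores.unsqueeze(-1)      # score-weighted features
bce = nn.BCEWithLogitsLoss()
d_loss = (bce(D(feats), torch.ones(2, 1))
          + bce(D(weighted.detach()), torch.zeros(2, 1)))
g_loss = bce(D(weighted), torch.ones(2, 1))  # generator fools the discriminator
```

In an actual training loop, d_loss and g_loss would be minimized alternately with separate optimizers, and the remaining terms of the paper's three-part objective would be added to the generator's loss.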
Related papers
- Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting [87.11995635760108]
Key to action counting is accurately locating each video's repetitive actions.
We propose a dual-branch network, i.e., SkimFocusNet, working in a two-step manner.
arXiv Detail & Related papers (2024-06-13T05:15:52Z)
- Conditional Modeling Based Automatic Video Summarization [70.96973928590958]
The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story.
Video summarization methods rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully understand the content of the video.
A new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries.
arXiv Detail & Related papers (2023-11-20T20:24:45Z)
- Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization [27.312423653997087]
Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing.
We propose three metrics characterizing a desirable keyframe: local dissimilarity, global consistency, and uniqueness.
We show that by refining the pre-trained features with a lightweight contrastively learned projection module, the frame-level importance scores can be further improved.
arXiv Detail & Related papers (2022-11-18T07:01:28Z)
- MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation.
Based on the learned shot-aware representations, MHSCNet can predict the frame-level importance score in the local and global view of the video.
arXiv Detail & Related papers (2022-04-18T14:53:33Z)
- Adversarial Memory Networks for Action Prediction [95.09968654228372]
Action prediction aims to infer a forthcoming human action from a partially observed video.
We propose adversarial memory networks (AMemNet) to generate the "full video" feature conditioned on a partial video query.
arXiv Detail & Related papers (2021-12-18T08:16:21Z)
- Unsupervised Video Summarization via Multi-source Features [4.387757291346397]
Video summarization aims at generating a compact yet representative visual summary that conveys the essence of the original video.
We propose the incorporation of multiple feature sources with chunk and stride fusion to provide more information about the visual content.
For a comprehensive evaluation on the two benchmarks TVSum and SumMe, we compare our method with four state-of-the-art approaches.
arXiv Detail & Related papers (2021-05-26T13:12:46Z)
- Efficient Video Summarization Framework using EEG and Eye-tracking Signals [0.92246583941469]
This paper proposes an efficient video summarization framework that will give a gist of the entire video in a few key-frames or video skims.
To understand human attention behavior, we have designed and performed experiments with human participants using electroencephalogram (EEG) and eye-tracking technology.
Using our approach, a video is condensed by 96.5% while maintaining high precision and recall.
arXiv Detail & Related papers (2021-01-27T08:13:19Z)
- Convolutional Hierarchical Attention Network for Query-Focused Video Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes a user's query and a long video as inputs.
We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: feature encoding network and query-relevance computing module.
In the encoding network, we employ a convolutional network with a local self-attention mechanism and a query-aware global attention mechanism to learn the visual information of each shot.
arXiv Detail & Related papers (2020-01-31T04:30:14Z)
- See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks [184.4379622593225]
We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task.
We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism.
We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos.
arXiv Detail & Related papers (2020-01-19T11:10:39Z)