Learning to Summarize Videos by Contrasting Clips
- URL: http://arxiv.org/abs/2301.05213v3
- Date: Wed, 19 Apr 2023 12:09:12 GMT
- Title: Learning to Summarize Videos by Contrasting Clips
- Authors: Ivan Sosnovik, Artem Moskalev, Cees Kaandorp, Arnold Smeulders
- Abstract summary: Video summarization aims at choosing parts of a video that narrate a story as close as possible to the original one.
Most existing video summarization approaches rely on hand-crafted labels.
We propose contrastive learning as a way to learn meaningful summaries without labeled annotations.
- Score: 1.3999481573773074
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video summarization aims at choosing parts of a video that narrate a story as
close as possible to the original one. Most of the existing video summarization
approaches rely on hand-crafted labels. As the number of videos grows
exponentially, there is an increasing need for methods that can learn
meaningful summarizations without labeled annotations. In this paper, we aim to
maximally exploit unsupervised video summarization while concentrating the
supervision on a few, personalized labels as an add-on. To do so, we formulate
the key requirements for informative video summarization. Then, we propose
contrastive learning as the answer to both requirements. To further boost
Contrastive video Summarization (CSUM), we propose to contrast top-k features
instead of a mean video feature as employed by existing methods, which we
implement with a differentiable top-k feature selector. Our experiments on
several benchmarks demonstrate that our approach allows for meaningful and
diverse summaries when no labeled data is provided.
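The abstract's key implementation detail is aggregating the top-k clip features (rather than a mean feature) with a differentiable top-k selector before contrasting videos. The sketch below is a minimal illustration, not the authors' implementation: it assumes per-clip feature vectors and importance scores are already available, and it stands in a simple iterative-softmax relaxation of top-k and an InfoNCE-style loss for the paper's selector and objective; all function names and temperature values are assumptions.

```python
# Minimal sketch of contrasting soft top-k clip features (not the authors' code).
import torch
import torch.nn.functional as F


def soft_topk_weights(scores: torch.Tensor, k: int, temperature: float = 0.1) -> torch.Tensor:
    """Differentiable relaxation of top-k selection over clip scores (shape: [num_clips])."""
    weights = torch.zeros_like(scores)
    masked = scores.clone()
    for _ in range(k):
        p = F.softmax(masked / temperature, dim=0)  # peaks at the current maximum
        weights = weights + p
        masked = masked - 1e4 * p                   # softly suppress the clip just picked
    return weights / k


def summarize_video(clip_feats: torch.Tensor, scores: torch.Tensor, k: int) -> torch.Tensor:
    """Aggregate clip features ([num_clips, dim]) with soft top-k weights instead of a mean."""
    w = soft_topk_weights(scores, k)
    return (w.unsqueeze(-1) * clip_feats).sum(dim=0)


def contrastive_loss(summary: torch.Tensor, positive: torch.Tensor,
                     negatives: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: pull the summary embedding ([dim]) toward its own video
    embedding (positive, [dim]) and away from other videos in the batch (negatives, [n, dim])."""
    summary, positive = F.normalize(summary, dim=-1), F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos = (summary * positive).sum(-1, keepdim=True) / tau  # [1]
    neg = negatives @ summary / tau                         # [n]
    logits = torch.cat([pos, neg]).unsqueeze(0)             # the positive is class 0
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))
```

Because both the top-k relaxation and the aggregation are differentiable, the clip scores can be learned end-to-end from the contrastive objective alone, which is what allows training without labeled summaries.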
Related papers
- Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos [58.13927287437394]
We present a new multi-shot video understanding benchmark Shot2Story20K with detailed shot-level captions and comprehensive video summaries.
Preliminary experiments show some challenges to generate a long and comprehensive video summary.
arXiv Detail & Related papers (2023-12-16T03:17:30Z)
- Conditional Modeling Based Automatic Video Summarization [70.96973928590958]
The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story.
Video summarization methods rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully understand the content of the video.
A new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries.
arXiv Detail & Related papers (2023-11-20T20:24:45Z)
- Causal Video Summarizer for Video Exploration [74.27487067877047]
Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and query.
Experimental results on an existing multi-modal video summarization dataset show that the proposed approach is effective.
arXiv Detail & Related papers (2023-07-04T22:52:16Z)
- Query-based Video Summarization with Pseudo Label Supervision [19.229722872058055]
Existing manually labelled datasets for query-based video summarization are costly to create and therefore small.
Self-supervision can address the data sparsity challenge by using a pretext task and defining a method to acquire extra data with pseudo labels.
Experimental results show that the proposed video summarization algorithm achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-04T22:28:17Z)
- VideoXum: Cross-modal Visual and Textural Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
- Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z)
- CLIP-It! Language-Guided Video Summarization [96.69415453447166]
This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization.
We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another.
Our model can be extended to the unsupervised setting by training without ground-truth supervision.
arXiv Detail & Related papers (2021-07-01T17:59:27Z)
- Straight to the Point: Fast-forwarding Videos via Reinforcement Learning Using Textual Data [1.004766879203303]
We present a novel methodology based on a reinforcement learning formulation to accelerate instructional videos.
Our approach can adaptively remove frames that are not relevant to conveying the information, without creating gaps in the final video.
We propose a novel network, called Visually-guided Document Attention Network (VDAN), able to generate a highly discriminative embedding space.
arXiv Detail & Related papers (2020-03-31T14:07:45Z)