Query-based Video Summarization with Pseudo Label Supervision
- URL: http://arxiv.org/abs/2307.01945v1
- Date: Tue, 4 Jul 2023 22:28:17 GMT
- Title: Query-based Video Summarization with Pseudo Label Supervision
- Authors: Jia-Hong Huang, Luka Murn, Marta Mrak, Marcel Worring
- Abstract summary: Existing datasets for manually labelled query-based video summarization are costly and thus small.
Self-supervision can address the data sparsity challenge by using a pretext task and defining a method to acquire extra data with pseudo labels.
Experimental results show that the proposed video summarization algorithm achieves state-of-the-art performance.
- Score: 19.229722872058055
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing datasets for manually labelled query-based video summarization are
costly and thus small, limiting the performance of supervised deep video
summarization models. Self-supervision can address the data sparsity challenge
by using a pretext task and defining a method to acquire extra data with pseudo
labels to pre-train a supervised deep model. In this work, we introduce
segment-level pseudo labels from input videos to properly model both the
relationship between a pretext task and a target task, and the implicit
relationship between the pseudo label and the human-defined label. The pseudo
labels are generated based on existing human-defined frame-level labels. To
create more accurate query-dependent video summaries, a semantics booster is
proposed to generate context-aware query representations. Furthermore, we
propose mutual attention to help capture the interactive information between
visual and textual modalities. Three commonly-used video summarization
benchmarks are used to thoroughly validate the proposed approach. Experimental
results show that the proposed video summarization algorithm achieves
state-of-the-art performance.
Related papers
- Your Interest, Your Summaries: Query-Focused Long Video Summarization [0.6041235048439966]
This paper introduces an approach for query-focused video summarization, aiming to align video summaries closely with user queries.
We propose the Fully Convolutional Sequence Network with Attention (FCSNA-QFVS), a novel approach designed for this task.
arXiv Detail & Related papers (2024-10-17T23:37:58Z)
- Personalized Video Summarization using Text-Based Queries and Conditional Modeling [3.4447129363520337]
This thesis explores enhancing video summarization by integrating text-based queries and conditional modeling.
Evaluation metrics such as accuracy and F1-score assess the quality of the generated summaries.
arXiv Detail & Related papers (2024-08-27T02:43:40Z)
- Weakly Supervised Video Individual Counting [126.75545291243142]
Video Individual Counting aims to predict the number of unique individuals in a single video.
We introduce a weakly supervised VIC task, wherein trajectory labels are not provided.
In doing so, we devise an end-to-end trainable soft contrastive loss to drive the network to distinguish inflow, outflow, and the remaining.
arXiv Detail & Related papers (2023-12-10T16:12:13Z)
- VideoXum: Cross-modal Visual and Textural Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z)
- Learning to Summarize Videos by Contrasting Clips [1.3999481573773074]
Video summarization aims at choosing parts of a video that narrate a story as close as possible to the original one.
Most of the existing video summarization approaches focus on hand-crafted labels.
We propose contrastive learning as the answer to both questions.
arXiv Detail & Related papers (2023-01-12T18:55:30Z)
- Text Summarization with Oracle Expectation [88.39032981994535]
Extractive summarization produces summaries by identifying and concatenating the most important sentences in a document.
Most summarization datasets do not come with gold labels indicating whether document sentences are summary-worthy.
We propose a simple yet effective labeling algorithm that creates soft, expectation-based sentence labels.
arXiv Detail & Related papers (2022-09-26T14:10:08Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
- Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z)
- CycAs: Self-supervised Cycle Association for Learning Re-identifiable Descriptions [61.724894233252414]
This paper proposes a self-supervised learning method for the person re-identification (re-ID) problem.
Existing unsupervised methods usually rely on pseudo labels, such as those from video tracklets or clustering.
We introduce a different unsupervised method that allows us to learn pedestrian embeddings from raw videos, without resorting to pseudo labels.
arXiv Detail & Related papers (2020-07-15T09:52:35Z)
- Labelling unlabelled videos from scratch with multi-modal self-supervision [82.60652426371936]
Unsupervised labelling of a video dataset does not come for free from strong feature encoders.
We propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations.
An extensive analysis shows that the resulting clusters have high semantic overlap to ground truth human labels.
arXiv Detail & Related papers (2020-06-24T12:28:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.