SELF-VS: Self-supervised Encoding Learning For Video Summarization
- URL: http://arxiv.org/abs/2303.15993v1
- Date: Tue, 28 Mar 2023 14:08:05 GMT
- Title: SELF-VS: Self-supervised Encoding Learning For Video Summarization
- Authors: Hojjat Mokhtarabadi, Kave Bahraman, Mehrdad HosseinZadeh, Mahdi Eftekhari
- Abstract summary: We propose a novel self-supervised video representation learning method using knowledge distillation to pre-train a transformer encoder.
Our method matches its semantic video representation, which is constructed with respect to frame importance scores, to a representation derived from a CNN trained on video classification.
- Score: 6.21295508577576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite its wide range of applications, video summarization is still held
back by the scarcity of extensive datasets, largely due to the labor-intensive
and costly nature of frame-level annotations. As a result, existing video
summarization methods are prone to overfitting. To mitigate this challenge, we
propose a novel self-supervised video representation learning method using
knowledge distillation to pre-train a transformer encoder. Our method matches
its semantic video representation, which is constructed with respect to frame
importance scores, to a representation derived from a CNN trained on video
classification. Empirical evaluations on correlation-based metrics, such as
Kendall's $\tau$ and Spearman's $\rho$, demonstrate the superiority of our
approach compared to existing state-of-the-art methods in assigning relative
scores to the input frames.
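As a rough illustration of the objective described above, the following PyTorch sketch matches an importance-weighted pooling of transformer frame features to a CNN-derived video representation, and evaluates predicted scores with the rank-correlation metrics the abstract names. The module names, pooling scheme, and cosine loss are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch: importance-weighted video representation matched to a
# CNN (teacher) representation, as an assumed reading of the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F
from scipy.stats import kendalltau, spearmanr

class FrameScorer(nn.Module):
    """Toy transformer encoder emitting per-frame features and importance scores."""
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, frames):                       # frames: (B, T, dim) CNN features
        feats = self.encoder(frames)                 # contextualized frame features
        scores = self.score_head(feats).squeeze(-1)  # (B, T) importance logits
        return feats, scores

def matching_loss(feats, scores, teacher_repr):
    """Pool frame features by importance, then match the CNN video representation
    (cosine distance is an assumption, not the paper's stated loss)."""
    weights = scores.softmax(dim=-1).unsqueeze(-1)   # (B, T, 1)
    student_repr = (weights * feats).sum(dim=1)      # (B, dim) semantic video repr.
    return 1.0 - F.cosine_similarity(student_repr, teacher_repr, dim=-1).mean()

model = FrameScorer()
feats, scores = model(torch.randn(2, 64, 512))       # 2 clips of 64 frames
loss = matching_loss(feats, scores, teacher_repr=torch.randn(2, 512))
loss.backward()

# Rank-correlation evaluation, as reported in the abstract:
pred = scores[0].detach().numpy()
gt = torch.rand(64).numpy()                          # placeholder annotations
tau, _ = kendalltau(pred, gt)
rho, _ = spearmanr(pred, gt)
```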
Related papers
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
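A minimal sketch of what CLIP-score-guided frame sampling could look like, using the public `openai/clip-vit-base-patch32` checkpoint from Hugging Face `transformers`; the text query and top-k policy are assumptions, not VaQuitA's actual pipeline.

```python
# Assumed sketch of CLIP-score-ranked frame sampling (not the authors' code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def sample_frames(frames, query, k=8):
    """Rank frames by CLIP image-text similarity to `query`; keep the top k."""
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    scores = out.logits_per_image.squeeze(-1)        # (num_frames,) similarity logits
    top = scores.topk(min(k, len(frames))).indices.sort().values  # keep temporal order
    return [frames[i] for i in top]

# Toy usage with blank images standing in for decoded video frames.
frames = [Image.new("RGB", (224, 224)) for _ in range(32)]
keyframes = sample_frames(frames, "a person cooking in a kitchen")
```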
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Conditional Modeling Based Automatic Video Summarization [70.96973928590958]
The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story.
Video summarization methods rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully understand the content of the video.
A new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries.
arXiv Detail & Related papers (2023-11-20T20:24:45Z)
- Masked Autoencoder for Unsupervised Video Summarization [10.853922245706716]
Self-supervised learning (SSL) is acknowledged for its robustness and flexibility to multiple downstream tasks.
We claim an unsupervised autoencoder with sufficient self-supervised learning does not need any extra downstream architecture design or fine-tuning weights to be utilized as a video summarization model.
We evaluate the method in major unsupervised video summarization benchmarks to show its effectiveness under various experimental settings.
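A toy sketch of masked autoencoding over per-frame features, under the assumption that reconstruction error on masked frames can stand in for a frame-importance signal; the architecture and masking ratio are illustrative, not the paper's.

```python
# Assumed masked-autoencoding pretext task over per-frame features.
import torch
import torch.nn as nn

class FrameMAE(nn.Module):
    def __init__(self, dim=512, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(dim, dim)

    def forward(self, x):                            # x: (B, T, dim) frame features
        B, T, _ = x.shape
        mask = torch.rand(B, T, device=x.device) < self.mask_ratio  # True = masked
        x_masked = torch.where(mask.unsqueeze(-1), self.mask_token, x)
        recon = self.decoder(self.encoder(x_masked))
        # Per-frame reconstruction error on masked positions; an assumed proxy
        # for how the model could be read out as a summarizer.
        loss = ((recon - x) ** 2).mean(-1)[mask].mean()
        return loss, recon

model = FrameMAE()
loss, _ = model(torch.randn(2, 64, 512))
loss.backward()
```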
arXiv Detail & Related papers (2023-06-02T09:44:45Z)
- Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization [27.312423653997087]
Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing.
We propose three metrics featuring a desirable key frame: local dissimilarity, global consistency, and uniqueness.
We show that by refining the pre-trained features with a lightweight contrastively learned projection module, the frame-level importance scores can be further improved.
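The three named criteria lend themselves to a compact sketch on L2-normalized frame features; the concrete definitions below are plausible readings of "local dissimilarity, global consistency, and uniqueness", not the paper's formulas, and the equal weighting is an assumption.

```python
# Assumed instantiation of the three key-frame criteria on frame features.
import torch
import torch.nn.functional as F

def keyframe_scores(feats, window=5):
    """feats: (T, d) pre-trained frame features."""
    f = F.normalize(feats, dim=-1)
    sim = f @ f.t()                                   # (T, T) cosine similarities
    T = f.shape[0]
    idx = torch.arange(T)
    # Local dissimilarity: a key frame differs from its temporal neighbours.
    band = (idx[:, None] - idx[None, :]).abs().le(window) & (idx[:, None] != idx[None, :])
    local_dissim = 1.0 - (sim * band).sum(1) / band.sum(1).clamp(min=1)
    # Global consistency: a key frame still agrees with the video-level mean.
    global_cons = f @ F.normalize(f.mean(0), dim=-1)
    # Uniqueness: a key frame has few near-duplicates anywhere in the video.
    uniqueness = 1.0 - (sim - torch.eye(T)).max(1).values
    return local_dissim + global_cons + uniqueness    # assumed equal weighting

scores = keyframe_scores(torch.randn(120, 512))
```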
arXiv Detail & Related papers (2022-11-18T07:01:28Z)
- Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z)
- Self-Supervised Video Representation Learning by Video Incoherence Detection [28.540645395066434]
This paper introduces a novel self-supervised method that leverages incoherence detection for video representation learning.
It stems from the observation that the human visual system can easily identify video incoherence based on a comprehensive understanding of videos.
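One plausible instantiation of the incoherence pretext task is to delete a random span of frames and train a classifier to detect the cut; the corruption scheme and classification head below are assumptions, not the paper's exact setup.

```python
# Assumed incoherence-detection pretext task: remove a contiguous frame span.
import random
import torch
import torch.nn as nn

def make_incoherent(clip, max_skip=8):
    """clip: (T, C, H, W). Remove a random contiguous span, pad at the end."""
    T = clip.shape[0]
    start = random.randrange(1, T - max_skip)
    skip = random.randrange(2, max_skip + 1)
    kept = torch.cat([clip[:start], clip[start + skip:]], dim=0)
    pad = clip[-1:].repeat(T - kept.shape[0], 1, 1, 1)  # repeat last frame
    return torch.cat([kept, pad], dim=0)

# Binary pretext label: 1 if the clip was corrupted, else 0. Any video
# backbone producing a 512-d clip embedding could feed this head.
head = nn.Linear(512, 2)
corrupted = make_incoherent(torch.randn(16, 3, 112, 112))
logits = head(torch.randn(1, 512))                     # placeholder embedding
```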
arXiv Detail & Related papers (2021-09-26T04:58:13Z)
- Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal.
In contrast to prior approaches, the introduced diffusion-based representation learning relies on a new formulation of the denoising score matching objective.
Using the same approach, we propose to learn an infinite-dimensional latent code that achieves improvements of state-of-the-art models on semi-supervised image classification.
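A minimal sketch of a denoising score matching objective in which an encoder's latent code conditions the score network, which is one way such an objective can drive representation learning; the toy MLPs, 2-D data, and single noise scale are assumptions, not the paper's formulation.

```python
# Assumed sketch: denoising score matching with a learned conditioning code.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 8))
score_net = nn.Sequential(nn.Linear(2 + 8, 128), nn.SiLU(), nn.Linear(128, 2))

def dsm_repr_loss(x, sigma=0.5):
    z = encoder(x)                          # representation of the clean input
    noise = torch.randn_like(x)
    x_noisy = x + sigma * noise
    target = -noise / sigma                 # score of the Gaussian perturbation
    pred = score_net(torch.cat([x_noisy, z], dim=-1))
    return ((pred - target) ** 2).sum(-1).mean() * sigma ** 2

loss = dsm_repr_loss(torch.randn(256, 2))
loss.backward()
```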
arXiv Detail & Related papers (2021-05-29T09:26:02Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about specified spatial or temporal augmentations, and in doing so achieves state-of-the-art performance on a number of video benchmarks.
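A compact sketch of conditioning a contrastive projection head on an explicit augmentation parameterisation, so the representation can stay augmentation-aware rather than strictly invariant; the 4-dimensional parameter vector, stand-in encoder, and InfoNCE form are assumptions.

```python
# Assumed 'augmentation aware' contrastive setup: the projection head sees a
# vector describing each view's augmentations (e.g., temporal shift, crop box).
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Linear(512, 256)                       # stand-in video encoder
proj = nn.Sequential(nn.Linear(256 + 4, 256), nn.ReLU(), nn.Linear(256, 128))

def info_nce(q, k, temperature=0.1):
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    logits = q @ k.t() / temperature                 # positives on the diagonal
    return F.cross_entropy(logits, torch.arange(q.shape[0]))

x1, x2 = torch.randn(32, 512), torch.randn(32, 512)  # two views per clip
aug1, aug2 = torch.rand(32, 4), torch.rand(32, 4)    # augmentation parameters
z1 = proj(torch.cat([backbone(x1), aug1], dim=-1))
z2 = proj(torch.cat([backbone(x2), aug2], dim=-1))
loss = info_nce(z1, z2)
loss.backward()
```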
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.