Masked Autoencoder for Unsupervised Video Summarization
- URL: http://arxiv.org/abs/2306.01395v1
- Date: Fri, 2 Jun 2023 09:44:45 GMT
- Title: Masked Autoencoder for Unsupervised Video Summarization
- Authors: Minho Shim, Taeoh Kim, Jinhyung Kim, Dongyoon Wee
- Abstract summary: Self-supervised learning (SSL) is acknowledged for its robustness and flexibility across multiple downstream tasks.
We claim that an unsupervised autoencoder with sufficient self-supervised training needs no extra downstream architecture design or weight fine-tuning to be used as a video summarization model.
We evaluate the method on major unsupervised video summarization benchmarks to show its effectiveness under various experimental settings.
- Score: 10.853922245706716
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Summarizing a video requires a diverse understanding of the video,
ranging from recognizing scenes to judging how essential each frame is for
inclusion in the summary. Self-supervised learning (SSL) is acknowledged for
its robustness and flexibility across multiple downstream tasks, but video SSL
has not yet shown its value for dense understanding tasks like video
summarization. We claim that an unsupervised autoencoder with sufficient
self-supervised training needs no extra downstream architecture design or
weight fine-tuning to be used as a video summarization model. The proposed
method evaluates the importance score of each frame using the reconstruction
score of the autoencoder's decoder. We evaluate the method on major
unsupervised video summarization benchmarks to show its effectiveness under
various experimental settings.
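
A minimal sketch of the scoring idea the abstract describes, not the authors' released model: a toy masked autoencoder over frame features masks each frame in turn, and the decoder's reconstruction error on the masked position serves as that frame's importance score. The architecture, feature dimensions, and the error-as-score direction are all assumptions here.

```python
import torch
import torch.nn as nn

class ToyMaskedAutoencoder(nn.Module):
    """A stand-in masked autoencoder over precomputed frame features."""
    def __init__(self, feat_dim=512, hidden_layers=2):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))
        enc_layer = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=hidden_layers)
        self.decoder = nn.Linear(feat_dim, feat_dim)  # stand-in decoder

    def forward(self, frames, mask):  # frames: (B, T, D), mask: (B, T) bool
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(frames), frames)
        return self.decoder(self.encoder(x))

@torch.no_grad()
def frame_importance(model, frames):
    """Score each frame by how poorly it is reconstructed when masked."""
    B, T, _ = frames.shape
    scores = torch.zeros(B, T)
    for t in range(T):
        mask = torch.zeros(B, T, dtype=torch.bool)
        mask[:, t] = True                      # hide only frame t
        recon = model(frames, mask)
        scores[:, t] = (recon[:, t] - frames[:, t]).pow(2).mean(-1)
    return scores  # assumption: higher error = harder to infer from context

model = ToyMaskedAutoencoder()
video = torch.randn(1, 16, 512)               # 16 frames of 512-d features
print(frame_importance(model, video).shape)   # torch.Size([1, 16])
```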
Related papers
- Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z)
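
A hedged illustration of the needle-in-a-haystack construction behind VideoNIAH, not its released toolkit: splice a short, distinctive "needle" clip into a long "haystack" video and record its location as ground truth for a retrieval probe. Shapes and the splicing rule are assumptions.

```python
import numpy as np

def insert_needle(haystack, needle, rng=np.random.default_rng(0)):
    """haystack: (T, H, W, 3) video; needle: (t, H, W, 3) clip to hide in it."""
    pos = int(rng.integers(0, len(haystack) - len(needle)))
    video = np.concatenate([haystack[:pos], needle, haystack[pos + len(needle):]])
    return video, pos  # pos is the answer a model must retrieve

haystack = np.zeros((300, 32, 32, 3), dtype=np.uint8)  # dummy long video
needle = np.full((8, 32, 32, 3), 255, dtype=np.uint8)  # distinctive clip
video, pos = insert_needle(haystack, needle)
print(video.shape, "needle starts at frame", pos)
```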
- Active Learning for Video Classification with Frame Level Queries [13.135234328352885]
We propose a novel active learning framework for video classification.
Our framework identifies a batch of exemplar videos, together with a set of informative frames for each video.
This involves much less manual work than watching the complete video to come up with a label.
arXiv Detail & Related papers (2023-07-10T15:47:13Z)
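
A minimal sketch of frame-level active-learning queries; the uncertainty criterion below is an assumption, not the paper's exact rule. Videos are ranked by mean predictive entropy, and the most uncertain frames within the selected videos are sent for annotation.

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -(p * np.log(p + eps)).sum(-1)

def select_queries(frame_probs, n_videos=2, n_frames=3):
    """frame_probs: list of (T_i, C) per-frame class probabilities."""
    video_scores = [entropy(p).mean() for p in frame_probs]
    picked = np.argsort(video_scores)[::-1][:n_videos]  # most uncertain videos
    return {int(v): np.argsort(entropy(frame_probs[v]))[::-1][:n_frames].tolist()
            for v in picked}                            # their hardest frames

rng = np.random.default_rng(0)
probs = [rng.dirichlet(np.ones(5), size=20) for _ in range(4)]  # 4 videos, 20 frames
print(select_queries(probs))  # {video_id: [frame indices to label]}
```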
- SELF-VS: Self-supervised Encoding Learning For Video Summarization [6.21295508577576]
We propose a novel self-supervised video representation learning method using knowledge distillation to pre-train a transformer encoder.
Our method matches its semantic video representation, which is constructed with respect to frame importance scores, to a representation derived from a CNN trained on video classification.
arXiv Detail & Related papers (2023-03-28T14:08:05Z)
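
A hedged sketch of the distillation step SELF-VS describes: pool the transformer encoder's frame features with predicted importance scores and pull the pooled vector toward a frozen video-classification CNN's embedding. The dimensions and the cosine loss are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(frame_feats, importance_logits, teacher_embed):
    """frame_feats: (B, T, D); importance_logits: (B, T); teacher_embed: (B, D)."""
    w = torch.softmax(importance_logits, dim=1)           # frame importance weights
    student = (w.unsqueeze(-1) * frame_feats).sum(dim=1)  # importance-weighted pooling
    return 1 - F.cosine_similarity(student, teacher_embed, dim=-1).mean()

loss = distillation_loss(torch.randn(2, 16, 512), torch.randn(2, 16),
                         torch.randn(2, 512))
print(loss.item())
```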
- Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization [27.312423653997087]
Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing.
We propose three metrics that characterize a desirable key frame: local dissimilarity, global consistency, and uniqueness.
We show that by refining the pre-trained features with a lightweight contrastively learned projection module, the frame-level importance scores can be further improved.
arXiv Detail & Related papers (2022-11-18T07:01:28Z)
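
One plausible reading of the three key-frame criteria named above; the cosine-based formulations and equal weighting below are assumptions, not the paper's exact definitions.

```python
import numpy as np

def frame_scores(feats, k=2):
    """feats: (T, D) frame features; returns a combined key-frame score per frame."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T                          # (T, T) cosine similarities
    T = len(feats)
    global_mean = feats.mean(0)
    global_mean /= np.linalg.norm(global_mean)
    local, uniq = np.zeros(T), np.zeros(T)
    for t in range(T):
        nbrs = [i for i in range(max(0, t - k), min(T, t + k + 1)) if i != t]
        local[t] = 1 - sim[t, nbrs].mean()         # dissimilar to its neighbors
        uniq[t] = 1 - np.delete(sim[t], t).max()   # unlike every other frame
    consistency = feats @ global_mean              # consistent with the whole video
    return local + consistency + uniq              # naive equal weighting

print(frame_scores(np.random.default_rng(0).normal(size=(16, 64))))
```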
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
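
A structural sketch of the encoder-plus-scorer pipeline just described; layer sizes, segment pooling, and head design are assumptions, not the TL;DW release. A temporal encoder contextualizes frame features, segments are mean-pooled, and a second transformer scores each segment.

```python
import torch
import torch.nn as nn

class SegmentScorer(nn.Module):
    def __init__(self, dim=512, seg_len=8):
        super().__init__()
        self.seg_len = seg_len
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.frame_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.segment_scorer = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(dim, 1)

    def forward(self, frames):              # (B, T, D), T divisible by seg_len
        ctx = self.frame_encoder(frames)    # context-aware frame features
        B, T, D = ctx.shape
        segs = ctx.reshape(B, T // self.seg_len, self.seg_len, D).mean(2)
        return self.head(self.segment_scorer(segs)).squeeze(-1)  # (B, n_segments)

print(SegmentScorer()(torch.randn(1, 64, 512)).shape)  # torch.Size([1, 8])
```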
- A Coding Framework and Benchmark towards Low-Bitrate Video Understanding [63.05385140193666]
We propose a traditional-neural mixed coding framework that takes advantage of both traditional codecs and neural networks (NNs).
The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved.
We build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach.
arXiv Detail & Related papers (2022-02-06T16:29:15Z)
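
A loose conceptual sketch of the mixed-coding idea, with stand-ins throughout: heavy spatial downsampling plays the traditional codec, and a tiny autoencoder plays the neural branch carrying a compact semantic representation alongside the bitstream.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralBranch(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.enc = nn.Conv2d(3, dim, kernel_size=8, stride=8)  # compact semantics
        self.dec = nn.ConvTranspose2d(dim, 3, kernel_size=8, stride=8)

    def forward(self, frame):
        code = self.enc(frame)      # transported alongside the codec bitstream
        return self.dec(code), code

frame = torch.rand(1, 3, 64, 64)
codec_stream = F.interpolate(frame, scale_factor=0.25)  # "traditional" low-bitrate part
recon, semantics = NeuralBranch()(frame)
base = F.interpolate(codec_stream, size=frame.shape[-2:])
print(base.shape, semantics.shape)  # a real decoder would fuse base + semantics
```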
- Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z)
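
A hedged sketch of "progressive" summarization as described above; the greedy selection rule is an assumption. The summary grows one frame at a time, each step adding the frame that moves the summary's pooled feature closest to the video's semantic representation.

```python
import numpy as np

def progressive_summary(feats, budget=3):
    """feats: (T, D) frame features; returns indices of the chosen frames."""
    target = feats.mean(0)          # stand-in for the semantic video representation
    chosen = []
    for _ in range(budget):
        best, best_err = None, np.inf
        for t in range(len(feats)):
            if t in chosen:
                continue
            pooled = feats[chosen + [t]].mean(0)
            err = np.linalg.norm(pooled - target)
            if err < best_err:
                best, best_err = t, err
        chosen.append(best)         # pinpoint the next most useful frame
    return sorted(chosen)

print(progressive_summary(np.random.default_rng(0).normal(size=(20, 32))))
```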
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
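
A rough sketch of a positive-pair consistency objective in the spirit of ASCNet; the exact heads and clip sampling are assumptions. Features of two clips that share appearance (same video) or share playback speed are pulled together, with no explicit negatives.

```python
import torch
import torch.nn.functional as F

def consistency_loss(feat_a, feat_b):
    """Pull L2-normalized features of a positive pair together."""
    a, b = F.normalize(feat_a, dim=-1), F.normalize(feat_b, dim=-1)
    return (1 - (a * b).sum(-1)).mean()   # 1 - cosine similarity

same_video_clips = torch.randn(4, 128), torch.randn(4, 128)  # appearance positives
same_speed_clips = torch.randn(4, 128), torch.randn(4, 128)  # speed positives
loss = consistency_loss(*same_video_clips) + consistency_loss(*same_speed_clips)
print(loss.item())
```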
- Multiview Pseudo-Labeling for Semi-supervised Learning from Video [102.36355560553402]
We present a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video.
Our method capitalizes on multiple views, but it nonetheless trains a model that is shared across appearance and motion input.
On multiple video recognition datasets, our method substantially outperforms its supervised counterpart, and compares favorably to previous work on standard benchmarks in self-supervised video representation learning.
arXiv Detail & Related papers (2021-04-01T17:59:48Z)
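
A compact sketch of cross-view pseudo-labeling; the confidence threshold and symmetric exchange are assumptions. A shared classifier sees an appearance view and a motion view of the same unlabeled clip, and each view is trained on the other's confident predictions.

```python
import torch
import torch.nn.functional as F

def cross_view_loss(logits_rgb, logits_flow, threshold=0.8):
    losses = []
    for teacher, student in [(logits_rgb, logits_flow), (logits_flow, logits_rgb)]:
        probs = teacher.detach().softmax(-1)
        conf, pseudo = probs.max(-1)
        keep = conf > threshold               # keep only confident pseudo-labels
        if keep.any():
            losses.append(F.cross_entropy(student[keep], pseudo[keep]))
    return sum(losses) if losses else torch.tensor(0.0)

print(cross_view_loss(torch.randn(8, 10) * 5, torch.randn(8, 10) * 5))
```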
- Self-supervised Video Representation Learning by Pace Prediction [48.029602040786685]
This paper addresses the problem of self-supervised video representation learning from a new perspective -- by video pace prediction.
It stems from the observation that the human visual system is sensitive to video pace.
We randomly sample training clips in different paces and ask a neural network to identify the pace for each video clip.
arXiv Detail & Related papers (2020-08-13T12:40:24Z)
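
A minimal sketch of the pace-prediction pretext task; the sampling strides and the tiny classifier are placeholders. A clip is subsampled at a random pace, and a network is trained to identify which pace was used.

```python
import torch
import torch.nn as nn

PACES = [1, 2, 4, 8]  # candidate frame-sampling strides

def sample_paced_clip(video, clip_len=8, rng=torch.Generator().manual_seed(0)):
    """video: (T, D) frame features; returns a paced clip and its pace label."""
    label = int(torch.randint(len(PACES), (1,), generator=rng))
    stride = PACES[label]
    start = int(torch.randint(len(video) - clip_len * stride, (1,), generator=rng))
    return video[start:start + clip_len * stride:stride], label

classifier = nn.Sequential(nn.Flatten(), nn.Linear(8 * 64, len(PACES)))
video = torch.randn(128, 64)
clip, label = sample_paced_clip(video)
logits = classifier(clip.unsqueeze(0))   # predict which pace was used
loss = nn.functional.cross_entropy(logits, torch.tensor([label]))
print(clip.shape, label, loss.item())
```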
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
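
An illustrative sketch of the "everything is translation" framing; the prompt formats below are invented for illustration. Heterogeneous tasks are reduced to generating a target string from video features plus a task-specific query, so a single encoder-decoder can handle them all.

```python
def to_translation_example(video_feats, task, payload):
    """Cast a video task as (source, target-text) for one seq2seq model."""
    targets = {
        "classification": payload["label"],   # e.g. "wash hands"
        "captioning": payload["caption"],
        "qa": payload["answer"],
    }
    source = {"video": video_feats, "query": payload.get("question", task)}
    return source, targets[task]

src, tgt = to_translation_example([0.1, 0.2], "qa",
                                  {"question": "What is cut?", "answer": "a tomato"})
print(src["query"], "->", tgt)
```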
This list is automatically generated from the titles and abstracts of the papers on this site.