Video Summarization through Reinforcement Learning with a 3D
Spatio-Temporal U-Net
- URL: http://arxiv.org/abs/2106.10528v1
- Date: Sat, 19 Jun 2021 16:27:19 GMT
- Title: Video Summarization through Reinforcement Learning with a 3D
Spatio-Temporal U-Net
- Authors: Tianrui Liu, Qingjie Meng, Jun-Jie Huang, Athanasios Vlontzos, Daniel
Rueckert, Bernhard Kainz
- Abstract summary: We introduce 3DST-UNet-RL framework for video summarization.
We show experimental evidence for the effectiveness of 3DST-UNet-RL on two commonly used general video summarization benchmarks.
The proposed video summarization method has the potential to reduce the storage costs of ultrasound screening videos and to increase efficiency when browsing patient video data during retrospective analysis.
- Score: 15.032516344808526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Intelligent video summarization algorithms quickly convey the most relevant information in a video by identifying its most essential and explanatory content while removing redundant frames. In this paper, we introduce the 3DST-UNet-RL framework for video summarization. A 3D spatio-temporal U-Net efficiently encodes spatio-temporal information of the input videos for downstream reinforcement learning (RL). An RL agent learns from spatio-temporal latent scores and predicts actions for keeping or rejecting a video frame in the summary. We investigate whether real/inflated 3D spatio-temporal CNN features are better suited to learning representations from videos than commonly used 2D image features. Our framework can operate in both a fully unsupervised mode and a supervised training mode. We analyse the impact of prescribed summary lengths and show experimental evidence for the effectiveness of 3DST-UNet-RL on two commonly used general video summarization benchmarks. We also apply our method to a medical video summarization task. The proposed method has the potential to reduce the storage costs of ultrasound screening videos and to increase efficiency when browsing patient video data during retrospective analysis or audit, without losing essential information.
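As a rough illustration of the keep/reject decision process described in the abstract, the sketch below pairs per-frame latent features with a Bernoulli policy trained by REINFORCE. The encoder stand-in, layer sizes, and diversity reward are illustrative assumptions for this sketch, not the authors' 3DST-UNet-RL implementation.

```python
# Minimal sketch of a frame keep/reject RL policy over precomputed
# spatio-temporal features. All module names, sizes, and the toy reward
# are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FramePolicy(nn.Module):
    """Maps per-frame latent features to a probability of keeping the frame."""

    def __init__(self, feat_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),  # keep probability per frame
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_frames, feat_dim) latents from a spatio-temporal encoder
        return self.net(feats).squeeze(-1)


def diversity_reward(feats: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
    """Toy reward: mean pairwise dissimilarity of the kept frames' features."""
    kept = feats[keep.bool()]
    if kept.shape[0] < 2:
        return torch.tensor(0.0)
    normed = F.normalize(kept, dim=-1)
    sim = normed @ normed.t()
    n = kept.shape[0]
    off_diag = (sim.sum() - sim.diag().sum()) / (n * (n - 1))
    return 1.0 - off_diag


# One REINFORCE-style update for a single video (episode).
feats = torch.randn(120, 256)          # placeholder for 3D U-Net latents
policy = FramePolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

probs = policy(feats)                  # keep probability per frame
dist = torch.distributions.Bernoulli(probs)
actions = dist.sample()                # 1 = keep, 0 = reject
reward = diversity_reward(feats, actions)
loss = -(dist.log_prob(actions).sum() * reward)
opt.zero_grad()
loss.backward()
opt.step()
```

In practice such a policy would be driven by the reward terms and summary-length constraints defined in the paper; the diversity term above only stands in for that machinery.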
Related papers
- SEAL: Semantic Attention Learning for Long Video Representation [31.994155533019843]
This paper introduces SEmantic Attention Learning (SEAL), a novel unified representation for long videos.
To reduce computational complexity, long videos are decomposed into three distinct types of semantic entities.
Our representation is versatile, enabling applications across various long video understanding tasks.
arXiv Detail & Related papers (2024-12-02T18:46:12Z) - T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs [102.66246727371583]
We develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus.
We find that the proposed scheme can boost the performance of long video understanding without training with long video samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z) - Temporal Reasoning Transfer from Text to Video [51.68487044397409]
Video Large Language Models (Video LLMs) struggle with tracking temporal changes and reasoning about temporal relationships.
We introduce the Textual Temporal reasoning Transfer (T3) to transfer temporal reasoning abilities from text to video domains.
LongVA-7B model achieves competitive performance on comprehensive video benchmarks.
arXiv Detail & Related papers (2024-10-08T16:10:29Z) - Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement
Learning Method [6.172652648945223]
This paper presents a novel weakly-supervised methodology to accelerate instructional videos using text.
A novel joint reward function guides our agent to select which frames to remove and reduce the input video to a target length.
We also propose the Extended Visually-guided Document Attention Network (VDAN+), which can generate a highly discriminative embedding space.
arXiv Detail & Related papers (2022-03-29T17:43:01Z) - Condensing a Sequence to One Informative Frame for Video Recognition [113.3056598548736]
This paper studies a two-step alternative that first condenses the video sequence to an informative "frame".
A valid question is how to define "useful information" and then distill from a sequence down to one synthetic frame.
IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks.
arXiv Detail & Related papers (2022-01-11T16:13:43Z) - Video Salient Object Detection via Contrastive Features and Attention
Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z) - Video 3D Sampling for Self-supervised Representation Learning [13.135859819622855]
We propose a novel self-supervised method for video representation learning, referred to as Video 3D Sampling (V3S).
In our implementation, we combine the sampling of the three dimensions and propose the scale and projection transformations in space and time respectively.
The experimental results show that, when applied to action recognition, video retrieval and action similarity labeling, our approach improves over the state of the art by significant margins.
arXiv Detail & Related papers (2021-07-08T03:22:06Z) - Temporal Stochastic Softmax for 3D CNNs: An Application in Facial
Expression Recognition [11.517316695930596]
We present a strategy for efficient video-based training of 3D CNNs.
It relies on softmax temporal pooling and a weighted sampling mechanism to select the most relevant training clips.
arXiv Detail & Related papers (2020-11-10T16:40:00Z) - RSPNet: Relative Speed Perception for Unsupervised Video Representation
Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z) - Spatiotemporal Contrastive Video Representation Learning [87.56145031149869]
We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn visual representations from unlabeled videos.
Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space.
We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial.
arXiv Detail & Related papers (2020-08-09T19:58:45Z) - Exploring Relations in Untrimmed Videos for Self-Supervised Learning [17.670226952829506]
Existing self-supervised learning methods mainly rely on trimmed videos for model training.
We propose a novel self-supervised method, referred to as Exploring Relations in Untrimmed Videos (ERUV).
ERUV is able to learn richer representations and outperforms state-of-the-art self-supervised methods by significant margins.
arXiv Detail & Related papers (2020-08-06T15:29:25Z)