Scene Summarization: Clustering Scene Videos into Spatially Diverse Frames
- URL: http://arxiv.org/abs/2311.17940v1
- Date: Tue, 28 Nov 2023 22:18:26 GMT
- Title: Scene Summarization: Clustering Scene Videos into Spatially Diverse Frames
- Authors: Chao Chen, Mingzhi Zhu, Ankush Pratap Singh, Yu Yan, Felix Juefei-Xu, Chen Feng
- Abstract summary: We propose scene summarization as a new video-based scene understanding task.
It aims to summarize a long video walkthrough of a scene into a small set of frames that are spatially diverse in the scene.
Our solution is a two-stage self-supervised pipeline named SceneSum.
- Score: 24.614476456145255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose scene summarization as a new video-based scene understanding task.
It aims to summarize a long video walkthrough of a scene into a small set of
frames that are spatially diverse in the scene, which has many important
applications, such as in surveillance, real estate, and robotics. It stems from
video summarization but focuses on long and continuous videos from moving
cameras, instead of user-edited fragmented video clips that are more commonly
studied in existing video summarization works. Our solution to this task is a
two-stage self-supervised pipeline named SceneSum. Its first stage uses
clustering to segment the video sequence. Our key idea is to combine visual
place recognition (VPR) into this clustering process to promote spatial
diversity. Its second stage selects a representative keyframe from each
cluster as the summary while respecting resource constraints such as memory and
disk space limits. Additionally, if the ground truth image trajectory is
available, our method can be easily augmented with a supervised loss to enhance
the clustering and keyframe selection. Extensive experiments on both real-world
and simulated datasets show our method outperforms common video summarization
baselines by 50%.
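As a rough sketch of the two-stage idea (not the authors' released implementation), the Python below clusters per-frame VPR descriptors with k-means and keeps the frame nearest each cluster centroid, up to a fixed budget; `vpr_descriptor` is a hypothetical stand-in for any visual place recognition backbone.

```python
import numpy as np
from sklearn.cluster import KMeans

def summarize(frames, vpr_descriptor, k=8, budget=8):
    """Cluster VPR descriptors, then keep one representative frame per
    cluster, capped at `budget` frames (a sketch, not SceneSum itself)."""
    # Stage 1: embed every frame with a place-recognition descriptor
    # (vpr_descriptor is a hypothetical stand-in for a VPR model).
    feats = np.stack([vpr_descriptor(f) for f in frames])
    km = KMeans(n_clusters=k, n_init=10).fit(feats)

    # Stage 2: per cluster, pick the frame nearest the centroid, then
    # honor the budget by keeping picks from the largest clusters first.
    picks = []
    for c in range(k):
        idx = np.flatnonzero(km.labels_ == c)
        if idx.size == 0:
            continue
        dist = np.linalg.norm(feats[idx] - km.cluster_centers_[c], axis=1)
        picks.append((idx.size, int(idx[np.argmin(dist)])))
    picks.sort(reverse=True)  # largest clusters first
    return [frame_idx for _, frame_idx in picks[:budget]]
```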
Related papers
- A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video [20.579167394855197]
This paper proposes a practical multimodal video summarization task setting and dataset to train and evaluate the task.
The target task involves summarizing a given video into a set of keyframe-caption pairs and displaying them in a listable format so that the video content can be grasped quickly.
This task is useful as a practical application and presents a highly challenging problem worthy of study.
arXiv Detail & Related papers (2023-12-04T02:17:14Z)
- Self-supervised Object-Centric Learning for Videos [39.02148880719576]
We propose the first fully unsupervised method for segmenting multiple objects in real-world sequences.
Our object-centric learning framework spatially binds objects to slots on each frame and then relates these slots across frames.
Our method can successfully segment multiple instances of complex and high-variety classes in YouTube videos.
arXiv Detail & Related papers (2023-10-10T18:03:41Z)
- Tracking Anything with Decoupled Video Segmentation [87.07258378407289]
We develop a decoupled video segmentation approach (DEVA).
It is composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation.
We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks.
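A minimal sketch of that decoupling, forward-only for brevity (DEVA's propagation is bi-directional) and with both callables as hypothetical stand-ins:

```python
def decoupled_track(frames, segment_frame, propagate, stride=5):
    """Run a task-specific image-level segmenter only every `stride`
    frames and fill the gaps with a task-agnostic temporal propagator."""
    masks = [None] * len(frames)
    for i in range(0, len(frames), stride):
        masks[i] = segment_frame(frames[i])  # image-level segmentation
    for i in range(len(frames)):
        if masks[i] is None:  # propagate from the previous frame's mask
            masks[i] = propagate(masks[i - 1], frames[i - 1], frames[i])
    return masks
```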
arXiv Detail & Related papers (2023-09-07T17:59:41Z)
- Key Frame Extraction with Attention Based Deep Neural Networks [0.0]
We propose a deep learning-based approach for keyframe detection using a deep auto-encoder model with an attention layer.
The method first extracts features from the video frames with the encoder part of the auto-encoder, then segments them with the k-means algorithm so that similar frames are grouped together.
The method was evaluated on the TVSum video dataset and achieved a classification accuracy of 0.77, indicating a higher success rate than many existing methods.
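A minimal PyTorch sketch of such an attention-equipped auto-encoder (layer sizes are assumptions, not the paper's architecture); features from `encode` would then be grouped with k-means as described above:

```python
import torch
import torch.nn as nn

class AttentionAutoencoder(nn.Module):
    """Toy auto-encoder with a self-attention layer over frame features."""
    def __init__(self, dim=512, latent=128, heads=4):
        super().__init__()
        self.enc = nn.Linear(dim, latent)
        self.attn = nn.MultiheadAttention(latent, heads, batch_first=True)
        self.dec = nn.Linear(latent, dim)

    def encode(self, x):            # x: (batch, n_frames, dim)
        z = self.enc(x)
        z, _ = self.attn(z, z, z)   # frames attend to one another
        return z

    def forward(self, x):           # reconstruction target is x itself
        return self.dec(self.encode(x))
```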
arXiv Detail & Related papers (2023-06-21T15:09:37Z)
- Scene Consistency Representation Learning for Video Scene Segmentation [26.790491577584366]
We propose an effective Self-Supervised Learning (SSL) framework to learn better shot representations from long-term videos.
We present an SSL scheme to achieve scene consistency, while exploring considerable data augmentation and shuffling methods to boost the model generalizability.
Our method achieves state-of-the-art performance on the task of Video Scene Segmentation.
arXiv Detail & Related papers (2022-05-11T13:31:15Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
- Self-supervised Video-centralised Transformer for Video Face Clustering [58.12996668434134]
This paper presents a novel method for face clustering in videos using a video-centralised transformer.
We release the first large-scale egocentric video face clustering dataset named EasyCom-Clustering.
arXiv Detail & Related papers (2022-03-24T16:38:54Z)
- Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos.
We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training.
Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
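One way to realize such a temporally-weighted grouping (an illustrative mixing rule, not the paper's exact formulation) is to blend feature dissimilarity with normalized temporal separation before standard hierarchical clustering:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def temporally_weighted_segments(feats, n_segments, alpha=0.5):
    """Hierarchical clustering whose pairwise distance mixes appearance
    dissimilarity with temporal gap; `alpha` weights the temporal term."""
    n = len(feats)
    d_feat = pdist(feats)                        # appearance dissimilarity
    t = np.arange(n, dtype=float)[:, None]
    d_time = pdist(t) / (n - 1)                  # normalized frame gap
    d = (1 - alpha) * d_feat / d_feat.max() + alpha * d_time
    z = linkage(d, method="average")
    return fcluster(z, t=n_segments, criterion="maxclust")  # label per frame
```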
arXiv Detail & Related papers (2021-03-20T23:30:01Z)
- VideoClick: Video Object Segmentation with a Single Click [93.7733828038616]
We propose a bottom-up approach where, given a single click for each object in a video, we obtain the segmentation masks of these objects in the full video.
In particular, we construct a correlation volume that assigns each pixel in a target frame to either one of the objects in the reference frame or the background.
Results on this new CityscapesVideo dataset show that our approach outperforms all the baselines in this challenging setting.
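A toy version of that correlation-volume assignment (shapes and the nearest-neighbor readout are illustrative, not VideoClick's architecture):

```python
import torch

def assign_pixels(tgt_feat, ref_feat, ref_masks):
    """Label each target pixel with the object id of its best-matching
    reference pixel (0 = background). tgt_feat/ref_feat: (C, H, W);
    ref_masks: (H, W) integer object labels."""
    C, H, W = tgt_feat.shape
    t = tgt_feat.reshape(C, -1)   # (C, HW) target features
    r = ref_feat.reshape(C, -1)   # (C, HW) reference features
    corr = t.T @ r                # (HW, HW) correlation volume
    best = corr.argmax(dim=1)     # best reference pixel per target pixel
    return ref_masks.reshape(-1)[best].reshape(H, W)
```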
arXiv Detail & Related papers (2021-01-16T23:07:48Z)
- Video Panoptic Segmentation [117.08520543864054]
We propose and explore a new video extension of panoptic segmentation, called video panoptic segmentation.
To invigorate research on this new task, we present two types of video panoptic datasets.
We propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames.
arXiv Detail & Related papers (2020-06-19T19:35:47Z)