Temporally stable video segmentation without video annotations
- URL: http://arxiv.org/abs/2110.08893v1
- Date: Sun, 17 Oct 2021 18:59:11 GMT
- Title: Temporally stable video segmentation without video annotations
- Authors: Aharon Azulay, Tavi Halperin, Orestis Vantzos, Nadav Bornstein, Ofir Bibi
- Abstract summary: We introduce a method to adapt still image segmentation models to video in an unsupervised manner.
We verify that the consistency measure is well correlated with human judgement via a user study.
We observe stability improvements in the generated segmented videos with minimal loss of accuracy.
- Score: 6.184270985214255
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Temporally consistent dense video annotations are scarce and hard to collect.
In contrast, image segmentation datasets (and pre-trained models) are
ubiquitous, and easier to label for any novel task. In this paper, we introduce
a method to adapt still image segmentation models to video in an unsupervised
manner, by using an optical flow-based consistency measure. To ensure that the
inferred segmented videos appear more stable in practice, we verify that the
consistency measure is well correlated with human judgement via a user study.
Training a new multi-input multi-output decoder with this measure as a loss, combined with a technique for
refining current image segmentation datasets and a temporal weighted guided filter, yields stability
improvements in the generated segmented videos with minimal loss of accuracy.
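The optical-flow-based consistency measure described above can be pictured as warping the previous frame's predicted segmentation into the current frame with backward optical flow and penalizing disagreement with the current prediction. Below is a minimal PyTorch sketch under that assumption; the paper's exact measure, flow convention, and occlusion handling may differ, and `warp_with_flow` / `temporal_consistency_loss` are illustrative names, not the authors' code.
```python
import torch
import torch.nn.functional as F

def warp_with_flow(prev_mask: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp the previous frame's segmentation into the current frame using
    backward optical flow (flow maps current-frame pixels to previous-frame
    locations). prev_mask: (B, C, H, W), flow: (B, 2, H, W) in pixels."""
    b, _, h, w = flow.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=flow.dtype, device=flow.device),
        torch.arange(w, dtype=flow.dtype, device=flow.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow   # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)               # (B, H, W, 2)
    return F.grid_sample(prev_mask, grid, align_corners=True)

def temporal_consistency_loss(curr_mask, prev_mask, flow):
    """Penalize disagreement between the current prediction and the
    flow-warped previous prediction (hypothetical L1 form of the measure)."""
    warped = warp_with_flow(prev_mask, flow)
    return (curr_mask - warped).abs().mean()
```
In practice such a term would be combined with the original segmentation loss, which is consistent with the abstract's claim of gaining stability with minimal loss of accuracy.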
Related papers
- VidToMe: Video Token Merging for Zero-Shot Video Editing [100.79999871424931]
We propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames.
Our method improves temporal coherence and reduces memory consumption in self-attention computations.
arXiv Detail & Related papers (2023-12-17T09:05:56Z)
- Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding [57.917616284917756]
Real-world videos are often several minutes long with semantically consistent segments of variable length.
A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length.
This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative.
arXiv Detail & Related papers (2023-09-20T18:13:32Z)
- Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations [79.87044240860466]
We propose a novel approach that incorporates temporal consistency in dense self-supervised learning.
Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos.
Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images.
arXiv Detail & Related papers (2023-08-22T21:28:58Z)
- Generating Long Videos of Dynamic Scenes [66.56925105992472]
We present a video generation model that reproduces object motion, changes in camera viewpoint, and new content that arises over time.
A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency.
arXiv Detail & Related papers (2022-06-07T16:29:51Z)
- Video Demoireing with Relation-Based Temporal Consistency [68.20281109859998]
Moiré patterns, appearing as color distortions, severely degrade image and video quality when filming a screen with digital cameras.
We study how to remove such undesirable moire patterns in videos, namely video demoireing.
arXiv Detail & Related papers (2022-04-06T17:45:38Z)
- Adaptive Compact Attention For Few-shot Video-to-video Translation [13.535988102579918]
We introduce a novel adaptive compact attention mechanism to efficiently extract contextual features jointly from multiple reference images.
Our core idea is to extract compact basis sets from all the reference images as higher-level representations.
We extensively evaluate our method on a large-scale talking-head video dataset and a human dancing dataset.
arXiv Detail & Related papers (2020-11-30T11:19:12Z)
- Coherent Loss: A Generic Framework for Stable Video Segmentation [103.78087255807482]
We investigate how a jittering artifact degrades the visual quality of video segmentation results.
We propose a Coherent Loss with a generic framework to enhance the performance of a neural network against jittering artifacts.
arXiv Detail & Related papers (2020-10-25T10:48:28Z)
- Improving Semantic Segmentation through Spatio-Temporal Consistency Learned from Videos [39.25927216187176]
We leverage unsupervised learning of depth, egomotion, and camera intrinsics to improve single-image semantic segmentation.
The predicted depth, egomotion, and camera intrinsics are used to provide an additional supervision signal to the segmentation model.
arXiv Detail & Related papers (2020-04-11T07:09:29Z)
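The last entry above relates pixels across frames via predicted depth, egomotion, and camera intrinsics. A minimal sketch of the standard pinhole reprojection that makes such cross-frame supervision possible follows; this is generic structure-from-motion geometry, not that paper's exact formulation, and `reproject` is an illustrative name.
```python
import torch

def reproject(depth: torch.Tensor, pose: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Map source-frame pixel coordinates into the target frame using predicted
    depth (H, W), a 4x4 relative camera pose (source -> target), and 3x3
    intrinsics K. Standard pinhole geometry; illustrative only."""
    h, w = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=depth.dtype, device=depth.device),
        torch.arange(w, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    ones = torch.ones_like(xs)
    pix = torch.stack((xs, ys, ones), dim=0).reshape(3, -1)   # homogeneous pixel coords
    rays = torch.linalg.inv(K) @ pix                           # unit-depth camera rays
    cam = rays * depth.reshape(1, -1)                          # back-projected 3D points
    cam_h = torch.cat(
        (cam, torch.ones(1, cam.shape[1], dtype=depth.dtype, device=depth.device)), dim=0
    )
    proj = K @ (pose @ cam_h)[:3]                              # move to target frame, project
    uv = proj[:2] / proj[2:].clamp(min=1e-6)                   # divide by depth
    return uv.reshape(2, h, w)
```
The returned coordinates indicate where each source pixel lands in the target frame; segmentation predictions sampled at corresponding locations can then be encouraged to agree, which is one plausible form of the additional supervision signal the summary mentions.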
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.