Time Does Tell: Self-Supervised Time-Tuning of Dense Image
Representations
- URL: http://arxiv.org/abs/2308.11796v1
- Date: Tue, 22 Aug 2023 21:28:58 GMT
- Title: Time Does Tell: Self-Supervised Time-Tuning of Dense Image
Representations
- Authors: Mohammadreza Salehi, Efstratios Gavves, Cees G. M. Snoek, Yuki M.
Asano
- Abstract summary: We propose a novel approach that incorporates temporal consistency in dense self-supervised learning.
Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos.
Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images.
- Score: 79.87044240860466
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatially dense self-supervised learning is a rapidly growing problem domain
with promising applications for unsupervised segmentation and pretraining for
dense downstream tasks. Despite the abundance of temporal data in the form of
videos, this information-rich source has been largely overlooked. Our paper
aims to address this gap by proposing a novel approach that incorporates
temporal consistency in dense self-supervised learning. While methods designed solely for images struggle to reach even the same performance on videos, our method improves representation quality not only for videos but also for images. Our approach, which we call time-tuning, starts from
image-pretrained models and fine-tunes them with a novel self-supervised
temporal-alignment clustering loss on unlabeled videos. This effectively
facilitates the transfer of high-level information from videos to image
representations. Time-tuning improves the state-of-the-art by 8-10% for
unsupervised semantic segmentation on videos and matches it for images. We
believe this method paves the way for further self-supervised scaling by
leveraging the abundant availability of videos. The implementation can be found
here: https://github.com/SMSD75/Timetuning
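To make the objective concrete, here is a minimal PyTorch sketch of a temporal-alignment clustering loss in the spirit described above: dense features from two frames of the same clip are softly assigned to shared learnable prototypes, and each frame is trained to predict the other frame's sharpened assignments. The prototype count, the temperatures, and the softmax sharpening standing in for the paper's balanced cluster assignments are all illustrative assumptions; the official implementation lives in the repository linked above.

```python
# Minimal sketch of a temporal-alignment clustering loss (NOT the official
# TimeT implementation; see https://github.com/SMSD75/Timetuning for that).
import torch
import torch.nn.functional as F

def temporal_alignment_clustering_loss(feat_t, feat_t1, prototypes, tau=0.1):
    """feat_t, feat_t1: (N, D) dense patch features from two frames of one
    clip, assumed spatially corresponding. prototypes: (K, D) learnable."""
    # Cosine similarity of each patch feature to each prototype.
    p = F.normalize(prototypes, dim=-1)
    logits_t = F.normalize(feat_t, dim=-1) @ p.T / tau     # (N, K)
    logits_t1 = F.normalize(feat_t1, dim=-1) @ p.T / tau   # (N, K)

    # Sharpened "teacher" assignments from the other frame (no gradient),
    # standing in for the paper's balanced cluster assignments.
    with torch.no_grad():
        targets_t1 = F.softmax(logits_t1 / 0.5, dim=-1)
        targets_t = F.softmax(logits_t / 0.5, dim=-1)

    # Swapped prediction: frame t predicts frame t+1's clusters, and vice versa.
    loss = -(targets_t1 * F.log_softmax(logits_t, dim=-1)).sum(-1).mean()
    loss = loss - (targets_t * F.log_softmax(logits_t1, dim=-1)).sum(-1).mean()
    return 0.5 * loss

# Toy usage with random features in place of a ViT's dense patch tokens.
feats_a, feats_b = torch.randn(196, 384), torch.randn(196, 384)
protos = torch.nn.Parameter(torch.randn(64, 384))
print(temporal_alignment_clustering_loss(feats_a, feats_b, protos))
```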
Related papers
- DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance [69.0740091741732]
We propose DreamVideo, a high-fidelity image-to-video generation method that devises a frame retention branch on top of a pre-trained video diffusion model.
Our model has a powerful image retention ability and, to the best of our knowledge, delivers the best results on UCF101 among image-to-video models.
arXiv Detail & Related papers (2023-12-05T03:16:31Z)
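As a rough, hypothetical illustration of DreamVideo's frame-retention idea (not its actual architecture): a zero-initialized side branch can encode the reference image and inject its features additively into a video denoiser's activations, so generation stays anchored to the input frame.

```python
# Hypothetical sketch of a frame-retention side branch; the real DreamVideo
# attaches such a branch to a pre-trained video diffusion U-Net.
import torch
import torch.nn as nn

class FrameRetentionBranch(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        # Encodes the reference image into features matching the denoiser's.
        self.encode = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.SiLU(),
                                    nn.Conv2d(ch, ch, 3, padding=1))
        # Zero-initialized projection so training starts as a no-op,
        # a common trick for grafting branches onto frozen backbones.
        self.zero_proj = nn.Conv2d(ch, ch, 1)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, ref_image, video_feats):
        # video_feats: (B, T, C, H, W); broadcast image features over time.
        img_feats = self.zero_proj(self.encode(ref_image))   # (B, C, H, W)
        return video_feats + img_feats.unsqueeze(1)

branch = FrameRetentionBranch()
out = branch(torch.randn(2, 3, 32, 32), torch.randn(2, 8, 64, 32, 32))
print(out.shape)  # torch.Size([2, 8, 64, 32, 32])
```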
- Correlation-aware active learning for surgery video segmentation [13.327429312047396]
This work proposes COWAL (COrrelation-aWare Active Learning), a novel active learning strategy for surgery video segmentation.
Our approach involves projecting images into a latent space that has been fine-tuned using contrastive learning and then selecting a fixed number of representative images from local clusters of video frames.
We demonstrate the effectiveness of this approach on two video datasets of surgical instruments and three real-world video datasets.
arXiv Detail & Related papers (2023-11-15T09:30:52Z)
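The selection step that COWAL's summary describes can be sketched generically (an assumed recipe, not the paper's exact algorithm): cluster frame embeddings and label the frame nearest each centroid.

```python
# Sketch of cluster-based representative selection: embed frames, cluster
# them, and pick the frame nearest each centroid for annotation.
import numpy as np
from sklearn.cluster import KMeans

def select_representatives(embeddings, budget):
    """embeddings: (num_frames, dim) array, e.g. from a contrastively
    fine-tuned encoder. Returns `budget` frame indices to label."""
    km = KMeans(n_clusters=budget, n_init=10).fit(embeddings)
    picks = []
    for k in range(budget):
        dists = np.linalg.norm(embeddings - km.cluster_centers_[k], axis=1)
        picks.append(int(dists.argmin()))
    return picks

frames = np.random.randn(500, 128)   # placeholder frame embeddings
print(select_representatives(frames, budget=10))
```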
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
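The ViFi-CLIP recipe of pooling frame-level CLIP embeddings can be sketched as follows; the tensors below are random stand-ins for real CLIP image and text embeddings, and the logit scale is an assumption.

```python
# Sketch of the frame-pooling recipe: encode each frame with CLIP's image
# encoder, average over time, then match against class text embeddings.
import torch
import torch.nn.functional as F

def video_text_logits(frame_embeds, text_embeds, scale=100.0):
    """frame_embeds: (B, T, D) per-frame CLIP image embeddings.
    text_embeds: (C, D) CLIP text embeddings, one per class prompt."""
    video = F.normalize(frame_embeds.mean(dim=1), dim=-1)  # temporal pooling
    text = F.normalize(text_embeds, dim=-1)
    return scale * video @ text.T                          # (B, C)

logits = video_text_logits(torch.randn(4, 16, 512), torch.randn(10, 512))
print(logits.argmax(dim=-1))  # predicted class per video
```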
- iBoot: Image-bootstrapped Self-Supervised Video Representation Learning [45.845595749486215]
Video self-supervised learning (SSL) suffers from added challenges: video datasets are typically not as large as image datasets.
We propose to utilize a strong image-based model, pre-trained with self- or language supervision, in a video representation learning framework.
The proposed algorithm is shown to learn much more efficiently in fewer epochs and with a smaller batch size.
arXiv Detail & Related papers (2022-06-16T17:42:48Z)
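A minimal sketch of the image-bootstrapping idea, assuming a frozen per-frame image teacher and a video student that regresses its pooled targets (iBoot's actual objective and architectures differ):

```python
# Sketch of bootstrapping a video model from a frozen image model.
import torch
import torch.nn.functional as F

def image_bootstrap_loss(video_student, image_teacher, clip):
    """clip: (B, T, C, H, W). The frozen teacher embeds frames independently;
    the student embeds the clip and regresses the pooled teacher target."""
    b, t = clip.shape[:2]
    with torch.no_grad():
        frame_targets = image_teacher(clip.flatten(0, 1))   # (B*T, D)
        target = frame_targets.view(b, t, -1).mean(dim=1)   # (B, D)
    pred = video_student(clip)                              # (B, D)
    return 1 - F.cosine_similarity(pred, target, dim=-1).mean()

# Toy usage with stand-in encoders in place of real networks.
teacher = lambda x: x.mean(dim=(2, 3))     # per-frame embedding (N, C)
student = lambda v: v.mean(dim=(1, 3, 4))  # per-clip embedding (B, C)
print(image_bootstrap_loss(student, teacher, torch.randn(2, 8, 3, 32, 32)))
```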
- Deep Video Prior for Video Consistency and Propagation [58.250209011891904]
We present a novel and general approach for blind video temporal consistency.
Our method is trained directly on a single pair of original and processed videos rather than a large dataset.
We show that temporal consistency can be achieved by training a convolutional neural network on a video with Deep Video Prior.
arXiv Detail & Related papers (2022-01-27T16:38:52Z)
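The Deep Video Prior recipe reduces to a short training loop: fit a small network on the single original/processed pair, and rely on the shared network (plus early stopping) to suppress flicker. The architecture, loss, and schedule below are assumptions for illustration.

```python
# Sketch of the Deep Video Prior recipe: because one network is fit to all
# frames, its outputs are temporally smoother than the flickery targets.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 3, 3, padding=1))
opt = torch.optim.Adam(net.parameters(), lr=1e-4)

original = torch.rand(20, 3, 64, 64)    # placeholder input video
processed = torch.rand(20, 3, 64, 64)   # placeholder per-frame-processed video

for step in range(100):                 # early stopping matters in practice
    idx = torch.randint(0, original.shape[0], (4,))
    loss = nn.functional.l1_loss(net(original[idx]), processed[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
print(net(original).shape)  # temporally consistent output frames
```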
- Learning by Aligning Videos in Time [10.075645944474287]
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task.
We leverage a novel combination of temporal alignment loss and temporal regularization terms, which can be used as supervision signals for training an encoder network.
arXiv Detail & Related papers (2021-03-31T17:55:52Z)
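A common choice for such a temporal alignment loss is soft-DTW between two embedded frame sequences; the sketch below shows that alignment term alone (the paper's additional temporal regularization is omitted), with `gamma` controlling the softness of the min.

```python
# Sketch of a soft-DTW alignment loss between two embedded frame sequences.
import torch

def soft_dtw(x, y, gamma=0.1):
    """x: (N, D), y: (M, D) frame embeddings of two videos."""
    cost = torch.cdist(x, y) ** 2                  # (N, M) pairwise costs
    n, m = cost.shape
    r = torch.full((n + 1, m + 1), float("inf"))
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = torch.stack([r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]])
            # Softmin via log-sum-exp keeps the recursion differentiable.
            r[i, j] = cost[i - 1, j - 1] - gamma * torch.logsumexp(-prev / gamma, 0)
    return r[n, m]

print(soft_dtw(torch.randn(12, 64), torch.randn(15, 64)))
```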
- Blind Video Temporal Consistency via Deep Video Prior [61.062900556483164]
We present a novel and general approach for blind video temporal consistency.
Our method is only trained on a pair of original and processed videos directly.
We show that temporal consistency can be achieved by training a convolutional network on a video with the Deep Video Prior.
arXiv Detail & Related papers (2020-10-22T16:19:20Z)
- Watching the World Go By: Representation Learning from Unlabeled Videos [78.22211989028585]
Recent single-image unsupervised representation learning techniques show remarkable success on a variety of tasks.
In this paper, we argue that videos offer natural data augmentation for free.
We propose Video Noise Contrastive Estimation, a method for using unlabeled video to learn strong, transferable single image representations.
arXiv Detail & Related papers (2020-03-18T00:07:21Z)
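The frames-as-augmentations idea maps naturally onto a standard InfoNCE loss, sketched below as an illustration of the principle rather than VINCE's exact formulation: two frames from the same video form a positive pair, and other videos in the batch supply negatives.

```python
# Sketch of frame-pair contrastive learning with InfoNCE.
import torch
import torch.nn.functional as F

def frame_pair_infonce(z1, z2, tau=0.07):
    """z1, z2: (B, D) embeddings of two frames per video, row-aligned."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau                 # (B, B) similarity matrix
    labels = torch.arange(z1.shape[0])       # positives on the diagonal
    return F.cross_entropy(logits, labels)

print(frame_pair_infonce(torch.randn(8, 128), torch.randn(8, 128)))
```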