Exploring Temporal Granularity in Self-Supervised Video Representation
Learning
- URL: http://arxiv.org/abs/2112.04480v1
- Date: Wed, 8 Dec 2021 18:58:42 GMT
- Title: Exploring Temporal Granularity in Self-Supervised Video Representation
Learning
- Authors: Rui Qian, Yeqing Li, Liangzhe Yuan, Boqing Gong, Ting Liu, Matthew
Brown, Serge Belongie, Ming-Hsuan Yang, Hartwig Adam, Yin Cui
- Abstract summary: This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations.
The flexibility of TeG gives rise to state-of-the-art results on 8 video benchmarks, outperforming supervised pre-training in most cases.
- Score: 99.02421058335533
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work presents a self-supervised learning framework named TeG to explore
Temporal Granularity in learning video representations. In TeG, we sample a
long clip from a video and a short clip that lies inside the long clip. We then
extract their dense temporal embeddings. The training objective consists of two
parts: a fine-grained temporal learning objective to maximize the similarity
between corresponding temporal embeddings in the short clip and the long clip,
and a persistent temporal learning objective to pull together global embeddings
of the two clips. Our study reveals the impact of temporal granularity with
three major findings. 1) Different video tasks may require features of
different temporal granularities. 2) Intriguingly, some tasks that are widely
considered to require temporal awareness can actually be well addressed by
temporally persistent features. 3) The flexibility of TeG gives rise to
state-of-the-art results on 8 video benchmarks, outperforming supervised
pre-training in most cases.
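Purely as an illustration of the two objectives described in the abstract, here is a minimal PyTorch-style sketch: a fine-grained loss that matches each short-clip timestep to its corresponding long-clip timestep, and a persistent loss that pulls the two clips' pooled global embeddings together. The tensor shapes, temperature, mean pooling, and the `align_idx` alignment are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch (not the authors' implementation) of a TeG-style objective:
# a fine-grained loss on temporally aligned dense embeddings plus a
# persistent loss on clip-level global embeddings.
import torch
import torch.nn.functional as F

def teg_losses(long_dense, short_dense, align_idx, temperature=0.1):
    """
    long_dense:  (B, T_long, D) dense temporal embeddings of the long clip
    short_dense: (B, T_short, D) dense temporal embeddings of the short clip
    align_idx:   (T_short,) indices of the long-clip timesteps corresponding
                 to each short-clip timestep (assumed known from sampling)
    """
    long_dense = F.normalize(long_dense, dim=-1)
    short_dense = F.normalize(short_dense, dim=-1)

    # Fine-grained temporal loss: pull each short-clip embedding toward the
    # long-clip embedding at the corresponding timestep, contrasted against
    # the other timesteps of the long clip.
    B, T_short, D = short_dense.shape
    logits = torch.einsum('btd,bsd->bts', short_dense, long_dense) / temperature
    targets = align_idx.long().unsqueeze(0).expand(B, -1)    # (B, T_short)
    fine_loss = F.cross_entropy(logits.reshape(B * T_short, -1),
                                targets.reshape(-1))

    # Persistent temporal loss: pull together the global (temporally pooled)
    # embeddings of the two clips.
    g_long = F.normalize(long_dense.mean(dim=1), dim=-1)      # (B, D)
    g_short = F.normalize(short_dense.mean(dim=1), dim=-1)    # (B, D)
    persistent_loss = (1.0 - (g_long * g_short).sum(dim=-1)).mean()

    return fine_loss, persistent_loss
```

Returning the two terms separately mirrors the abstract's point that different tasks may call for different temporal granularities; how they are weighted is left to the caller.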
Related papers
- Temporal Reasoning Transfer from Text to Video [51.68487044397409]
Video Large Language Models (Video LLMs) struggle with tracking temporal changes and reasoning about temporal relationships.
We introduce the Textual Temporal reasoning Transfer (T3) to transfer temporal reasoning abilities from text to video domains.
The LongVA-7B model achieves competitive performance on comprehensive video benchmarks.
arXiv Detail & Related papers (2024-10-08T16:10:29Z)
- TimesURL: Self-supervised Contrastive Learning for Universal Time Series Representation Learning [31.458689807334228]
We propose a novel self-supervised framework named TimesURL to tackle universal time series representation learning.
Specifically, we first introduce a frequency-temporal-based augmentation to keep the temporal property unchanged.
We also construct double Universums as a special kind of hard negative to guide better contrastive learning.
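The "double Universums" mentioned above are a specific construction from the TimesURL paper; the sketch below is only a generic stand-in showing how mixup-style synthetic hard negatives can be appended to an InfoNCE-style contrastive loss. The mixing scheme, names, and shapes here are illustrative assumptions, not TimesURL's actual method.

```python
# Loose illustration (not TimesURL's actual method) of Universum-style hard
# negatives: mixtures of an anchor with other samples are treated as extra
# negatives in an InfoNCE-style contrastive loss.
import torch
import torch.nn.functional as F

def contrastive_with_mixed_negatives(anchor, positive, others, lam=0.5, tau=0.07):
    """
    anchor, positive: (B, D) embeddings of two views of the same series
    others:           (B, D) embeddings of different series in the batch
    lam:              mixing coefficient for the synthetic hard negatives
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    others = F.normalize(others, dim=-1)

    # Synthetic "universum" negatives: convex mixes of the anchor with other
    # samples, so they lie close to the anchor but are not positives.
    universum = F.normalize(lam * anchor + (1 - lam) * others, dim=-1)

    pos_logit = (anchor * positive).sum(-1, keepdim=True) / tau      # (B, 1)
    neg_logits = anchor @ others.t() / tau                           # (B, B)
    uni_logits = (anchor * universum).sum(-1, keepdim=True) / tau    # (B, 1)

    logits = torch.cat([pos_logit, neg_logits, uni_logits], dim=1)
    targets = torch.zeros(anchor.size(0), dtype=torch.long)          # positive at index 0
    return F.cross_entropy(logits, targets)
```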
arXiv Detail & Related papers (2023-12-25T12:23:26Z)
- No More Shortcuts: Realizing the Potential of Temporal Self-Supervision [69.59938105887538]
We propose a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks.
We demonstrate experimentally that our more challenging frame-level task formulations and the removal of shortcuts drastically improve the quality of features learned through temporal self-supervision.
arXiv Detail & Related papers (2023-12-20T13:20:31Z)
- TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition [68.53072549422775]
We propose a student-teacher semi-supervised learning framework, TimeBalance.
We distill the knowledge from a temporally-invariant and a temporally-distinctive teacher.
Our method achieves state-of-the-art performance on three action recognition benchmarks.
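As a loose sketch of distilling a student from the two kinds of teachers named above (this is not the TimeBalance implementation, and the fixed weighting below is a placeholder), the student's softened predictions can be matched against each teacher's soft targets with a weighted KL divergence.

```python
# Rough sketch (not the TimeBalance implementation) of distilling a student
# from two teachers by mixing their softened predictions.
import torch
import torch.nn.functional as F

def dual_teacher_distillation(student_logits, inv_teacher_logits,
                              dis_teacher_logits, w_inv=0.5, tau=2.0):
    """
    student_logits, *_teacher_logits: (B, C) classification logits.
    w_inv: fixed weight on the temporally-invariant teacher (a placeholder;
           how the two teachers are actually balanced is not modeled here).
    """
    student_logp = F.log_softmax(student_logits / tau, dim=-1)
    inv_p = F.softmax(inv_teacher_logits / tau, dim=-1)
    dis_p = F.softmax(dis_teacher_logits / tau, dim=-1)

    kl_inv = F.kl_div(student_logp, inv_p, reduction='batchmean')
    kl_dis = F.kl_div(student_logp, dis_p, reduction='batchmean')
    return w_inv * kl_inv + (1.0 - w_inv) * kl_dis
```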
arXiv Detail & Related papers (2023-03-28T19:28:54Z)
- Generating Long Videos of Dynamic Scenes [66.56925105992472]
We present a video generation model that reproduces object motion, changes in camera viewpoint, and new content that arises over time.
A common failure case is content that never changes, due to over-reliance on inductive biases to provide temporal consistency.
arXiv Detail & Related papers (2022-06-07T16:29:51Z)
- Controllable Augmentations for Video Representation Learning [34.79719112810065]
We propose a framework that jointly utilizes local clips and global videos to learn from detailed region-level correspondence as well as general long-term temporal relations.
Our framework achieves superior results on three video benchmarks for action recognition and video retrieval, capturing more accurate temporal dynamics.
arXiv Detail & Related papers (2022-03-30T19:34:32Z)
- Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is the task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
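The three pretext tasks are only named here; as a purely generic illustration of the last one, a pseudo-boundary prediction head can be a binary classifier over shot-level features on either side of a candidate position. The window size, dimensions, and architecture below are assumptions, not the paper's design.

```python
# Generic illustration (not the paper's architecture) of a pseudo-boundary
# prediction head: given shot-level features around a candidate position,
# predict whether a scene boundary lies at the center.
import torch
import torch.nn as nn

class PseudoBoundaryHead(nn.Module):
    def __init__(self, feat_dim=512, window=4):
        super().__init__()
        # Concatenate the features of `window` shots before and after the
        # candidate boundary and classify boundary / no-boundary.
        self.classifier = nn.Sequential(
            nn.Linear(2 * window * feat_dim, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, 2),
        )

    def forward(self, left_shots, right_shots):
        # left_shots, right_shots: (B, window, feat_dim)
        x = torch.cat([left_shots, right_shots], dim=1).flatten(1)
        return self.classifier(x)  # (B, 2) boundary logits
```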
arXiv Detail & Related papers (2022-01-14T02:14:07Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
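As a very loose sketch of the idea, not the paper's collaborative memory design, features from several clips sampled from the same video can be pooled into a shared video-level representation that is trained with a video-level loss; the backbone interface, pooling, and shapes below are assumptions.

```python
# Very loose sketch (not the paper's collaborative memory design): features
# from multiple sampled clips of one video are pooled into a shared
# video-level representation used for classification.
import torch
import torch.nn as nn

class ClipPoolingClassifier(nn.Module):
    def __init__(self, backbone, feat_dim=2048, num_classes=400):
        super().__init__()
        self.backbone = backbone          # maps a clip to a (feat_dim,) feature
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, clips):
        # clips: (B, N, C, T, H, W) -- N clips sampled from each video
        B, N = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))   # (B * N, feat_dim)
        feats = feats.view(B, N, -1)
        video_feat = feats.mean(dim=1)               # pooled across the N clips
        return self.fc(video_feat)                   # video-level logits
```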
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.