Related papers: Learning from Streaming Video with Orthogonal Gradients

Learning from Streaming Video with Orthogonal Gradients

URL: http://arxiv.org/abs/2504.01961v1
Date: Wed, 02 Apr 2025 17:59:57 GMT
Title: Learning from Streaming Video with Orthogonal Gradients
Authors: Tengda Han, Dilara Gokay, Joseph Heyward, Chuhan Zhang, Daniel Zoran, Viorica Pătrăucean, João Carreira, Dima Damen, Andrew Zisserman,
Abstract summary: We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner.<n>This differs from the standard approaches to video learning where videos are chopped and shuffled during training in order to create a non-redundant batch.<n>We demonstrate the drop in performance when moving from shuffled to sequential learning on three tasks.
Score: 62.51504086522027
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner. This differs from the standard approaches to video learning where videos are chopped and shuffled during training in order to create a non-redundant batch that satisfies the independently and identically distributed (IID) sample assumption expected by conventional training paradigms. When videos are only available as a continuous stream of input, the IID assumption is evidently broken, leading to poor performance. We demonstrate the drop in performance when moving from shuffled to sequential learning on three tasks: the one-video representation learning method DoRA, standard VideoMAE on multi-video datasets, and the task of future video prediction. To address this drop, we propose a geometric modification to standard optimizers, to decorrelate batches by utilising orthogonal gradients during training. The proposed modification can be applied to any optimizer -- we demonstrate it with Stochastic Gradient Descent (SGD) and AdamW. Our proposed orthogonal optimizer allows models trained from streaming videos to alleviate the drop in representation learning performance, as evaluated on downstream tasks. On three scenarios (DoRA, VideoMAE, future prediction), we show our orthogonal optimizer outperforms the strong AdamW in all three scenarios.

Related papers

Improving Video Generation with Human Feedback [81.48120703718774]
Video generation has achieved significant advances, but issues like unsmooth motion and misalignment between videos and prompts persist. We develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. We introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy.
arXiv Detail & Related papers (2025-01-23T18:55:41Z)
VideoSAVi: Self-Aligned Video Language Models without Human Supervision [0.6854849895338531]
VideoSAVi is a self-training pipeline that enables Video-LLMs to reason over video content without external supervision.<n>VideoSAVi achieves state-of-the-art performance on MVBench (74.0%) and delivers significant improvements.<n>Our model-agnostic approach is computationally efficient, requiring only 32 frames.
arXiv Detail & Related papers (2024-12-01T00:33:05Z)
Learning from One Continuous Video Stream [70.30084026960819]
We introduce a framework for online learning from a single continuous video stream. This poses great challenges given the high correlation between consecutive video frames. We employ pixel-to-pixel modelling as a practical and flexible way to switch between pre-training and single-stream evaluation.
arXiv Detail & Related papers (2023-12-01T14:03:30Z)
Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video. Our motivation comes from that the temporal boundary of the query-guided activity should be consistently predicted. In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
Deep Video Prior for Video Consistency and Propagation [58.250209011891904]
We present a novel and general approach for blind video temporal consistency. Our method is only trained on a pair of original and processed videos directly instead of a large dataset. We show that temporal consistency can be achieved by training a convolutional neural network on a video with Deep Video Prior.
arXiv Detail & Related papers (2022-01-27T16:38:52Z)
Dense Unsupervised Learning for Video Segmentation [49.46930315961636]
We present a novel approach to unsupervised learning for video object segmentation (VOS) Unlike previous work, our formulation allows to learn dense feature representations directly in a fully convolutional regime. Our approach exceeds the segmentation accuracy of previous work despite using significantly less training data and compute power.
arXiv Detail & Related papers (2021-11-11T15:15:11Z)
Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets. Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning. A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives. We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations. We show that our method encodes valuable information about specified spatial or temporal augmentation, and in doing so also achieve state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
Self-Supervised Learning via multi-Transformation Classification for Action Recognition [10.676377556393527]
We introduce a self-supervised video representation learning method based on the multi-transformation classification to efficiently classify human actions. The representation of the video is learned in a self-supervised manner by classifying seven different transformations. We have conducted the experiments on UCF101 and HMDB51 datasets together with C3D and 3D Resnet-18 as backbone networks.
arXiv Detail & Related papers (2021-02-20T16:11:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.