Self-supervised video pretraining yields human-aligned visual
representations
- URL: http://arxiv.org/abs/2210.06433v2
- Date: Tue, 25 Jul 2023 16:43:33 GMT
- Title: Self-supervised video pretraining yields human-aligned visual
representations
- Authors: Nikhil Parthasarathy, S. M. Ali Eslami, João Carreira, Olivier J. Hénaff
- Abstract summary: VITO yields general representations that far outperform prior video pretraining methods on image understanding tasks.
VITO representations are significantly more robust to natural and synthetic deformations than image-, video-, and adversarially-trained ones.
These results suggest that video pretraining could be a simple way of learning unified, robust, and human-aligned representations of the visual world.
- Score: 10.406358397515838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans learn powerful representations of objects and scenes by observing how
they evolve over time. Yet, outside of specific tasks that require explicit
temporal understanding, static image pretraining remains the dominant paradigm
for learning visual foundation models. We question this mismatch, and ask
whether video pretraining can yield visual representations that bear the
hallmarks of human perception: generalisation across tasks, robustness to
perturbations, and consistency with human judgements. To that end we propose a
novel procedure for curating videos, and develop a contrastive framework which
learns from the complex transformations therein. This simple paradigm for
distilling knowledge from videos, called VITO, yields general representations
that far outperform prior video pretraining methods on image understanding
tasks, and image pretraining methods on video understanding tasks. Moreover,
VITO representations are significantly more robust to natural and synthetic
deformations than image-, video-, and adversarially-trained ones. Finally,
VITO's predictions are strongly aligned with human judgements, surpassing
models that were specifically trained for that purpose. Together, these results
suggest that video pretraining could be a simple way of learning unified,
robust, and human-aligned representations of the visual world.
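To make the pretraining recipe concrete, below is a minimal PyTorch-style sketch of contrastive learning across video frames, the general family of methods the abstract describes. The backbone, projection head, and hyperparameters are illustrative assumptions, not VITO's published recipe (which also relies on a dedicated video-curation procedure).

```python
# Minimal sketch of contrastive pretraining across video frames (InfoNCE).
# The backbone, projection head, batch size, and temperature below are
# placeholder assumptions for illustration, not VITO's actual configuration.
import torch
import torch.nn.functional as F
from torchvision import models

encoder = models.resnet50(weights=None)
encoder.fc = torch.nn.Identity()            # drop the classifier; keep 2048-d features
projector = torch.nn.Sequential(            # small projection head (assumed)
    torch.nn.Linear(2048, 512), torch.nn.ReLU(), torch.nn.Linear(512, 128)
)

def info_nce(z1, z2, temperature=0.1):
    """Frames from the same video are positives; other videos in the batch are negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))      # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# frames_a / frames_b stand in for two temporally separated, augmented frames
# drawn from each video in a batch, shape (B, 3, 224, 224).
frames_a = torch.randn(8, 3, 224, 224)
frames_b = torch.randn(8, 3, 224, 224)
loss = info_nce(projector(encoder(frames_a)), projector(encoder(frames_b)))
loss.backward()
```

The temporal gap between the two frames supplies the complex, naturally occurring transformations (viewpoint changes, deformations, occlusions) that the abstract credits for the robustness and human alignment of the learned representations.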
Related papers
- An Empirical Study of Autoregressive Pre-training from Videos [67.15356613065542]
We treat videos as visual tokens and train transformer models to autoregressively predict future tokens (a minimal illustrative sketch of this setup appears after the related-papers list).
Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens.
Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance.
arXiv Detail & Related papers (2025-01-09T18:59:58Z)
- Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations [19.45821593625599]
Video diffusion models (VDMs) have demonstrated the capability to accurately predict future image sequences.
We propose the Video Prediction Policy (VPP), a generalist robotic policy conditioned on the predictive visual representations from VDMs.
VPP consistently outperforms existing methods across two simulated and two real-world benchmarks.
arXiv Detail & Related papers (2024-12-19T12:48:40Z)
- Pre-trained Visual Dynamics Representations for Efficient Policy Learning [33.62440075940917]
We propose Pre-trained Visual Dynamics Representations (PVDR) to bridge the domain gap between videos and downstream tasks for efficient policy learning.
The pre-trained visual dynamics representations capture prior knowledge of visual dynamics from the videos.
This abstract prior knowledge can be readily adapted to downstream tasks and aligned with executable actions through online adaptation.
arXiv Detail & Related papers (2024-11-05T15:18:02Z)
- MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild [32.6521941706907]
We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos.
We first define a layered neural representation for the entire scene, composited by individual human and background models.
We learn the layered neural representation from videos via our layer-wise differentiable volume rendering.
arXiv Detail & Related papers (2024-06-03T17:59:57Z)
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks.
Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
arXiv Detail & Related papers (2023-08-16T15:00:50Z)
- GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is utilized to help models such as CNNs and ViTs learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
arXiv Detail & Related papers (2023-06-01T14:02:45Z)
- PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is a self-supervised learning (SSL) framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z)
- Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions [81.88294320397826]
A system does not know what human-object interactions are present in a video, nor the actual locations of the humans and objects.
We introduce a dataset comprising over 6.5k videos with human-object interaction that have been curated from sentence captions.
We demonstrate improved performance over weakly supervised baselines adapted to our annotations on our video dataset.
arXiv Detail & Related papers (2021-10-07T15:30:18Z)
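As referenced in the first related paper above ("An Empirical Study of Autoregressive Pre-training from Videos"), the sketch below illustrates autoregressive next-token prediction over discrete video tokens. The tokeniser, vocabulary size, and model dimensions are hypothetical placeholders, not that paper's implementation.

```python
# Hypothetical sketch of autoregressive pretraining on video tokens: frames are
# assumed to be quantised into discrete tokens, and a causal transformer is
# trained to predict each next token. All sizes below are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, CONTEXT = 1024, 256, 512        # assumed vocabulary and context length

class CausalVideoLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.pos = nn.Embedding(CONTEXT, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):              # tokens: (B, T) discrete video tokens
        T = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))   # (B, T, VOCAB) next-token logits

model = CausalVideoLM()
tokens = torch.randint(0, VOCAB, (2, 128))  # e.g. quantised patches from a short clip
logits = model(tokens[:, :-1])              # predict token t from tokens before t
loss = F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
loss.backward()
```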