Self-supervised video pretraining yields human-aligned visual
representations
- URL: http://arxiv.org/abs/2210.06433v2
- Date: Tue, 25 Jul 2023 16:43:33 GMT
- Title: Self-supervised video pretraining yields human-aligned visual
representations
- Authors: Nikhil Parthasarathy, S. M. Ali Eslami, João Carreira, Olivier J. Hénaff
- Abstract summary: VITO yields general representations that far outperform prior video pretraining methods on image understanding tasks.
VITO representations are significantly more robust to natural and synthetic deformations than image-, video-, and adversarially-trained ones.
These results suggest that video pretraining could be a simple way of learning unified, robust, and human-aligned representations of the visual world.
- Score: 10.406358397515838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans learn powerful representations of objects and scenes by observing how
they evolve over time. Yet, outside of specific tasks that require explicit
temporal understanding, static image pretraining remains the dominant paradigm
for learning visual foundation models. We question this mismatch, and ask
whether video pretraining can yield visual representations that bear the
hallmarks of human perception: generalisation across tasks, robustness to
perturbations, and consistency with human judgements. To that end we propose a
novel procedure for curating videos, and develop a contrastive framework which
learns from the complex transformations therein. This simple paradigm for
distilling knowledge from videos, called VITO, yields general representations
that far outperform prior video pretraining methods on image understanding
tasks, and image pretraining methods on video understanding tasks. Moreover,
VITO representations are significantly more robust to natural and synthetic
deformations than image-, video-, and adversarially-trained ones. Finally,
VITO's predictions are strongly aligned with human judgements, surpassing
models that were specifically trained for that purpose. Together, these results
suggest that video pretraining could be a simple way of learning unified,
robust, and human-aligned representations of the visual world.
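The abstract describes VITO only at a high level: a procedure for curating videos plus a contrastive framework that learns from the transformations occurring within them. As a rough illustration of that idea (not the authors' implementation), the following sketch assumes a standard InfoNCE objective applied to projections of two augmented crops taken from temporally separated frames of the same clip; the function name, temperature, and training setup are hypothetical.

```python
# Minimal sketch (assumptions, not the paper's code): a standard InfoNCE
# loss on embeddings of two crops drawn from different frames of the same
# video clip, so temporal change acts as a "natural augmentation".
import torch
import torch.nn.functional as F

def video_contrastive_loss(z_a: torch.Tensor,
                           z_b: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """z_a, z_b: [batch, dim] projections of two frames from the same clips.
    Matching rows are positives; all other rows in the batch are negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # pairwise similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetrised cross-entropy: each frame should retrieve its clip-mate.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage: encode two temporally separated, augmented crops per clip with a
# shared image encoder and projection head, then minimise this loss:
# z_a = proj(encoder(crops_t0)); z_b = proj(encoder(crops_t1))
```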
Related papers
- Pre-trained Visual Dynamics Representations for Efficient Policy Learning [33.62440075940917]
We propose Pre-trained Visual Dynamics Representations (PVDR) to bridge the domain gap between videos and downstream tasks for efficient policy learning.
The pre-trained visual dynamics representations capture the visual dynamics prior knowledge in the videos.
This abstract prior knowledge can be readily adapted to downstream tasks and aligned with executable actions through online adaptation.
arXiv Detail & Related papers (2024-11-05T15:18:02Z)
- MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild [32.6521941706907]
We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos.
We first define a layered neural representation for the entire scene, composited by individual human and background models.
We learn the layered neural representation from videos via our layer-wise differentiable volume rendering.
arXiv Detail & Related papers (2024-06-03T17:59:57Z)
- Self-Explainable Affordance Learning with Embodied Caption [63.88435741872204]
We introduce Self-Explainable Affordance learning (SEA) with embodied caption.
SEA enables robots to articulate their intentions and bridge the gap between explainable vision-language caption and visual affordance learning.
We propose a novel model to effectively combine affordance grounding with self-explanation in a simple but efficient manner.
arXiv Detail & Related papers (2024-04-08T15:22:38Z)
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks.
Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
arXiv Detail & Related papers (2023-08-16T15:00:50Z)
- How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
arXiv Detail & Related papers (2022-10-18T17:58:25Z)
- PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is a self-supervised learning (SSL) framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS; a sketch of the MoCo momentum update appears after this list.
arXiv Detail & Related papers (2021-12-01T19:49:57Z)
- Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions [81.88294320397826]
A system does not know what human-object interactions are present in a video, or the actual location of the human and the object.
We introduce a dataset comprising over 6.5k videos with human-object interaction that have been curated from sentence captions.
We demonstrate improved performance over weakly supervised baselines adapted to our annotations on our video dataset.
arXiv Detail & Related papers (2021-10-07T15:30:18Z)
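The PreViTS entry above names momentum contrast (MoCo) as its pretraining objective. MoCo's defining ingredient is a key encoder that receives no gradients and instead tracks the query encoder through an exponential moving average. A minimal, hypothetical sketch of that update is shown below; the encoder objects, momentum value, and function name are illustrative, not PreViTS's code.

```python
# Minimal sketch of the MoCo-style momentum update (illustrative only).
# The key encoder receives no gradients; its parameters track the query
# encoder as an exponential moving average after every optimiser step.
import torch

@torch.no_grad()
def momentum_update(query_encoder: torch.nn.Module,
                    key_encoder: torch.nn.Module,
                    m: float = 0.999) -> None:
    for q_param, k_param in zip(query_encoder.parameters(),
                                key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)

# Setup: initialise the key encoder as a detached copy of the query encoder
# (e.g. copy.deepcopy) and set requires_grad = False on its parameters.
# The InfoNCE loss is then computed between query embeddings and a queue of
# key embeddings produced by this slowly moving encoder.
```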
This list is automatically generated from the titles and abstracts of the papers on this site.