Related papers: Unique Lives, Shared World: Learning from Single-Life Videos

URL: http://arxiv.org/abs/2512.04085v1
Date: Wed, 03 Dec 2025 18:59:57 GMT
Title: Unique Lives, Shared World: Learning from Single-Life Videos
Authors: Tengda Han, Sayna Ebrahimi, Dilara Gokay, Li Yang Ku, Maks Ovsjanikov, Iva Babukova, Daniel Zoran, Viorica Patraucean, Joao Carreira, Andrew Zisserman, Dima Damen
Abstract summary: We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. We show that models trained independently on different lives develop a highly aligned geometric understanding. We also demonstrate that training on up to 30 hours from one week of the same person's life leads to comparable performance to training on 30 hours of diverse web data.
Score: 77.78726253186024
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce the "single-life" learning paradigm, where we train a distinct vision model exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets each capturing a different life, both indoors and outdoors, and by introducing a novel cross-attention-based metric to quantify the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that effectively transfer to downstream tasks, such as depth estimation, in unseen environments. Third, we demonstrate that training on up to 30 hours from one week of the same person's life leads to comparable performance to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning. Overall, our results establish that the shared structure of the world both leads to consistency in models trained on individual lives and provides a powerful signal for visual representation learning.

Related papers

Simulated Cortical Magnification Supports Self-Supervised Object Learning [8.07351541700131]
Recent self-supervised learning models simulate the development of semantic object representations by training on visual experience similar to that of toddlers. Here, we investigate the role of this varying resolution in the development of object representations.
arXiv Detail & Related papers (2025-09-19T08:28:06Z)
A Unified and Scalable Membership Inference Method for Visual Self-supervised Encoder via Part-aware Capability [10.79834380458689]
Self-supervised learning confronts significant privacy concerns, especially in vision. In this paper, we perform membership inference on visual self-supervised models in a more realistic setting. We propose a unified membership inference method called PartCrop.
arXiv Detail & Related papers (2025-05-15T14:43:34Z)
Evaluating Multiview Object Consistency in Humans and Image Models [68.36073530804296]
We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape. We collect 35K trials of behavioral data from over 500 participants. We then evaluate the performance of common vision models.
arXiv Detail & Related papers (2024-09-09T17:59:13Z)
View-Invariant Policy Learning via Zero-Shot Novel View Synthesis [26.231630397802785]
We investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint. We study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints. For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments.
arXiv Detail & Related papers (2024-09-05T16:39:21Z)
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition [6.995226697189459]
We employ a multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data. Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks. We release our pre-trained models as well as source code publicly.
arXiv Detail & Related papers (2024-04-16T20:51:36Z)
Improving Video Violence Recognition with Human Interaction Learning on 3D Skeleton Point Clouds [88.87985219999764]
We develop a method for video violence recognition from a new perspective of skeleton points. We first formulate 3D skeleton point clouds from human sequences extracted from videos. We then perform interaction learning on these 3D skeleton point clouds.
arXiv Detail & Related papers (2023-08-26T12:55:18Z)
A Computational Account Of Self-Supervised Visual Learning From Egocentric Object Play [3.486683381782259]
We study how learning signals that equate different viewpoints can support robust visual learning. We find that representations learned by equating different physical viewpoints of an object benefit downstream image classification accuracy.
arXiv Detail & Related papers (2023-05-30T22:42:03Z)
Visualizing and Understanding Contrastive Learning [22.553990823550784]
We design visual explanation methods that contribute towards understanding similarity learning tasks from pairs of images. We also adapt existing metrics, used to evaluate visual explanations of image classification systems, to suit pairs of explanations.
arXiv Detail & Related papers (2022-06-20T13:01:46Z)
Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds [96.9027094562957]
We introduce a spatio-temporal representation learning (STRL) framework, capable of learning from unlabeled tasks. Inspired by how infants learn from visual data in the wild, we explore rich cues derived from the 3D data. STRL takes two temporally-related frames from a 3D point cloud sequence as the input, transforms them with spatial data augmentation, and learns the invariant representation in a self-supervised manner.
arXiv Detail & Related papers (2021-09-01T04:17:11Z)
What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations. Our experiments show that our "muscly-supervised" representation outperforms a visual-only state-of-the-art method MoCo.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)
