Scaling 4D Representations
- URL: http://arxiv.org/abs/2412.15212v1
- Date: Thu, 19 Dec 2024 18:59:51 GMT
- Title: Scaling 4D Representations
- Authors: João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, Goker Erdogan, Yana Hasson, Yi Yang, Klaus Greff, Guillaume Le Moing, Sjoerd van Steenkiste, Daniel Zoran, Drew A. Hudson, Pedro Vélez, Luisa Polanía, Luke Friedman, Chris Duvarney, Ross Goroshin, Kelsey Allen, Jacob Walker, Rishabh Kabra, Eric Aboussouan, Jennifer Sun, Thomas Kipf, Carl Doersch, Viorica Pătrăucean, Dima Damen, Pauline Luc, Mehdi S. M. Sajjadi, Andrew Zisserman
- Abstract summary: Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video.
In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks.
- Score: 77.85462796134455
- License:
- Abstract: Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused its evaluations on semantics-related tasks such as action classification and ImageNet classification. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks as model size increases from 20M parameters all the way to 22B, by far the largest self-supervised video model reported to date. Rigorous apples-to-apples comparisons with many recent image and video models demonstrate the benefits of scaling 4D representations.
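To make the pretraining recipe concrete, here is a minimal sketch of MAE-style random masking over spatio-temporal video patches. The function name, the patch grid in the example, and the 90% mask ratio are illustrative assumptions, not the authors' code.

```python
import numpy as np

def random_patch_mask(num_patches: int, mask_ratio: float = 0.9,
                      seed: int | None = None) -> np.ndarray:
    """Boolean mask over spacetime patches (True = hidden from the encoder).

    MAE-style pretraining hides a large random fraction of patches and
    trains the model to reconstruct the hidden ones from the rest.
    """
    rng = np.random.default_rng(seed)
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

# Example: a 16-frame 224x224 clip tokenized into 2x16x16 tubelets yields
# (16 // 2) * (224 // 16) ** 2 = 1568 spacetime patches.
mask = random_patch_mask(1568, mask_ratio=0.9, seed=0)
print(f"{mask.sum()} of {mask.size} patches hidden")
```

Because the encoder processes only the visible patches, this recipe is cheap enough per clip to be pushed to very large models.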
Related papers
- Can Generative Video Models Help Pose Estimation? [42.10672365565019]
Pairwise pose estimation from images with little or no overlap is an open challenge in computer vision.
Inspired by the human ability to infer spatial relationships from diverse scenes, we propose a novel approach, InterPose.
We propose a video model to hallucinate intermediate frames between two input images, effectively creating a dense visual transition.
arXiv Detail & Related papers (2024-12-20T18:58:24Z)
- Controlling Space and Time with Diffusion Models [34.7002868116714]
We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS).
We advocate joint training on 3D (with camera pose), 4D (pose+time) and video (time but no pose) data.
4DiM is also used for improved panorama stitching, pose-conditioned video-to-video translation, and several other tasks.
arXiv Detail & Related papers (2024-07-10T17:23:33Z)
- Segment Any 4D Gaussians [69.53172192552508]
We propose Segment Any 4D Gaussians (SA4D) to segment anything in the 4D digital world based on 4D Gaussians.
SA4D achieves precise, high-quality segmentation within seconds in 4D Gaussians and can remove, recolor, compose, and render high-quality masks of anything.
arXiv Detail & Related papers (2024-07-05T13:44:15Z)
- Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos [15.532504015622159]
Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics.
We tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos.
arXiv Detail & Related papers (2024-07-05T09:43:05Z)
- Probing the 3D Awareness of Visual Foundation Models [56.68380136809413]
We analyze the 3D awareness of visual foundation models.
We conduct experiments using task-specific probes and zero-shot inference procedures on frozen features; a minimal linear-probe sketch follows this entry.
arXiv Detail & Related papers (2024-04-12T17:58:04Z)
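As a concrete illustration of probing frozen features, the sketch below fits a linear read-out on fixed embeddings. The random stand-in features, their shapes, and the scikit-learn read-out are assumptions for illustration, not the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for embeddings from a frozen backbone (assumed shapes):
# 1000 samples with 768-dim features, plus labels for a downstream task.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 768)).astype(np.float32)
labels = rng.integers(0, 10, size=1000)

# The backbone stays frozen; only this linear read-out is trained, so the
# held-out accuracy reflects what is linearly decodable from the features.
probe = LogisticRegression(max_iter=1000).fit(features[:800], labels[:800])
print("probe accuracy:", probe.score(features[800:], labels[800:]))
```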
- MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning [38.6654451726187]
MIM4D is a novel pre-training paradigm based on dual masked image modeling (MIM).
It constructs pseudo-3D features using continuous scene flow and projects them onto the 2D plane for supervision; a minimal projection sketch follows this entry.
It achieves state-of-the-art performance on the nuScenes dataset for visual representation learning in autonomous driving.
arXiv Detail & Related papers (2024-03-13T17:58:00Z)
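The projection of pseudo-3D features onto the 2D plane can be illustrated with a standard pinhole camera model; the intrinsics and point values below are made-up assumptions, not MIM4D's actual pipeline.

```python
import numpy as np

def project_to_image(points_3d: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Project Nx3 camera-frame points to Nx2 pixel coordinates.

    Standard pinhole projection: u = fx * x / z + cx, v = fy * y / z + cy.
    """
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    u = K[0, 0] * x / z + K[0, 2]
    v = K[1, 1] * y / z + K[1, 2]
    return np.stack([u, v], axis=1)

# Illustrative intrinsics: 1000 px focal length, principal point at the
# center of a 224x224 image.
K = np.array([[1000.0, 0.0, 112.0],
              [0.0, 1000.0, 112.0],
              [0.0, 0.0, 1.0]])
points = np.array([[0.1, -0.2, 5.0], [0.0, 0.0, 2.0]])
print(project_to_image(points, K))
```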
- DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos [76.01906393673897]
We propose a self-supervised method to jointly learn 3D motion and depth from monocular videos.
Our system contains a depth estimation module and a new decomposed object-wise 3D motion (DO3D) estimation module that predicts ego-motion and 3D object motion.
Our model delivers superior performance in all evaluated settings.
arXiv Detail & Related papers (2024-03-09T12:22:46Z)
- Efficient4D: Fast Dynamic 3D Object Generation from a Single-view Video [42.10482273572879]
We propose an efficient video-to-4D object generation framework called Efficient4D.
It generates high-quality spacetime-consistent images under different camera views, and then uses them as labeled data.
Experiments on both synthetic and real videos show that Efficient4D offers a remarkable 10-fold increase in speed.
arXiv Detail & Related papers (2024-01-16T18:58:36Z)
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting; a minimal voting sketch follows this entry.
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
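A majority-vote label fusion can be sketched in a few lines; the per-point prediction arrays below are hypothetical stand-ins for 2D masks from different vision models already projected onto the point cloud.

```python
import numpy as np

def fuse_labels_by_voting(predictions: np.ndarray) -> np.ndarray:
    """Fuse per-point class predictions from several models by majority vote.

    predictions: (num_models, num_points) integer class ids.
    Returns (num_points,) fused pseudo labels (ties go to the lowest id).
    """
    num_classes = predictions.max() + 1
    # Count votes per class for every point, then take the argmax.
    votes = np.apply_along_axis(
        lambda col: np.bincount(col, minlength=num_classes), 0, predictions)
    return votes.argmax(axis=0)

# Three hypothetical models labeling five 3D points:
preds = np.array([[0, 1, 2, 2, 1],
                  [0, 1, 1, 2, 1],
                  [1, 1, 2, 0, 1]])
print(fuse_labels_by_voting(preds))  # -> [0 1 2 2 1]
```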
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-aware video model, improving performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)