Tracking by Predicting 3-D Gaussians Over Time
- URL: http://arxiv.org/abs/2512.22489v2
- Date: Tue, 30 Dec 2025 05:53:13 GMT
- Title: Tracking by Predicting 3-D Gaussians Over Time
- Authors: Tanish Baranwal, Himanshu Gaurav Singh, Jathushan Rajasegaran, Jitendra Malik
- Abstract summary: Video-GMAE encodes a sequence of images into a set of Gaussian splats moving over time. We find that tracking emerges when pretraining a network with this architecture. With small-scale finetuning, our models achieve a 34.6% improvement on Kinetics and 13.1% on Kubric.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pretraining a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to state-of-the-art. With small-scale finetuning, our models achieve 34.6% improvement on Kinetics, and 13.1% on Kubric datasets, surpassing existing self-supervised video approaches. The project page and code are publicly available at https://videogmae.org/ and https://github.com/tekotan/video-gmae.
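The zero-shot tracking result in the abstract comes from mapping the trajectories of the learnt 3-D Gaussians onto the image plane. A minimal sketch of that projection step, assuming a standard pinhole camera and Gaussian centres expressed in camera coordinates (the function name, intrinsics, and array shapes here are illustrative, not the paper's actual code):

```python
import numpy as np

def project_gaussian_tracks(means_3d, fx, fy, cx, cy):
    """Project per-frame 3-D Gaussian centres onto the image plane.

    means_3d: (T, N, 3) array of Gaussian centres over T frames,
              in camera coordinates with z > 0.
    Returns a (T, N, 2) array of pixel trajectories, one 2-D track
    per Gaussian.
    """
    x, y, z = means_3d[..., 0], means_3d[..., 1], means_3d[..., 2]
    u = fx * x / z + cx  # standard pinhole projection
    v = fy * y / z + cy
    return np.stack([u, v], axis=-1)

# Toy example: two Gaussians drifting over three frames.
tracks_3d = np.array([
    [[0.0, 0.0, 2.0], [0.5, 0.0, 2.0]],
    [[0.1, 0.0, 2.0], [0.5, 0.1, 2.0]],
    [[0.2, 0.0, 2.0], [0.5, 0.2, 2.0]],
])
tracks_2d = project_gaussian_tracks(tracks_3d, fx=100, fy=100, cx=64, cy=64)
```

Each row of `tracks_2d` is then directly comparable to 2-D point-tracking ground truth, which is how a 3-D representation can be evaluated against image-plane trackers without any tracking supervision.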
Related papers
- VTGaussian-SLAM: RGBD SLAM for Large Scale Scenes with Splatting View-Tied 3D Gaussians [27.62796825514193]
State-of-the-art methods employ 3D Gaussians to represent a scene, and render these Gaussians through splatting for higher efficiency and better rendering. These methods cannot scale up to extremely large scenes, due to inefficient tracking and mapping strategies. To resolve this issue, we propose novel tracking and mapping strategies that work with a novel 3D representation, dubbed view-tied 3D Gaussians.
arXiv Detail & Related papers (2025-06-03T10:59:19Z) - SHeaP: Self-Supervised Head Geometry Predictor Learned via 2D Gaussians [63.38733349131787]
3D reconstruction of human heads from monocular images and videos underlies numerous visual applications. Previous methods have sought to learn from abundant 2D videos in a self-supervised manner. We propose SHeaP (Self-supervised Head Geometry Predictor Learned via 2D Gaussians).
arXiv Detail & Related papers (2025-04-16T17:55:02Z) - GaussRender: Learning 3D Occupancy with Gaussian Rendering [86.89653628311565]
GaussRender is a module that improves 3D occupancy learning by enforcing projective consistency. Our method penalizes 3D configurations that produce inconsistent 2D projections, thereby enforcing a more coherent 3D structure.
arXiv Detail & Related papers (2025-02-07T16:07:51Z) - GSVC: Efficient Video Representation and Compression Through 2D Gaussian Splatting [3.479384894190067]
We propose GSVC, an approach to learning a set of 2D Gaussian splats that can effectively represent and compress video frames. Experimental results show that GSVC achieves good rate-distortion trade-offs, comparable to state-of-the-art video codecs.
arXiv Detail & Related papers (2025-01-21T11:30:51Z) - GaussianAD: Gaussian-Centric End-to-End Autonomous Driving [23.71316979650116]
Vision-based autonomous driving shows great potential due to its satisfactory performance and low costs. Most existing methods adopt dense representations (e.g., bird's eye view) or sparse representations (e.g., instance boxes) for decision-making. This paper explores a Gaussian-centric end-to-end autonomous driving framework and exploits 3D semantic Gaussians to extensively yet sparsely describe the scene.
arXiv Detail & Related papers (2024-12-13T18:59:30Z) - Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos [58.22272760132996]
We show that existing 4D Gaussian methods dramatically fail in this setup because the monocular setting is underconstrained.
We propose Dynamic Gaussian Marbles, which consist of three core modifications that target the difficulties of the monocular setting.
We evaluate on the Nvidia Dynamic Scenes dataset and the DyCheck iPhone dataset, and show that Gaussian Marbles significantly outperforms other Gaussian baselines in quality.
arXiv Detail & Related papers (2024-06-26T19:37:07Z) - Splatter a Video: Video Gaussian Representation for Versatile Processing [48.9887736125712]
Video representation is crucial for various downstream tasks, such as tracking, depth prediction, segmentation, view synthesis, and editing.
We introduce a novel explicit 3D representation, the video Gaussian representation, which embeds a video into 3D Gaussians.
It has been proven effective in numerous video processing tasks, including tracking, consistent video depth and feature refinement, motion and appearance editing, and stereoscopic video generation.
arXiv Detail & Related papers (2024-06-19T22:20:03Z) - Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting [94.84688557937123]
Video-3DGS is a 3D Gaussian Splatting (3DGS)-based video refiner designed to enhance temporal consistency in zero-shot video editors. Our approach utilizes a two-stage 3D Gaussian optimizing process tailored for editing dynamic monocular videos. It enhances video editing by ensuring temporal consistency across 58 dynamic monocular videos.
arXiv Detail & Related papers (2024-06-04T17:57:37Z) - LoopGaussian: Creating 3D Cinemagraph with Multi-view Images via Eulerian Motion Field [13.815932949774858]
Cinemagraph is a form of visual media that combines elements of still photography and subtle motion to create a captivating experience.
We propose LoopGaussian to elevate cinemagraph from 2D image space to 3D space using 3D Gaussian modeling.
Experiment results validate the effectiveness of our approach, demonstrating high-quality and visually appealing scene generation.
arXiv Detail & Related papers (2024-04-13T11:07:53Z) - Gaussian Grouping: Segment and Edit Anything in 3D Scenes [65.49196142146292]
We propose Gaussian Grouping, which extends Gaussian Splatting to jointly reconstruct and segment anything in open-world 3D scenes.
Compared to the implicit NeRF representation, we show that the grouped 3D Gaussians can reconstruct, segment and edit anything in 3D with high visual quality, fine granularity and efficiency.
arXiv Detail & Related papers (2023-12-01T17:09:31Z)
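Several of the papers above (GSVC, SHeaP, Splatter a Video) represent images or frames with 2-D Gaussian splats. As a rough illustration of that idea, a single frame can be rendered as an additive mixture of isotropic 2-D Gaussians. This is a deliberately simplified sketch: the function and its parameters are hypothetical, and real splatting pipelines typically use anisotropic, depth-sorted, alpha-composited Gaussians rather than a plain sum.

```python
import numpy as np

def render_frame(H, W, centers, scales, colors):
    """Render one H x W frame as an additive mixture of isotropic
    2-D Gaussians.

    centers: (N, 2) pixel positions (x, y); scales: (N,) standard
    deviations in pixels; colors: (N, 3) RGB weights.
    Returns an (H, W, 3) float image clipped to [0, 1].
    """
    ys, xs = np.mgrid[0:H, 0:W]
    frame = np.zeros((H, W, 3))
    for (gx, gy), s, rgb in zip(centers, scales, colors):
        # Unnormalised Gaussian weight at each pixel.
        w = np.exp(-((xs - gx) ** 2 + (ys - gy) ** 2) / (2 * s ** 2))
        frame += w[..., None] * rgb
    return np.clip(frame, 0.0, 1.0)

# One red Gaussian in the middle of a 32 x 32 frame.
frame = render_frame(
    32, 32,
    centers=np.array([[16.0, 16.0]]),
    scales=np.array([3.0]),
    colors=np.array([[1.0, 0.0, 0.0]]),
)
```

Compressing or representing a video then amounts to optimizing the centers, scales, and colors (per frame, or with motion shared across frames) against a reconstruction loss.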
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.