Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos
- URL: http://arxiv.org/abs/2504.04837v1
- Date: Mon, 07 Apr 2025 08:47:36 GMT
- Title: Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos
- Authors: Zhi Zuo, Chenyi Zhuang, Zhiqiang Shen, Pan Gao, Jie Qin
- Abstract summary: We introduce the first self-disentangled MAE for learning discriminative 4D representations in the pre-training stage. We demonstrate that it can boost the fine-tuning performance on all 4D tasks, which we term Uni4D.
- Score: 48.8325946928959
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Point cloud video representation learning is primarily built upon the masking strategy in a self-supervised manner. However, progress is slow due to several significant challenges: (1) existing methods learn motion largely through hand-crafted designs, which yields unsatisfactory motion patterns during pre-training that do not transfer to fine-tuning scenarios; (2) previous Masked AutoEncoder (MAE) frameworks are limited in resolving the huge representation gap inherent in 4D data. In this study, we introduce the first self-disentangled MAE for learning discriminative 4D representations in the pre-training stage. To address the first challenge, we propose to model the motion representation in a latent space. The second issue is resolved by introducing latent tokens along with the typical geometry tokens to disentangle high-level and low-level features during decoding. Extensive experiments on MSR-Action3D, NTU-RGBD, HOI4D, NvGesture, and SHREC'17 verify this self-disentangled learning framework, which we term Uni4D, and demonstrate that it boosts fine-tuning performance on all 4D tasks. Our pre-trained model produces discriminative and meaningful 4D representations and particularly benefits the processing of long videos: Uni4D gains $+3.8\%$ segmentation accuracy on HOI4D, significantly outperforming both self-supervised and fully-supervised methods after end-to-end fine-tuning.
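The decoding scheme described in the abstract (learned latent tokens carrying high-level motion/semantic cues alongside the usual geometry tokens) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation; the module name, dimensions, and reconstruction head below are hypothetical and only sketch the token-disentanglement idea under common MAE conventions.

```python
import torch
import torch.nn as nn

class DisentangledMAEDecoder(nn.Module):
    """Toy MAE decoder mixing learned latent tokens with geometry tokens (illustrative only)."""

    def __init__(self, dim=256, num_latent_tokens=8, depth=4, heads=8, points_per_patch=32):
        super().__init__()
        # Learned latent tokens intended to absorb high-level (e.g. motion) features.
        self.latent_tokens = nn.Parameter(torch.zeros(1, num_latent_tokens, dim))
        # Learned mask token standing in for each masked geometry patch.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        nn.init.trunc_normal_(self.latent_tokens, std=0.02)
        nn.init.trunc_normal_(self.mask_token, std=0.02)
        # Plain Transformer decoder stack; positional embeddings are omitted for brevity.
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        # Low-level head: reconstruct xyz coordinates of the masked point patches.
        self.geom_head = nn.Linear(dim, 3 * points_per_patch)

    def forward(self, visible_tokens, num_masked):
        B, num_visible, _ = visible_tokens.shape
        latent = self.latent_tokens.expand(B, -1, -1)
        masked = self.mask_token.expand(B, num_masked, -1)
        # Decode latent tokens jointly with visible and masked geometry tokens.
        x = self.blocks(torch.cat([latent, visible_tokens, masked], dim=1))
        n_latent = latent.size(1)
        high_level = x[:, :n_latent]                            # disentangled high-level features
        recon = self.geom_head(x[:, n_latent + num_visible:])   # low-level point reconstruction
        return high_level, recon

# Example: 48 visible tokens, 16 masked tokens, batch of 2 point cloud video clips.
decoder = DisentangledMAEDecoder()
high, recon = decoder(torch.randn(2, 48, 256), num_masked=16)
print(high.shape, recon.shape)  # torch.Size([2, 8, 256]) torch.Size([2, 16, 96])
```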
Related papers
- Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency [49.875459658889355]
Free4D is a tuning-free framework for 4D scene generation from a single image. Our key insight is to distill pre-trained foundation models for consistent 4D scene representation. The resulting 4D representation enables real-time, controllable rendering.
arXiv Detail & Related papers (2025-03-26T17:59:44Z)
- Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields [56.184278668305076]
We introduce Feature4X, a universal framework to extend functionality from 2D vision foundation models into the 4D realm. The framework is the first to distill and lift the features of video foundation models into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel-view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps.
arXiv Detail & Related papers (2025-03-26T17:56:16Z)
- Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene [122.42861221739123]
This paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. We propose a 2D-to-4D visual scene transfer learning framework, where a spatial-temporal scene strategy effectively transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes.
arXiv Detail & Related papers (2025-03-19T09:16:08Z)
- AR4D: Autoregressive 4D Generation from Monocular Videos [27.61057927559143]
Existing approaches primarily rely on Score Distillation Sampling to infer novel-view videos.
We propose AR4D, a novel paradigm for SDS-free 4D generation.
We show that AR4D can achieve state-of-the-art 4D generation without SDS, delivering greater diversity, improved spatial-temporal consistency, and better alignment with input prompts.
arXiv Detail & Related papers (2025-01-03T09:27:36Z)
- Scaling 4D Representations [77.85462796134455]
Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks.
arXiv Detail & Related papers (2024-12-19T18:59:51Z)
- EG4D: Explicit Generation of 4D Object without Score Distillation [105.63506584772331]
DG4D is a novel framework that generates high-quality and consistent 4D assets without score distillation.
Our framework outperforms the baselines in generation quality by a considerable margin.
arXiv Detail & Related papers (2024-05-28T12:47:22Z)
- NSM4D: Neural Scene Model Based Online 4D Point Cloud Sequence Understanding [20.79861588128133]
We introduce a generic online 4D perception paradigm called NSM4D.
NSM4D serves as a plug-and-play strategy that can be adapted to existing 4D backbones.
We demonstrate significant improvements on various online perception benchmarks in indoor and outdoor settings.
arXiv Detail & Related papers (2023-10-12T13:42:49Z)
- Complete-to-Partial 4D Distillation for Self-Supervised Point Cloud Sequence Representation Learning [14.033085586047799]
This paper proposes a new 4D self-supervised pre-training method called Complete-to-Partial 4D Distillation.
Our key idea is to formulate 4D self-supervised representation learning as a teacher-student knowledge distillation framework.
Experiments show that this approach significantly outperforms previous pre-training approaches on a wide range of 4D point cloud sequence understanding tasks.
arXiv Detail & Related papers (2022-12-10T16:26:19Z)