Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos
- URL: http://arxiv.org/abs/2504.04837v2
- Date: Tue, 20 May 2025 07:47:12 GMT
- Title: Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos
- Authors: Zhi Zuo, Chenyi Zhuang, Pan Gao, Jie Qin, Hao Feng, Nicu Sebe
- Abstract summary: Existing methods rely on explicit knowledge to learn motion, resulting in suboptimal representations. Prior Masked AutoEncoder (MAE) frameworks struggle to bridge the gap between low-level geometry and high-level dynamics in 4D data. We propose a novel self-disentangled MAE for learning expressive, discriminative, and transferable 4D representations.
- Score: 70.07088203106443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised representation learning for point cloud videos remains a challenging problem with two key limitations: (1) existing methods rely on explicit knowledge to learn motion, resulting in suboptimal representations; (2) prior Masked AutoEncoder (MAE) frameworks struggle to bridge the gap between low-level geometry and high-level dynamics in 4D data. In this work, we propose a novel self-disentangled MAE for learning expressive, discriminative, and transferable 4D representations. To overcome the first limitation, we learn motion by aligning high-level semantics in the latent space without any explicit knowledge. To tackle the second, we introduce a self-disentangled learning strategy that incorporates the latent token with the geometry token within a shared decoder, effectively disentangling low-level geometry and high-level semantics. In addition to the reconstruction objective, we employ three alignment objectives to enhance temporal understanding, including frame-level motion and video-level global information. We show that our pre-trained encoder surprisingly discriminates spatio-temporal representations without further fine-tuning. Extensive experiments on MSR-Action3D, NTU-RGBD, HOI4D, NvGesture, and SHREC'17 demonstrate the superiority of our approach in both coarse-grained and fine-grained 4D downstream tasks. Notably, Uni4D improves action segmentation accuracy on HOI4D by +3.8%.
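As a rough illustration of the self-disentangled decoding described in the abstract, the sketch below pairs a learnable latent token with geometry tokens in one shared decoder and exposes separate reconstruction and alignment targets. It is a minimal sketch under stated assumptions, not the authors' implementation: the module names, token shapes, loss form, and hyperparameters are all placeholders.

```python
# Illustrative sketch of a "self-disentangled" MAE decoder pass, loosely
# following the Uni4D abstract. All names, shapes, and losses are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfDisentangledDecoder(nn.Module):
    def __init__(self, dim=256, depth=2, num_heads=4, points_per_patch=32):
        super().__init__()
        # One learnable latent token carries high-level semantics per frame.
        self.latent_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True)
        self.shared_decoder = nn.TransformerEncoder(layer, depth)
        # Geometry head reconstructs masked point patches (low-level geometry).
        self.geometry_head = nn.Linear(dim, points_per_patch * 3)
        # Semantic head yields a frame embedding consumed by alignment losses.
        self.semantic_head = nn.Linear(dim, dim)

    def forward(self, geometry_tokens):
        # geometry_tokens: (B, N, dim) encoder outputs for one frame.
        b = geometry_tokens.size(0)
        latent = self.latent_token.expand(b, -1, -1)
        # Latent and geometry tokens share one decoder, so the latent token can
        # absorb high-level content while geometry tokens keep low-level detail.
        tokens = self.shared_decoder(torch.cat([latent, geometry_tokens], dim=1))
        latent_out, geom_out = tokens[:, :1], tokens[:, 1:]
        recon = self.geometry_head(geom_out)                # (B, N, points_per_patch * 3)
        frame_embedding = self.semantic_head(latent_out.squeeze(1))  # (B, dim)
        return recon, frame_embedding


def alignment_loss(pred_embedding, target_embedding):
    # Cosine alignment in latent space: a stand-in for the frame-level motion
    # and video-level global alignment objectives mentioned in the abstract.
    return 1 - F.cosine_similarity(pred_embedding, target_embedding, dim=-1).mean()


decoder = SelfDisentangledDecoder()
recon, frame_emb = decoder(torch.randn(2, 64, 256))  # toy batch of 2 frames, 64 tokens
```

Keeping a single decoder for both token types is what lets the latent token siphon off high-level content while the geometry tokens stay tied to low-level reconstruction.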
Related papers
- OpenHuman4D: Open-Vocabulary 4D Human Parsing [7.533936292165496]
We introduce the first 4D human parsing framework, which reduces inference time and adds open-vocabulary capabilities. Building upon state-of-the-art open-vocabulary 3D human parsing techniques, our approach extends support to 4D human-centric video.
arXiv Detail & Related papers (2025-07-14T03:35:06Z) - Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency [49.875459658889355]
Free4D is a tuning-free framework for 4D scene generation from a single image. Our key insight is to distill pre-trained foundation models for consistent 4D scene representation. The resulting 4D representation enables real-time, controllable rendering.
arXiv Detail & Related papers (2025-03-26T17:59:44Z) - Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields [56.184278668305076]
We introduce Feature4X, a universal framework to extend functionality from 2D vision foundation models into the 4D realm. The framework is the first to distill and lift the features of video foundation models into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel-view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps.
arXiv Detail & Related papers (2025-03-26T17:56:16Z) - Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene [122.42861221739123]
This paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. We propose a 2D-to-4D visual scene transfer learning framework, where a spatial-temporal scene strategy effectively transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes.
arXiv Detail & Related papers (2025-03-19T09:16:08Z) - AR4D: Autoregressive 4D Generation from Monocular Videos [27.61057927559143]
Existing approaches primarily rely on Score Distillation Sampling (SDS) to infer novel-view videos.
We propose AR4D, a novel paradigm for SDS-free 4D generation.
We show that AR4D can achieve state-of-the-art 4D generation without SDS, delivering greater diversity, improved spatial-temporal consistency, and better alignment with input prompts.
arXiv Detail & Related papers (2025-01-03T09:27:36Z) - GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models [39.488763757826426]
2D Vision-Language Models (VLMs) have made significant strides in image-text understanding tasks. Recent advances have leveraged 3D point clouds and multi-view images as inputs, yielding promising results. We propose a vision-based solution inspired by human perception, which merely relies on visual cues for 3D spatial understanding.
arXiv Detail & Related papers (2025-01-02T18:59:59Z) - Scaling 4D Representations [77.85462796134455]
Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks.
arXiv Detail & Related papers (2024-12-19T18:59:51Z) - EG4D: Explicit Generation of 4D Object without Score Distillation [105.63506584772331]
EG4D is a novel framework that generates high-quality and consistent 4D assets without score distillation.
Our framework outperforms the baselines in generation quality by a considerable margin.
arXiv Detail & Related papers (2024-05-28T12:47:22Z) - Dynamic 3D Point Cloud Sequences as 2D Videos [81.46246338686478]
3D point cloud sequences serve as one of the most common and practical representation modalities of real-world environments.
We propose a novel generic representation called Structured Point Cloud Videos (SPCVs).
SPCVs re-organize a point cloud sequence as a 2D video with spatial smoothness and temporal consistency, where the pixel values correspond to the 3D coordinates of points.
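To make the pixels-as-coordinates idea concrete, the snippet below shows only the target data layout of such a representation: a (T, H, W, 3) tensor whose pixel values are 3D coordinates. The naive row-major packing is an assumption for illustration; the paper learns a structured mapping with smoothness and consistency constraints rather than packing points arbitrarily.

```python
# Illustrative data layout only: a point cloud sequence stored as a
# (T, H, W, 3) "video" whose pixel values are 3D coordinates.
import numpy as np


def pack_sequence_as_video(frames, height, width):
    """frames: list of (N, 3) arrays with N == height * width points each."""
    video = np.zeros((len(frames), height, width, 3), dtype=np.float32)
    for t, points in enumerate(frames):
        assert points.shape == (height * width, 3)
        video[t] = points.reshape(height, width, 3)  # pixel (i, j) holds an (x, y, z)
    return video


# Example: a 10-frame sequence with 32 x 32 = 1024 points per frame.
sequence = [np.random.rand(1024, 3).astype(np.float32) for _ in range(10)]
spcv_like = pack_sequence_as_video(sequence, 32, 32)
print(spcv_like.shape)  # (10, 32, 32, 3)
```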
arXiv Detail & Related papers (2024-03-02T08:18:57Z) - Self-supervised Learning of LiDAR 3D Point Clouds via 2D-3D Neural Calibration [107.61458720202984]
This paper introduces a novel self-supervised learning framework for enhancing 3D perception in autonomous driving scenes. We propose the learnable transformation alignment to bridge the domain gap between image and point cloud data. We establish dense 2D-3D correspondences to estimate the rigid pose.
arXiv Detail & Related papers (2024-01-23T02:41:06Z) - X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-modal Knowledge Transfer [28.719098240737605]
We propose a novel cross-modal knowledge transfer framework, called X4D-SceneFormer.
It enhances 4D-Scene understanding by transferring texture priors from RGB sequences using a Transformer architecture with temporal relationship mining.
Experiments demonstrate the superior performance of our framework on various 4D point cloud video understanding tasks.
arXiv Detail & Related papers (2023-12-12T15:48:12Z) - Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when labeled scenes are quite limited. To extract knowledge for novel categories from pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy. In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z) - A Unified Approach for Text- and Image-guided 4D Scene Generation [58.658768832653834]
We propose Dream-in-4D, which features a novel two-stage approach for text-to-4D synthesis.
We show that our approach significantly advances image and motion quality, 3D consistency and text fidelity for text-to-4D generation.
Our method offers, for the first time, a unified approach for text-to-4D, image-to-4D and personalized 4D generation tasks.
arXiv Detail & Related papers (2023-11-28T15:03:53Z) - NSM4D: Neural Scene Model Based Online 4D Point Cloud Sequence Understanding [20.79861588128133]
We introduce a generic online 4D perception paradigm called NSM4D.
NSM4D serves as a plug-and-play strategy that can be adapted to existing 4D backbones.
We demonstrate significant improvements on various online perception benchmarks in indoor and outdoor settings.
arXiv Detail & Related papers (2023-10-12T13:42:49Z) - Complete-to-Partial 4D Distillation for Self-Supervised Point Cloud Sequence Representation Learning [14.033085586047799]
This paper proposes a new 4D self-supervised pre-training method called Complete-to-Partial 4D Distillation.
Our key idea is to formulate 4D self-supervised representation learning as a teacher-student knowledge distillation framework.
Experiments show that this approach significantly outperforms previous pre-training approaches on a wide range of 4D point cloud sequence understanding tasks.
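The teacher-student formulation mentioned above can be pictured with a toy example: a frozen teacher encodes a complete point cloud while the student encodes a partial view and is trained to match the teacher's feature. This is a minimal sketch under stated assumptions (the tiny PointNet-style encoder, the crude cropping, and the cosine loss are all stand-ins), not the paper's architecture.

```python
# Toy complete-to-partial distillation: the student, seeing only a partial
# point cloud, is trained to reproduce the frozen teacher's feature of the
# complete point cloud. Encoder and cropping scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyPointEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, points):                      # points: (B, N, 3)
        return self.mlp(points).max(dim=1).values   # (B, dim) global feature


teacher, student = TinyPointEncoder(), TinyPointEncoder()
teacher.load_state_dict(student.state_dict())   # start from identical weights
for p in teacher.parameters():
    p.requires_grad_(False)                      # teacher is frozen

complete = torch.rand(2, 2048, 3)                # complete point clouds
partial = complete[:, :512, :]                   # crude "partial" view (assumption)

with torch.no_grad():
    target = teacher(complete)
pred = student(partial)
distill_loss = 1 - F.cosine_similarity(pred, target, dim=-1).mean()
distill_loss.backward()
print(float(distill_loss))
```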
arXiv Detail & Related papers (2022-12-10T16:26:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.