Align then Adapt: Rethinking Parameter-Efficient Transfer Learning in 4D Perception
- URL: http://arxiv.org/abs/2602.23069v1
- Date: Thu, 26 Feb 2026 14:58:59 GMT
- Title: Align then Adapt: Rethinking Parameter-Efficient Transfer Learning in 4D Perception
- Authors: Yiding Sun, Jihua Zhu, Haozhe Cheng, Chaoyi Lu, Zhichuan Yang, Lin Chen, Yaonan Wang
- Abstract summary: Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We develop a novel "Align then Adapt" (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. We show that PointATA can match or even outperform strong full fine-tuning models.
- Score: 44.7850628565891
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We recognize that 4D datasets are far scarcer than 3D ones, which hampers the scalability of self-supervised 4D models. A promising alternative is to transfer 3D pre-trained models to 4D perception tasks. However, rigorous empirical analysis reveals two critical limitations that impede transfer capability: overfitting and the modality gap. To overcome these challenges, we develop a novel "Align then Adapt" (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. Optimal-transport theory is employed to quantify the distributional discrepancy between 3D and 4D datasets, enabling our proposed point align embedder to be trained in Stage 1 to alleviate the underlying modality gap. To mitigate overfitting, an efficient point-video adapter and a spatial-context encoder are integrated into the frozen 3D backbone to enhance temporal modeling capacity in Stage 2. Notably, with the above engineering-oriented designs, PointATA enables a pre-trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost compared to previous work. Extensive experiments show that PointATA can match or even outperform strong full fine-tuning models, whilst enjoying the advantage of parameter efficiency, e.g., 97.21% accuracy on 3D action recognition, +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation.
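To make the two-stage "Align then Adapt" recipe above concrete, here is a minimal, hypothetical PyTorch sketch. The entropic Sinkhorn objective, the module names (`AlignEmbedder`, the stand-in adapter), and all training details are assumptions made for illustration; they are not the authors' released code, and the paper's actual point align embedder, point-video adapter, and spatial-context encoder are more elaborate.

```python
# Hypothetical sketch of the two-stage "Align then Adapt" idea (not the authors' code).
import math
import torch
import torch.nn as nn


def sinkhorn_distance(x, y, eps=0.05, n_iters=50):
    """Entropic optimal-transport cost between two feature sets (uniform marginals)."""
    cost = torch.cdist(x, y, p=2) ** 2                     # (n, m) squared Euclidean costs
    n, m = cost.shape
    log_a = torch.full((n,), -math.log(n))                  # log of uniform source marginal
    log_b = torch.full((m,), -math.log(m))                  # log of uniform target marginal
    f = torch.zeros(n)
    g = torch.zeros(m)
    for _ in range(n_iters):                                # log-domain Sinkhorn updates
        f = -eps * torch.logsumexp((g[None, :] - cost) / eps + log_b[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - cost) / eps + log_a[:, None], dim=0)
    plan = torch.exp((f[:, None] + g[None, :] - cost) / eps
                     + log_a[:, None] + log_b[None, :])     # approximate transport plan
    return (plan * cost).sum()


class AlignEmbedder(nn.Module):
    """Hypothetical stand-in for the 'point align embedder' trained in Stage 1."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)


# Stage 1 (Align): train the embedder so that 4D frame features move toward the
# feature distribution produced by the frozen 3D pre-trained encoder.
feats_3d = torch.randn(512, 256)   # placeholder features from a frozen 3D encoder
feats_4d = torch.randn(512, 256)   # placeholder features from 4D point cloud video frames
embedder = AlignEmbedder()
opt = torch.optim.AdamW(embedder.parameters(), lr=1e-3)
for _ in range(100):
    loss = sinkhorn_distance(embedder(feats_4d), feats_3d)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2 (Adapt): freeze the 3D backbone and train only lightweight adapter
# parameters on the downstream 4D task. Only the freezing pattern is shown; the
# point-video adapter and spatial-context encoder designs are omitted.
backbone = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
for p in backbone.parameters():
    p.requires_grad = False
adapter = nn.Linear(256, 256)
task_opt = torch.optim.AdamW(adapter.parameters(), lr=1e-3)
```

The design point illustrated here is the split of the parameter budget: the alignment objective touches only the embedder, and the downstream task touches only the adapters, so the 3D backbone is never updated in either stage.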
Related papers
- Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis [53.48281548500864]
Motion 3-to-4 is a feed-forward framework for synthesising high-quality 4D dynamic objects from a single monocular video. Our model learns a compact motion latent representation and predicts per-frame trajectories to recover complete, temporally coherent geometry.
arXiv Detail & Related papers (2026-01-20T18:59:48Z)
- SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation [30.72482055095692]
SWiT-4D is a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D integrates seamlessly with any Diffusion Transformer (DiT)-based image-to-3D generator. It achieves high-fidelity geometry and stable temporal consistency, indicating practical deployability under extremely limited 4D supervision.
arXiv Detail & Related papers (2025-12-11T17:54:31Z)
- Motion4D: Learning 3D-Consistent Motion and Semantics for 4D Scene Understanding [54.859943475818234]
We present Motion4D, a novel framework that integrates 2D priors from foundation models into a unified 4D Gaussian Splatting representation. Our method features a two-part iterative optimization framework: 1) Sequential optimization, which updates motion and semantic fields in consecutive stages to maintain local consistency, and 2) Global optimization, which jointly refines all attributes for long-term coherence. Our method significantly outperforms both 2D foundation models and existing 3D-based approaches across diverse scene understanding tasks, including point-based tracking, video object segmentation, and novel view synthesis.
arXiv Detail & Related papers (2025-12-03T09:32:56Z)
- Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models [79.06910348413861]
We introduce Diff4Splat, a feed-forward method that synthesizes controllable and explicit 4D scenes from a single image. Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion.
arXiv Detail & Related papers (2025-11-01T11:16:25Z)
- TriCLIP-3D: A Unified Parameter-Efficient Framework for Tri-Modal 3D Visual Grounding based on CLIP [52.79100775328595]
3D visual grounding allows an embodied agent to understand visual information in real-world 3D environments based on human instructions. Existing 3D visual grounding methods rely on separate encoders for different modalities. We propose a unified 2D pre-trained multi-modal network to process all three modalities.
arXiv Detail & Related papers (2025-07-20T10:28:06Z)
- Easi3R: Estimating Disentangled Motion from DUSt3R Without Training [69.51086319339662]
We introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning (a generic sketch of this idea appears after this list). Our experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2025-03-31T17:59:58Z)
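The Easi3R entry above mentions training-free attention adaptation applied during inference. As a generic illustration of that idea only (not the paper's actual algorithm), the hypothetical sketch below suppresses a subset of key tokens in a pre-computed attention map and renormalizes the rows; the head/query-averaged saliency rule and the 90th-percentile threshold are invented purely for illustration.

```python
# Generic, illustrative sketch of inference-time attention re-weighting.
# This is NOT Easi3R's algorithm; the aggregation rule and threshold are assumptions.
import torch


def reweight_attention(attn, keep_mask):
    """Zero out attention to masked key tokens and renormalize each query row.

    attn:      (heads, queries, keys) softmaxed attention weights
    keep_mask: (keys,) boolean, False for tokens to suppress
    """
    attn = attn * keep_mask.view(1, 1, -1).float()                 # suppress selected keys
    return attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)   # renormalize rows


# Toy usage: drop the keys that receive the most aggregate attention in one layer.
attn = torch.softmax(torch.randn(8, 64, 64), dim=-1)   # fake attention weights
saliency = attn.mean(dim=(0, 1))                        # (keys,) head/query-averaged attention
keep_mask = saliency < saliency.quantile(0.9)           # keep all but the top-10% keys
adapted = reweight_attention(attn, keep_mask)
```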
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information (including all summaries) and is not responsible for any consequences arising from its use.