Related papers: PointNet4D: A Lightweight 4D Point Cloud Video Backbone for Online and Offline Perception in Robotic Applications

URL: http://arxiv.org/abs/2512.01383v1
Date: Mon, 01 Dec 2025 07:58:01 GMT
Title: PointNet4D: A Lightweight 4D Point Cloud Video Backbone for Online and Offline Perception in Robotic Applications
Authors: Yunze Liu, Zifan Wang, Peiran Wu, Jiayang Ao
Abstract summary: We propose PointNet4D, a lightweight 4D backbone optimized for both online and offline settings. At its core is a Hybrid Mamba-Transformer temporal fusion block, which integrates the efficient state-space modeling of Mamba and the bidirectional modeling power of Transformers. To enhance temporal understanding, we introduce 4DMAP, a frame-wise masked auto-regressive pretraining strategy that captures motion cues across frames.
Score: 17.120778989036012
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding dynamic 4D environments (3D space evolving over time) is critical for robotic and interactive systems. These applications demand systems that can process streaming point cloud video in real time, often under resource constraints, while also benefiting from past and present observations when available. However, current 4D backbone networks rely heavily on spatiotemporal convolutions and Transformers, which are often computationally intensive and poorly suited to real-time applications. We propose PointNet4D, a lightweight 4D backbone optimized for both online and offline settings. At its core is a Hybrid Mamba-Transformer temporal fusion block, which integrates the efficient state-space modeling of Mamba and the bidirectional modeling power of Transformers. This enables PointNet4D to handle variable-length online sequences efficiently across different deployment scenarios. To enhance temporal understanding, we introduce 4DMAP, a frame-wise masked auto-regressive pretraining strategy that captures motion cues across frames. Our extensive evaluations across 9 tasks on 7 datasets demonstrate consistent improvements across diverse domains. We further demonstrate PointNet4D's utility by building two robotic application systems, 4D Diffusion Policy and 4D Imitation Learning, which achieve substantial gains on the RoboTwin and HandoverSim benchmarks.

Related papers

Efficiently Reconstructing Dynamic Scenes One D4RT at a Time [54.67332582569525]
This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks.
arXiv Detail & Related papers (2025-12-09T18:57:21Z)
Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation [21.075786141331974]
We present Track4DGen, a framework for generating dynamic 4D objects from sparse inputs. In Stage One, we enforce dense, feature-level point correspondences inside the diffusion generator. In Stage Two, we reconstruct a dynamic 4D-GS using a hybrid motion encoding.
arXiv Detail & Related papers (2025-12-05T21:13:04Z)
4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads [17.413013509299933]
We introduce 4DSegStreamer, a novel framework that employs a Dual-Thread System to efficiently process streaming frames. The framework is general and can be seamlessly integrated into existing 3D and 4D segmentation methods to enable real-time capability. It also demonstrates superior robustness compared to existing streaming perception approaches, particularly under high FPS conditions.
arXiv Detail & Related papers (2025-10-20T15:37:49Z)
Streaming 4D Visual Geometry Transformer [63.99937807085461]
We propose a streaming 4D visual geometry transformer to process the input sequence in an online manner. We use temporal causal attention and cache the historical keys and values as implicit memory to enable efficient streaming long-term 4D reconstruction. Experiments on various 4D geometry perception benchmarks demonstrate that our model increases the inference speed in online scenarios.
arXiv Detail & Related papers (2025-07-15T17:59:57Z)
Easi3R: Estimating Disentangled Motion from DUSt3R Without Training [69.51086319339662]
We introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. Our experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2025-03-31T17:59:58Z)
Driv3R: Learning Dense 4D Reconstruction for Autonomous Driving [116.10577967146762]
We propose Driv3R, a framework that directly regresses per-frame point maps from multi-view image sequences. We employ a 4D flow predictor to identify moving objects within the scene and direct the network's focus toward reconstructing these dynamic regions. Driv3R outperforms previous frameworks in 4D dynamic scene reconstruction, achieving 15x faster inference speed.
arXiv Detail & Related papers (2024-12-09T18:58:03Z)
Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models [116.31344506738816]
We present a novel framework, Diffusion4D, for efficient and scalable 4D content generation. We develop a 4D-aware video diffusion model capable of synthesizing orbital views of dynamic 3D assets. Our method surpasses prior state-of-the-art techniques in terms of generation efficiency and 4D geometry consistency.
arXiv Detail & Related papers (2024-05-26T17:47:34Z)
MAMBA4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models [14.024240637175216]
We propose a novel point cloud video understanding backbone based on State Space Models (SSMs). Specifically, we first disentangle space and time in 4D video sequences and then establish the spatial-temporal correlation with our designed Mamba blocks. Our method achieves a significant efficiency improvement, with an 87.5% GPU memory reduction and a 5.36 times speed-up.
arXiv Detail & Related papers (2024-05-23T09:08:09Z)
4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency [118.15258850780417]
We present 4DGen, a novel framework for grounded 4D content creation. Our pipeline facilitates controllable 4D generation, enabling users to specify the motion via monocular video or adopt image-to-video generations. Compared to existing video-to-4D baselines, our approach yields superior results in faithfully reconstructing input signals.
arXiv Detail & Related papers (2023-12-28T18:53:39Z)
NSM4D: Neural Scene Model Based Online 4D Point Cloud Sequence Understanding [20.79861588128133]
We introduce a generic online 4D perception paradigm called NSM4D. NSM4D serves as a plug-and-play strategy that can be adapted to existing 4D backbones. We demonstrate significant improvements on various online perception benchmarks in indoor and outdoor settings.
arXiv Detail & Related papers (2023-10-12T13:42:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.