Related papers: UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling

UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling

URL: http://arxiv.org/abs/2508.14604v1
Date: Wed, 20 Aug 2025 10:46:01 GMT
Title: UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling
Authors: Peiming Li, Ziyi Wang, Yulin Yuan, Hong Liu, Xiangming Meng, Junsong Yuan, Mengyuan Liu,
Abstract summary: Point cloud videos capture 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions.<n> Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity.<n>We propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos.
Score: 53.199942923818206
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Point cloud videos capture dynamic 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Although Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when directly unfolding the point cloud video into a 1D sequence through temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos. Specifically, we introduce Spatial-Temporal Selection Scanning (STSS), which reorganizes unordered points into semantic-aware sequences through prompt-guided clustering, thereby enabling the effective utilization of points that are spatially and temporally distant yet similar within the sequence. For missing 4D geometric and motion details, Spatio-Temporal Structure Aggregation (STSA) aggregates spatio-temporal features and compensates. To improve temporal interaction within the sampled sequence, Temporal Interaction Sampling (TIS) enhances fine-grained temporal dependencies through non-anchor frame utilization and expanded receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of our method. Our code is available at https://github.com/wangzy01/UST-SSM.

Related papers

Future Does Matter: Boosting 3D Object Detection with Temporal Motion Estimation in Point Cloud Sequences [25.74000325019015]
We introduce a novel LiDAR 3D object detection framework, namely LiSTM, to facilitate spatial-temporal feature learning with cross-frame motion forecasting information. We have conducted experiments on the aggregation and nuScenes datasets to demonstrate that the proposed framework achieves superior 3D detection performance.
arXiv Detail & Related papers (2024-09-06T16:29:04Z)
MAMBA4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models [14.024240637175216]
We propose a novel point cloud video understanding backbone based on the State Space Models (SSMs)<n> Specifically, we first disentangle space and time in 4D video sequences and then establish the spatial-temporal correlation with our designed Mamba blocks.<n>Our method has a significant efficiency improvement with 87.5% GPU memory reduction and 5.36 times speed-up.
arXiv Detail & Related papers (2024-05-23T09:08:09Z)
Dynamic 3D Point Cloud Sequences as 2D Videos [81.46246338686478]
3D point cloud sequences serve as one of the most common and practical representation modalities of real-world environments. We propose a novel generic representation called textitStructured Point Cloud Videos (SPCVs) SPCVs re-organizes a point cloud sequence as a 2D video with spatial smoothness and temporal consistency, where the pixel values correspond to the 3D coordinates of points.
arXiv Detail & Related papers (2024-03-02T08:18:57Z)
Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos [75.9251839023226]
We propose a Masked-temporal Structure Prediction (MaST-Pre) method to capture the structure of point cloud videos without human annotations. MaST-Pre consists of two self-supervised learning tasks. First, by reconstructing masked point tubes, our method is able to capture appearance information of point cloud videos. Second, to learn motion, we propose a temporal cardinality difference prediction task that estimates the change in the number of points within a point tube.
arXiv Detail & Related papers (2023-08-18T02:12:54Z)
PSTNet: Point Spatio-Temporal Convolution on Point Cloud Sequences [51.53563462897779]
We propose a point-ordered (PST) convolution to achieve informative representations of point cloud sequences. PST first disentangles space and time in point cloud sequences, then a spatial convolution is employed to capture local structure points in the 3D space, and a temporal convolution is used to model the dynamics of the spatial regions along the time dimension. We incorporate the proposed PST convolution into a deep network, namely PSTNet, to extract features of point cloud sequences in a hierarchical manner.
arXiv Detail & Related papers (2022-05-27T02:14:43Z)
Learning Dynamic View Synthesis With Few RGBD Cameras [60.36357774688289]
We propose to utilize RGBD cameras to synthesize free-viewpoint videos of dynamic indoor scenes. We generate point clouds from RGBD frames and then render them into free-viewpoint videos via a neural feature. We introduce a simple Regional Depth-Inpainting module that adaptively inpaints missing depth values to render complete novel views.
arXiv Detail & Related papers (2022-04-22T03:17:35Z)
Real-time 3D human action recognition based on Hyperpoint sequence [14.218567196931687]
We propose a lightweight and effective point cloud sequence network for real-time 3D action recognition. Instead of capturing temporal-temporal local structures, SequentialPointNet encodes the temporal evolution of static appearances to recognize human actions. Experiments on three widely-used 3D action recognition datasets demonstrate that the proposed SequentialPointNet achieves competitive classification performance with up to 10X faster than existing approaches.
arXiv Detail & Related papers (2021-11-16T14:13:32Z)
Spatial-Temporal Transformer for 3D Point Cloud Sequences [23.000688043417913]
We propose a novel framework named Point Spatial-Temporal Transformer (PST2) to learn spatial-temporal representations. Our PST2 consists of two major modules: a Spatio-Temporal Self-Attention (STSA) module and a Resolution Embedding (RE) module. We test the effectiveness our PST2 with two different tasks on point cloud sequences, i.e., 4D semantic segmentation and 3D action recognition.
arXiv Detail & Related papers (2021-10-19T07:55:47Z)
Pseudo-LiDAR Point Cloud Interpolation Based on 3D Motion Representation and Spatial Supervision [68.35777836993212]
We propose a Pseudo-LiDAR point cloud network to generate temporally and spatially high-quality point cloud sequences. By exploiting the scene flow between point clouds, the proposed network is able to learn a more accurate representation of the 3D spatial motion relationship.
arXiv Detail & Related papers (2020-06-20T03:11:04Z)
Unsupervised Learning of Global Registration of Temporal Sequence of Point Clouds [16.019588704177288]
Global registration of point clouds aims to find an optimal alignment of a sequence of 2D or 3D point sets. We present a novel method that takes advantage of current deep learning techniques for unsupervised learning of global registration from a temporal sequence of point clouds.
arXiv Detail & Related papers (2020-06-17T06:00:36Z)
Spatio-Temporal Ranked-Attention Networks for Video Captioning [34.05025890230047]
We propose a model that combines spatial and temporal attention to videos in two different orders. We provide experiments on two benchmark datasets: MSVD and MSR-VTT. Our results demonstrate the synergy between the ST and TS modules, outperforming recent state-of-the-art methods.
arXiv Detail & Related papers (2020-01-17T01:00:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.