PUMPS: Skeleton-Agnostic Point-based Universal Motion Pre-Training for Synthesis in Human Motion Tasks
- URL: http://arxiv.org/abs/2507.20170v1
- Date: Sun, 27 Jul 2025 08:20:49 GMT
- Title: PUMPS: Skeleton-Agnostic Point-based Universal Motion Pre-Training for Synthesis in Human Motion Tasks
- Authors: Clinton Ansun Mo, Kun Hu, Chengjiang Long, Dong Yuan, Wan-Chi Siu, Zhiyong Wang
- Abstract summary: Motion skeletons drive 3D character animation by transforming bone hierarchies, but differences in proportions or structure make motion data hard to transfer across skeletons. Temporal Point Clouds (TPCs) offer an unstructured, cross-compatible motion representation. We propose PUMPS, the primordial autoencoder architecture for TPC data.
- Score: 44.19486142246208
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Motion skeletons drive 3D character animation by transforming bone hierarchies, but differences in proportions or structure make motion data hard to transfer across skeletons, posing challenges for data-driven motion synthesis. Temporal Point Clouds (TPCs) offer an unstructured, cross-compatible motion representation. Though convertible to and from skeletons, TPCs have so far served mainly for compatibility rather than for direct motion task learning, which would require data synthesis capabilities in the TPC format and raises unexplored challenges around temporal consistency and point identifiability. Therefore, we propose PUMPS, the primordial autoencoder architecture for TPC data. PUMPS independently reduces frame-wise point clouds into sampleable feature vectors, from which a decoder extracts distinct temporal points using latent Gaussian noise vectors as sampling identifiers. We introduce linear assignment-based point pairing to optimise the TPC reconstruction process, avoiding the need for expensive point-wise attention mechanisms in the architecture. Using these latent features, we pre-train a motion synthesis model capable of motion prediction, transition generation, and keyframe interpolation. On these pre-training tasks, PUMPS performs remarkably well even without native dataset supervision, matching state-of-the-art performance. When fine-tuned for motion denoising or estimation, PUMPS outperforms many respective methods without deviating from its generalist architecture.
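The abstract's two key mechanisms, latent Gaussian noise vectors as sampling identifiers and linear assignment-based point pairing, can be sketched as below. This is a minimal illustrative reading, not the authors' implementation; it assumes equal-sized predicted and ground-truth point sets per frame, uses SciPy's Hungarian solver, and every name in it is hypothetical.

```python
# Sketch of linear assignment-based point pairing for TPC reconstruction,
# plus noise-based sampling identifiers. Illustrative only; all names are
# hypothetical and this is not the PUMPS source code.
import numpy as np
from scipy.optimize import linear_sum_assignment

def sample_identifiers(num_points, dim, rng=None):
    """Draw one latent Gaussian noise vector per point to decode; each
    vector acts as the sampling identifier for a distinct temporal point."""
    rng = rng or np.random.default_rng()
    return rng.standard_normal((num_points, dim))

def paired_reconstruction_loss(pred_points, gt_points):
    """Pair decoded points with ground-truth points by optimal linear
    assignment, then average the matched squared distances.

    pred_points, gt_points: (N, 3) arrays for one frame of a TPC.
    """
    # Pairwise squared Euclidean distances: cost[i, j] = ||pred_i - gt_j||^2
    diff = pred_points[:, None, :] - gt_points[None, :, :]
    cost = np.einsum("ijk,ijk->ij", diff, diff)
    # The Hungarian algorithm yields a one-to-one pairing with minimum total
    # cost, sidestepping any point-wise attention mechanism.
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

# Toy usage: draw noise identifiers for a decoder, score one frame.
rng = np.random.default_rng(0)
ids = sample_identifiers(num_points=24, dim=16, rng=rng)   # (24, 16)
pred = rng.standard_normal((24, 3))                        # stand-in decoder output
gt = rng.standard_normal((24, 3))
loss = paired_reconstruction_loss(pred, gt)
```

Because the pairing is recomputed at every step, the loss is permutation-invariant over points, which is what makes an unordered TPC format trainable without fixed point identities.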
Related papers
- Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation [10.122882293302787]
Temporal segmentation of human actions is critical for intelligent robots in collaborative settings. We propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data. Our approach outperforms state-of-the-art methods, especially in action segmentation accuracy.
arXiv Detail & Related papers (2025-07-01T13:55:57Z) - Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion. We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
arXiv Detail & Related papers (2025-03-25T17:58:48Z) - Online Dense Point Tracking with Streaming Memory [54.22820729477756]
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video. Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one. We present a lightweight and fast model with Streaming memory for dense POint Tracking and online video processing.
arXiv Detail & Related papers (2025-03-09T06:16:49Z) - Motion Keyframe Interpolation for Any Human Skeleton via Temporally Consistent Point Cloud Sampling and Reconstruction [37.796387171573734]
Point Cloud-based Motion Representation (PC-MRL) is an unsupervised approach to enabling cross-compatibility between skeletons for motion learning. PC-MRL consists of a skeleton obfuscation strategy using temporal point cloud sampling and an unsupervised skeleton reconstruction method from point clouds. We propose First-frame Offset Quaternion (FOQ) and Rest Pose Augmentation (RPA) strategies to overcome necessary limitations of our unsupervised point cloud-to-skeletal motion process.
arXiv Detail & Related papers (2024-05-13T03:03:04Z) - You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z) - Temporal-Viewpoint Transportation Plan for Skeletal Few-shot Action Recognition [38.27785891922479]
A few-shot learning pipeline for 3D skeleton-based action recognition via Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE).
arXiv Detail & Related papers (2022-10-30T11:46:38Z) - 3D Skeleton-based Few-shot Action Recognition with JEANIE is not so Naïve [28.720272938306692]
We propose a Few-shot Learning pipeline for 3D skeleton-based action recognition by Joint tEmporal and cAmera viewpoiNt alIgnmEnt.
arXiv Detail & Related papers (2021-12-23T16:09:23Z) - Unsupervised Motion Representation Learning with Capsule Autoencoders [54.81628825371412]
Motion Capsule Autoencoder (MCAE) models motion in a two-level hierarchy.
MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets.
arXiv Detail & Related papers (2021-10-01T16:52:03Z) - Skeleton-based Action Recognition via Spatial and Temporal Transformer Networks [12.06555892772049]
We propose a novel Spatial-Temporal Transformer network (ST-TR) that models dependencies between joints using the Transformer self-attention operator (a minimal sketch of this joint-attention idea appears after this list).
The proposed ST-TR achieves state-of-the-art performance on all datasets when using joint coordinates as input, and results on par with the state of the art when bone information is added.
arXiv Detail & Related papers (2020-08-17T15:25:40Z) - Perceptron Synthesis Network: Rethinking the Action Scale Variances in Videos [48.57686258913474]
Video action recognition has been partially addressed by CNNs stacking fixed-size 3D kernels.
We propose to learn the optimal-scale kernels from the data.
An action perceptron synthesizer is proposed to generate the kernels from a bag of fixed-size kernels.
arXiv Detail & Related papers (2020-07-22T14:22:29Z)
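As referenced in the ST-TR entry above, spatial self-attention over skeleton joints can be sketched as follows. This is a generic illustration of joint-to-joint attention under assumed shapes, not the ST-TR authors' code; the joint count and feature dimensions are arbitrary assumptions.

```python
# Generic sketch of spatial self-attention over skeleton joints, in the
# spirit of ST-TR's spatial stream. Illustrative only; shapes and names
# are assumptions, not the paper's implementation.
import numpy as np

def joint_self_attention(X, Wq, Wk, Wv):
    """Let every joint attend to every other joint in one frame, so
    inter-joint dependencies are learned rather than fixed by the skeleton.

    X: (J, C) joint features; Wq, Wk, Wv: (C, d) projection matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # (J, d) each
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (J, J) joint-joint affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (J, d) attended joint features

J, C, d = 25, 64, 32                                # e.g., 25 joints (assumption)
rng = np.random.default_rng(0)
X = rng.standard_normal((J, C))
Wq, Wk, Wv = (rng.standard_normal((C, d)) for _ in range(3))
out = joint_self_attention(X, Wq, Wk, Wv)           # (25, 32)
```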