Spatiotemporal-Untrammelled Mixture of Experts for Multi-Person Motion Prediction
- URL: http://arxiv.org/abs/2512.21707v1
- Date: Thu, 25 Dec 2025 15:01:19 GMT
- Title: Spatiotemporal-Untrammelled Mixture of Experts for Multi-Person Motion Prediction
- Authors: Zheng Yin, Chengjian Li, Xiangbo Shu, Meiqi Cao, Rui Yan, Jinhui Tang
- Abstract summary: Comprehensively and flexibly capturing the complex spatio-temporal dependencies of human motion is critical for multi-person motion prediction. Existing methods grapple with two primary limitations: inflexible spatiotemporal representation and high computational costs stemming from the quadratic time complexity of conventional attention. Our model incorporates four distinct types of spatiotemporal experts, each specializing in capturing different spatial or temporal dependencies.
- Score: 53.555201955973104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Comprehensively and flexibly capturing the complex spatio-temporal dependencies of human motion is critical for multi-person motion prediction. Existing methods grapple with two primary limitations: i) Inflexible spatiotemporal representation due to reliance on positional encodings for capturing spatiotemporal information. ii) High computational costs stemming from the quadratic time complexity of conventional attention mechanisms. To overcome these limitations, we propose the Spatiotemporal-Untrammelled Mixture of Experts (ST-MoE), which flexibly explores complex spatio-temporal dependencies in human motion and significantly reduces computational cost. To adaptively mine complex spatio-temporal patterns from human motion, our model incorporates four distinct types of spatiotemporal experts, each specializing in capturing different spatial or temporal dependencies. To reduce the potential computational overhead while integrating multiple experts, we introduce bidirectional spatiotemporal Mamba as experts, each sharing bidirectional temporal and spatial Mamba in distinct combinations to achieve model efficiency and parameter economy. Extensive experiments on four multi-person benchmark datasets demonstrate that our approach not only outperforms the state of the art in accuracy but also reduces model parameters by 41.38% and achieves a 3.6x speedup in training. The code is available at https://github.com/alanyz106/ST-MoE.
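To make the expert design described in the abstract above concrete, here is a minimal, hypothetical PyTorch sketch of a spatiotemporal mixture-of-experts layer: four experts built from two shared sequence blocks (one temporal, one spatial) combined in different orders, with a soft gate mixing their outputs. Bidirectional GRUs stand in for the bidirectional Mamba (SSM) blocks purely for illustration; all names, shapes, and the gating scheme are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a spatiotemporal mixture-of-experts layer (not the official ST-MoE code).
import torch
import torch.nn as nn


class BiSeqBlock(nn.Module):
    """Stand-in for a bidirectional Mamba block: scans a sequence in both directions."""
    def __init__(self, dim):
        super().__init__()
        # dim must be even so the two directions concatenate back to dim channels.
        self.rnn = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):              # x: (batch, length, dim)
        out, _ = self.rnn(x)
        return self.norm(x + out)      # residual connection + normalization


class STMoELayer(nn.Module):
    """Four experts sharing one temporal and one spatial block in different combinations."""
    def __init__(self, dim):
        super().__init__()
        self.temporal = BiSeqBlock(dim)   # shared by all experts
        self.spatial = BiSeqBlock(dim)    # shared by all experts
        self.gate = nn.Linear(dim, 4)     # soft router over the four experts

    def _run_temporal(self, x):           # x: (B, T, N, D); scan along the time axis T
        B, T, N, D = x.shape
        y = self.temporal(x.permute(0, 2, 1, 3).reshape(B * N, T, D))
        return y.reshape(B, N, T, D).permute(0, 2, 1, 3)

    def _run_spatial(self, x):             # scan along the joint/person token axis N
        B, T, N, D = x.shape
        y = self.spatial(x.reshape(B * T, N, D))
        return y.reshape(B, T, N, D)

    def forward(self, x):                  # x: (B, T, N, D) pose tokens (N = persons x joints)
        experts = [
            self._run_temporal(x),                      # temporal-only expert
            self._run_spatial(x),                       # spatial-only expert
            self._run_spatial(self._run_temporal(x)),   # temporal -> spatial expert
            self._run_temporal(self._run_spatial(x)),   # spatial -> temporal expert
        ]
        weights = torch.softmax(self.gate(x), dim=-1)   # (B, T, N, 4) mixing weights
        stacked = torch.stack(experts, dim=-1)          # (B, T, N, D, 4)
        return (stacked * weights.unsqueeze(-2)).sum(-1)


if __name__ == "__main__":
    layer = STMoELayer(dim=64)
    tokens = torch.randn(2, 25, 3 * 15, 64)   # 2 sequences, 25 frames, 3 persons x 15 joints
    print(layer(tokens).shape)                 # torch.Size([2, 25, 45, 64])
```

The point this sketch mirrors from the abstract is parameter sharing: all four experts reuse the same temporal and spatial blocks, so adding experts mainly adds routing logic rather than parameters.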
Related papers
- RainDiff: End-to-end Precipitation Nowcasting Via Token-wise Attention Diffusion [64.49056527678606]
We propose a Token-wise Attention mechanism integrated into not only the U-Net diffusion model but also the radar spatio-temporal encoder. Unlike prior approaches, our method integrates attention into the architecture without incurring the high resource cost typical of pixel-space diffusion. Our experiments and evaluations demonstrate that the proposed method significantly outperforms state-of-the-art approaches in robustness, local fidelity, and generalization, and is superior in complex precipitation forecasting scenarios.
arXiv Detail & Related papers (2025-10-16T17:59:13Z) - STM3: Mixture of Multiscale Mamba for Long-Term Spatio-Temporal Time-Series Prediction [12.810918443757382]
Long-term spatio-temporal time-series prediction has developed rapidly, yet existing deep learning methods struggle to learn complex long-term spatio-temporal dependencies efficiently. In this paper, we propose an efficient Spatio-Temporal Multiscale Mamba (STM2) that includes a multiscale Mamba architecture and an adaptive graph causal convolution network to learn the complex multiscale spatio-temporal dependencies.
arXiv Detail & Related papers (2025-08-17T05:29:58Z) - Efficient Multi-Person Motion Prediction by Lightweight Spatial and Temporal Interactions [45.51160285910023]
We propose a computationally efficient model for multi-person motion prediction by simplifying spatial and temporal interactions. We achieve state-of-the-art performance on multiple metrics on the standard CMU-Mocap, MuPoTS-3D, and 3DPW datasets.
arXiv Detail & Related papers (2025-07-13T02:16:37Z) - Cross Space and Time: A Spatio-Temporal Unitized Model for Traffic Flow Forecasting [16.782154479264126]
Predicting spatio-temporal traffic flow presents challenges due to complex interactions between spatial and temporal factors.
Existing approaches address these dimensions in isolation, neglecting their critical interdependencies.
In this paper, we introduce the Spatio-Temporal Unitized Cell (ASTUC), a unified framework designed to capture both spatial and temporal dependencies.
arXiv Detail & Related papers (2024-11-14T07:34:31Z) - PoseMamba: Monocular 3D Human Pose Estimation with Bidirectional Global-Local Spatio-Temporal State Space Model [7.286873011001679]
We propose a purely SSM-based approach with linear complexity for 3D human pose estimation in monocular video. Specifically, we propose a bidirectional global-local spatio-temporal block that comprehensively models human joint relations within individual frames as well as across frames. This strategy provides a more logical geometric ordering, resulting in a combined global-local spatial scan.
arXiv Detail & Related papers (2024-08-07T04:38:03Z) - Rethinking Spatio-Temporal Transformer for Traffic Prediction: Multi-level Multi-view Augmented Learning Framework [4.773547922851949]
Traffic prediction is a challenging spatio-temporal forecasting problem that involves highly complex semantic correlations.
This paper proposes a Multi-level Multi-view Augmented Spatio-Temporal Transformer (LVST) for traffic prediction.
arXiv Detail & Related papers (2024-06-17T07:36:57Z) - A Decoupled Spatio-Temporal Framework for Skeleton-based Action Segmentation [89.86345494602642]
Existing methods are limited by weak spatio-temporal modeling capability.
We propose a Decoupled Spatio-Temporal Framework (DeST) to address these issues.
DeST significantly outperforms current state-of-the-art methods with less computational complexity.
arXiv Detail & Related papers (2023-12-10T09:11:39Z) - Spatio-Temporal Branching for Motion Prediction using Motion Increments [55.68088298632865]
Human motion prediction (HMP) has emerged as a popular research topic due to its diverse applications.
Traditional methods rely on hand-crafted features and machine learning techniques.
We propose a novel spatio-temporal branching network using incremental information for HMP.
arXiv Detail & Related papers (2023-08-02T12:04:28Z) - Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition [62.46544616232238]
Previous motion recognition methods have achieved promising performance through the tightly coupled multimodal spatiotemporal representation.
We propose to decouple and recouple the spatiotemporal representation for RGB-D-based motion recognition.
arXiv Detail & Related papers (2021-12-16T18:59:47Z) - Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block that is capable of extracting features at multiple temporal resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
arXiv Detail & Related papers (2020-11-08T10:40:26Z)
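As a rough illustration of the multi-resolution temporal convolution idea in the entry above, here is a minimal, hypothetical PyTorch block: parallel dilated temporal convolutions cover several temporal ranges and are fused back with a residual connection. Kernel sizes, dilations, and the fusion scheme are assumptions for illustration, not that paper's actual design.

```python
# Hypothetical multi-scale temporal convolution block for 3D-CNN video features.
import torch
import torch.nn as nn


class MultiTemporalConvBlock(nn.Module):
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        # One temporal 3D conv per dilation: the kernel spans time only, so H and W are untouched.
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                      padding=(d, 0, 0), dilation=(d, 1, 1))
            for d in dilations
        )
        self.fuse = nn.Conv3d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):                      # x: (batch, C, T, H, W) feature volume
        multi = torch.cat([branch(x) for branch in self.branches], dim=1)
        return x + self.fuse(multi)            # residual fusion of all temporal scales


if __name__ == "__main__":
    block = MultiTemporalConvBlock(channels=32)
    clip = torch.randn(1, 32, 16, 14, 14)
    print(block(clip).shape)                   # torch.Size([1, 32, 16, 14, 14])
```

Because every branch preserves the temporal length, a block like this can be dropped into an existing 3D-CNN stage without changing the surrounding tensor shapes, which is what makes such designs lightweight to integrate.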