Related papers: A Spatio-temporal Transformer for 3D Human Motion Prediction

A Spatio-temporal Transformer for 3D Human Motion Prediction

URL: http://arxiv.org/abs/2004.08692v3
Date: Mon, 29 Nov 2021 15:13:04 GMT
Title: A Spatio-temporal Transformer for 3D Human Motion Prediction
Authors: Emre Aksan, Manuel Kaufmann, Peng Cao, Otmar Hilliges
Abstract summary: We propose a Transformer-based architecture for the task of generative modelling of 3D human motion. We empirically show that this effectively learns the underlying motion dynamics and reduces error accumulation over time observed in auto-regressive models.
Score: 39.31212055504893
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We propose a novel Transformer-based architecture for the task of generative modelling of 3D human motion. Previous work commonly relies on RNN-based models considering shorter forecast horizons reaching a stationary and often implausible state quickly. Recent studies show that implicit temporal representations in the frequency domain are also effective in making predictions for a predetermined horizon. Our focus lies on learning spatio-temporal representations autoregressively and hence generation of plausible future developments over both short and long term. The proposed model learns high dimensional embeddings for skeletal joints and how to compose a temporally coherent pose via a decoupled temporal and spatial self-attention mechanism. Our dual attention concept allows the model to access current and past information directly and to capture both the structural and the temporal dependencies explicitly. We show empirically that this effectively learns the underlying motion dynamics and reduces error accumulation over time observed in auto-regressive models. Our model is able to make accurate short-term predictions and generate plausible motion sequences over long horizons. We make our code publicly available at https://github.com/eth-ait/motion-transformer.
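The decoupled attention described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see their repository for that). It assumes joint embeddings of shape (frames, joints, dim), omits the learned query/key/value projections, multiple heads, and layer normalization, and combines the two attention streams by a plain residual sum, which is a choice made here for illustration only. The names `attention` and `dual_attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention over the second-to-last axis.
    # q, k, v: (..., n, d); mask: boolean (n, n), True where attending is allowed.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def dual_attention(x):
    # x: (T, J, D) = frames, joints, embedding size (assumed layout).
    T, J, D = x.shape

    # Temporal attention: each joint attends to its own current and past frames.
    causal = np.tril(np.ones((T, T), dtype=bool))
    xt = np.swapaxes(x, 0, 1)                                      # (J, T, D)
    temporal = np.swapaxes(attention(xt, xt, xt, causal), 0, 1)    # (T, J, D)

    # Spatial attention: within each frame, every joint attends to all joints.
    spatial = attention(x, x, x)                                   # (T, J, D)

    # Assumption: combine both streams with a simple residual sum.
    return x + temporal + spatial

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4, 8))   # 5 frames, 4 joints, 8-dim embeddings
y = dual_attention(x)
print(y.shape)
```

Because the temporal stream is causally masked and the spatial stream only looks within a frame, the output at a given frame depends only on that frame and earlier ones, which is what allows a model of this kind to be rolled out autoregressively.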

Related papers

OccTENS: 3D Occupancy World Model via Temporal Next-Scale Prediction [25.630452373274636]
OccTENS is a generative occupancy world model that enables controllable, high-fidelity long-term occupancy generation. We reformulate the occupancy world model as a temporal next-scale prediction (TENS) task. OccTENS outperforms the state-of-the-art method with both higher occupancy quality and faster inference time.
arXiv Detail & Related papers (2025-09-04T05:06:47Z)
A Mixture of Experts Approach to 3D Human Motion Prediction [1.4974445469089412]
This project addresses the challenge of human motion prediction, a critical area for applications such as autonomous vehicle movement detection. Our primary objective is to critically evaluate existing model architectures, identifying their advantages and opportunities for improvement. The particular Mixture of Experts variant used is Soft MoE, a fully-differentiable sparse Transformer that has shown promising ability to enable larger model capacity at lower inference cost.
arXiv Detail & Related papers (2024-05-09T20:26:58Z)
Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past. We leverage the large-scale pretraining of image diffusion models which can handle multi-modality. We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
AMP: Autoregressive Motion Prediction Revisited with Next Token Prediction for Autonomous Driving [59.94343412438211]
We introduce GPT-style next-token prediction into motion prediction. Unlike language data, which is composed of homogeneous units (words), the elements of a driving scene can have complex spatial-temporal and semantic relations. We propose to adopt three factorized attention modules with different neighbors for information aggregation and different position encoding styles to capture their relations.
arXiv Detail & Related papers (2024-03-20T06:22:37Z)
Equivariant Graph Neural Operator for Modeling 3D Dynamics [148.98826858078556]
We propose Equivariant Graph Neural Operator (EGNO) to directly model dynamics as trajectories instead of just next-step prediction. EGNO explicitly learns the temporal evolution of 3D dynamics where we formulate the dynamics as a function over time and learn neural operators to approximate it. Comprehensive experiments in multiple domains, including particle simulations, human motion capture, and molecular dynamics, demonstrate the significantly superior performance of EGNO against existing methods.
arXiv Detail & Related papers (2024-01-19T21:50:32Z)
Triplet Attention Transformer for Spatiotemporal Predictive Learning [9.059462850026216]
We propose an innovative triplet attention transformer designed to capture both inter-frame dynamics and intra-frame static features. The model incorporates the Triplet Attention Module (TAM), which replaces traditional recurrent units by exploring self-attention mechanisms in temporal, spatial, and channel dimensions.
arXiv Detail & Related papers (2023-10-28T12:49:33Z)
SPOTR: Spatio-temporal Pose Transformers for Human Motion Prediction [12.248428883804763]
3D human motion prediction is a research area of high significance and a challenge in computer vision. Traditionally, autoregressive models have been used to predict human motion. We present a non-autoregressive model for human motion prediction.
arXiv Detail & Related papers (2023-03-11T01:44:29Z)
Investigating Pose Representations and Motion Contexts Modeling for 3D Motion Prediction [63.62263239934777]
We conduct an in-depth study on various pose representations with a focus on their effects on the motion prediction task. We propose a novel RNN architecture termed AHMR (Attentive Hierarchical Motion Recurrent network) for motion prediction. Our approach outperforms the state-of-the-art methods in short-term prediction and achieves much enhanced long-term prediction proficiency.
arXiv Detail & Related papers (2021-12-30T10:45:22Z)
Multi-frame sequence generator of 4D human body motion [0.0]
We propose a generative auto-encoder-based framework, which encodes global locomotion, including translation and rotation, and multi-frame temporal motion as a single latent space vector. Our results validate the ability of the model to reconstruct 4D sequences of human morphology within a low error bound. We also illustrate the benefits of the approach for 4D human motion prediction of future frames from initial human frames.
arXiv Detail & Related papers (2021-06-07T13:56:46Z)
End-to-end Contextual Perception and Prediction with Interaction Transformer [79.14001602890417]
We tackle the problem of detecting objects in 3D and forecasting their future motion in the context of self-driving. To capture their spatial-temporal dependencies, we propose a recurrent neural network with a novel Transformer architecture. Our model can be trained end-to-end, and runs in real-time.
arXiv Detail & Related papers (2020-08-13T14:30:12Z)
Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction [29.602903750712713]
We present STAR, a Spatio-Temporal grAph tRansformer framework, which tackles trajectory prediction using only attention mechanisms. We show that STAR achieves state-of-the-art performance on 5 commonly used real-world pedestrian prediction datasets.
arXiv Detail & Related papers (2020-05-18T08:08:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.