Joint-Relation Transformer for Multi-Person Motion Prediction
- URL: http://arxiv.org/abs/2308.04808v2
- Date: Fri, 27 Oct 2023 03:20:54 GMT
- Title: Joint-Relation Transformer for Multi-Person Motion Prediction
- Authors: Qingyao Xu, Weibo Mao, Jingze Gong, Chenxin Xu, Siheng Chen, Weidi
Xie, Ya Zhang, Yanfeng Wang
- Abstract summary: We propose the Joint-Relation Transformer to enhance interaction modeling.
Our method achieves a 13.4% improvement in 900ms VIM on 3DPW-SoMoF/RC and a 17.8%/12.0% improvement in 3s MPJPE on CMU-Mocap/MuPoTS-3D.
- Score: 79.08243886832601
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-person motion prediction is a challenging problem due to the dependency
of motion on both individual past movements and interactions with other people.
Transformer-based methods have shown promising results on this task, but they
miss the explicit relation representation between joints, such as skeleton
structure and pairwise distance, which is crucial for accurate interaction
modeling. In this paper, we propose the Joint-Relation Transformer, which
utilizes relation information to enhance interaction modeling and improve
future motion prediction. Our relation information contains the relative
distance and the intra-/inter-person physical constraints. To fuse relation and
joint information, we design a novel joint-relation fusion layer with
relation-aware attention to update both features. Additionally, we supervise
the relation information by forecasting future distance. Experiments show that
our method achieves a 13.4% improvement in 900ms VIM on 3DPW-SoMoF/RC and a
17.8%/12.0% improvement in 3s MPJPE on the CMU-Mocap/MuPoTS-3D datasets.
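As a rough illustration of the relation-aware attention the abstract describes, here is a minimal PyTorch sketch that biases joint-to-joint attention scores with pairwise relation features (e.g., relative distance and intra-/inter-person connectivity flags). All class names, shapes, and the bias design are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RelationAwareAttention(nn.Module):
    """Hypothetical sketch: bias joint-to-joint attention with relation
    features such as pairwise distance and intra-/inter-person flags."""
    def __init__(self, dim: int, rel_dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Project relation features (N, N, rel_dim) to a scalar attention bias.
        self.rel_bias = nn.Linear(rel_dim, 1)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, rel: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) features for all persons' joints; rel: (N, N, rel_dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = (q @ k.transpose(-2, -1)) * self.scale     # (N, N) joint-joint scores
        attn = attn + self.rel_bias(rel).squeeze(-1)      # relation-aware bias
        return attn.softmax(dim=-1) @ v

# Toy usage: 2 persons x 15 joints = 30 joint tokens.
N, dim, rel_dim = 30, 64, 3
layer = RelationAwareAttention(dim, rel_dim)
out = layer(torch.randn(N, dim), torch.randn(N, N, rel_dim))
print(out.shape)  # torch.Size([30, 64])
```

In the paper's setting the relation features are themselves updated and supervised by forecasting future distances; this sketch shows only the direction of fusion from relations into attention.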
Related papers
- Relation Learning and Aggregate-attention for Multi-person Motion Prediction [13.052342503276936]
Multi-person motion prediction considers not just the skeleton structures or human trajectories but also the interactions with others.
Previous methods often overlook that joint relations within an individual (intra-relation) and interactions among groups (inter-relation) are distinct types of representations.
We introduce a new collaborative framework for multi-person motion prediction that explicitly models these relations.
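A minimal sketch of the intra-/inter-relation distinction this summary draws: boolean masks separating within-person joint pairs from cross-person pairs, which a model could use to treat the two relation types differently. The helper name and layout are assumptions.

```python
import torch

def relation_masks(num_persons: int, joints_per_person: int):
    """Hypothetical helper: split joint pairs into intra-person and
    inter-person relations, mirroring the paper's two relation types."""
    person_id = torch.arange(num_persons).repeat_interleave(joints_per_person)
    intra = person_id.unsqueeze(0) == person_id.unsqueeze(1)  # (N, N) bool
    return intra, ~intra  # intra-relation mask, inter-relation mask

intra, inter = relation_masks(num_persons=3, joints_per_person=15)
print(intra.shape, intra.sum().item(), inter.sum().item())  # 45x45 pairs split
```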
arXiv Detail & Related papers (2024-11-06T07:48:30Z) - Ask, Pose, Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models [5.541130887628606]
Social dynamics in close human interactions pose significant challenges for Human Mesh Estimation (HME).
We introduce a novel data generation method that utilizes Large Vision Language Models (LVLMs) to annotate contact maps which guide test-time optimization to produce paired image and pseudo-ground truth meshes.
This methodology not only alleviates the annotation burden but also enables the assembly of a comprehensive dataset specifically tailored for close interactions in HME.
arXiv Detail & Related papers (2024-10-01T01:14:24Z) - InterControl: Zero-shot Human Interaction Generation by Controlling Every Joint [67.6297384588837]
We introduce a novel controllable motion generation method, InterControl, to encourage synthesized motions to maintain the desired distance between joint pairs.
We demonstrate that the distances between joint pairs for human interactions can be generated using an off-the-shelf Large Language Model.
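To illustrate the joint-pair distance control described above, here is a hedged sketch of a distance-constraint loss that could steer generated motion toward desired joint-pair distances (e.g., values proposed by a language model). The function, pair encoding, and joint indices are hypothetical, not InterControl's actual objective.

```python
import torch

def pair_distance_loss(joints: torch.Tensor, pairs: torch.Tensor,
                       target_dist: torch.Tensor) -> torch.Tensor:
    """Hypothetical loss: penalize deviation of selected joint-pair
    distances from desired values."""
    # joints: (P, J, 3) world-space joints for P persons
    # pairs: (K, 4) rows of (person_a, joint_a, person_b, joint_b)
    # target_dist: (K,) desired distances in meters
    a = joints[pairs[:, 0], pairs[:, 1]]   # (K, 3)
    b = joints[pairs[:, 2], pairs[:, 3]]   # (K, 3)
    dist = (a - b).norm(dim=-1)            # (K,) current distances
    return ((dist - target_dist) ** 2).mean()

# Toy usage: pull person 0's joint 9 to 0.1 m from person 1's joint 9.
joints = torch.randn(2, 22, 3, requires_grad=True)
loss = pair_distance_loss(joints, torch.tensor([[0, 9, 1, 9]]), torch.tensor([0.1]))
loss.backward()  # gradients could guide the motion generator
```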
arXiv Detail & Related papers (2023-11-27T14:32:33Z) - Towards a Unified Transformer-based Framework for Scene Graph Generation
and Human-object Interaction Detection [116.21529970404653]
We introduce SG2HOI+, a unified one-step model based on the Transformer architecture.
Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection.
Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
arXiv Detail & Related papers (2023-11-03T07:25:57Z) - Spatio-temporal MLP-graph network for 3D human pose estimation [8.267311047244881]
Graph convolutional networks and their variants have shown significant promise in 3D human pose estimation.
We introduce a new weighted Jacobi feature rule obtained through graph filtering with implicit propagation fairing.
We also employ adjacency modulation with the aim of learning meaningful correlations beyond those defined between body joints.
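A minimal sketch of the adjacency-modulation idea mentioned above: a learnable matrix is added to the fixed skeleton adjacency so a graph layer can exploit correlations beyond the predefined bone connections. The class and the placeholder adjacency are assumptions, not the paper's exact formulation (which also involves the weighted Jacobi feature rule).

```python
import torch
import torch.nn as nn

class ModulatedGraphConv(nn.Module):
    """Hypothetical sketch of adjacency modulation for pose estimation."""
    def __init__(self, in_dim: int, out_dim: int, adjacency: torch.Tensor):
        super().__init__()
        self.register_buffer("A", adjacency)                    # (J, J) skeleton graph
        self.delta = nn.Parameter(torch.zeros_like(adjacency))  # learned modulation
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (J, in_dim) per-joint features
        A_mod = self.A + self.delta  # modulated adjacency: bones + learned links
        return A_mod @ self.proj(x)

J = 16
A = torch.eye(J)  # placeholder; a real adjacency encodes bone connectivity
layer = ModulatedGraphConv(32, 32, A)
print(layer(torch.randn(J, 32)).shape)  # torch.Size([16, 32])
```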
arXiv Detail & Related papers (2023-08-29T14:00:55Z) - Auxiliary Tasks Benefit 3D Skeleton-based Human Motion Prediction [106.06256351200068]
This paper introduces a model learning framework with auxiliary tasks.
In our auxiliary tasks, the coordinates of some body joints are corrupted by either masking or added noise.
We propose a novel auxiliary-adapted transformer, which can handle incomplete, corrupted motion data.
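The auxiliary-task corruption described above (masking or noising joint coordinates) could look like the following sketch; the mask ratio, noise scale, and function name are illustrative assumptions.

```python
import torch

def corrupt_joints(motion: torch.Tensor, mask_ratio: float = 0.2,
                   noise_std: float = 0.02) -> torch.Tensor:
    """Hypothetical auxiliary-task corruption: add Gaussian noise to all
    joints, then zero out (mask) a random subset of them."""
    # motion: (T, J, 3) joint coordinates over T frames
    corrupted = motion + noise_std * torch.randn_like(motion)
    mask = torch.rand(motion.shape[:2]) < mask_ratio  # (T, J) joints to mask
    corrupted[mask] = 0.0
    return corrupted

motion = torch.randn(50, 22, 3)
aux_input = corrupt_joints(motion)  # the model is trained to repair this
```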
arXiv Detail & Related papers (2023-08-17T12:26:11Z) - PGformer: Proxy-Bridged Game Transformer for Multi-Person Highly
Interactive Extreme Motion Prediction [22.209454616479505]
This paper focuses on collaborative motion prediction for multiple persons with extreme motions.
A proxy unit is introduced to bridge the involved persons, which cooperates with our proposed XQA module.
Our approach is also compatible with the weakly interacting CMU-Mocap and MuPoTS-3D datasets.
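The summary gives no details of the XQA module, but the proxy-bridging idea can be sketched as a learned token that gathers information from all persons and scatters it back, instead of dense pairwise cross-person attention. Everything below is a hypothetical illustration of that bridging pattern.

```python
import torch
import torch.nn as nn

class ProxyBridge(nn.Module):
    """Hypothetical sketch of a proxy unit: persons interact through a
    shared learned token rather than attending to each other directly."""
    def __init__(self, dim: int):
        super().__init__()
        self.proxy = nn.Parameter(torch.zeros(1, 1, dim))
        self.gather = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.scatter = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, persons: torch.Tensor) -> torch.Tensor:
        # persons: (1, P, dim), one feature vector per person
        proxy, _ = self.gather(self.proxy, persons, persons)  # proxy reads persons
        update, _ = self.scatter(persons, proxy, proxy)       # persons read proxy
        return persons + update

bridge = ProxyBridge(dim=64)
print(bridge(torch.randn(1, 3, 64)).shape)  # torch.Size([1, 3, 64])
```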
arXiv Detail & Related papers (2023-06-06T03:25:09Z) - (Fusionformer):Exploiting the Joint Motion Synergy with Fusion Network
Based On Transformer for 3D Human Pose Estimation [1.52292571922932]
Many previous methods lack an understanding of local joint information; prior work considers only the temporal relationship of a single joint.
Our proposed Fusionformer method introduces a global-temporal self-trajectory module and a cross-temporal self-trajectory module.
The results show improvements of 2.4% in MPJPE and 4.3% in P-MPJPE on the Human3.6M dataset.
arXiv Detail & Related papers (2022-10-08T12:22:10Z) - MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose
Estimation in Video [75.23812405203778]
Recent solutions estimate 3D human pose from a 2D keypoint sequence by considering body joints among all frames globally to learn spatio-temporal correlation.
We propose MixSTE, which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to model inter-joint spatial correlation.
In addition, the network output is extended from the central frame to the entire frames of the input video, improving the coherence between the input and output sequences.
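The alternating design described above can be sketched as a block that applies temporal self-attention per joint and spatial self-attention per frame; dimensions and layer choices below are assumptions, not MixSTE's exact configuration.

```python
import torch
import torch.nn as nn

class MixedSTBlock(nn.Module):
    """Hypothetical sketch of a mixed encoder block: a temporal transformer
    models each joint's motion across frames, then a spatial transformer
    models inter-joint correlation within each frame."""
    def __init__(self, dim: int):
        super().__init__()
        self.temporal = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.spatial = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, J, dim) features for T frames and J joints
        x = self.temporal(x.transpose(0, 1))  # (J, T, dim): attend over time
        x = self.spatial(x.transpose(0, 1))   # (T, J, dim): attend over joints
        return x

block = MixedSTBlock(dim=64)
print(block(torch.randn(27, 17, 64)).shape)  # torch.Size([27, 17, 64])
```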
arXiv Detail & Related papers (2022-03-02T04:20:59Z) - End-to-end Contextual Perception and Prediction with Interaction
Transformer [79.14001602890417]
We tackle the problem of detecting objects in 3D and forecasting their future motion in the context of self-driving.
To capture their spatial-temporal dependencies, we propose a recurrent neural network with a novel Transformer architecture.
Our model can be trained end-to-end, and runs in real-time.
arXiv Detail & Related papers (2020-08-13T14:30:12Z)