MLP: Motion Label Prior for Temporal Sentence Localization in Untrimmed 3D Human Motions
- URL: http://arxiv.org/abs/2404.13657v1
- Date: Sun, 21 Apr 2024 13:25:46 GMT
- Title: MLP: Motion Label Prior for Temporal Sentence Localization in Untrimmed 3D Human Motions
- Authors: Sheng Yan, Mengyuan Liu, Yong Wang, Yang Liu, Chen Chen, Hong Liu
- Abstract summary: We aim to locate a target moment from a 3D human motion that semantically corresponds to a text query.
To refine this, we devise two novel label-prior-assisted training schemes.
We show that injecting label-prior knowledge into the model is crucial for improving performance at high IoU.
- Score: 20.986063755422173
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we address the unexplored question of temporal sentence localization in human motions (TSLM), aiming to locate a target moment from a 3D human motion that semantically corresponds to a text query. Because 3D human motions are captured with specialized motion capture devices, sequences with only a few joints lack complex scene information such as objects and lighting. Owing to this characteristic, motion data has low contextual richness and semantic ambiguity between frames, which limits current video localization frameworks, when extended to TSLM, to only rough predictions. To refine this, we devise two novel label-prior-assisted training schemes: one embeds prior knowledge of foreground and background to highlight the localization chances of target moments, and the other forces the originally rough predictions to overlap with the more accurate predictions obtained from the flipped start/end prior label sequences during recovery training. We show that injecting label-prior knowledge into the model is crucial for improving performance at high IoU. On our constructed TSLM benchmark, our model, termed MLP, achieves a recall of 44.13 at IoU@0.7 on the BABEL dataset and 71.17 on HumanML3D (Restore), outperforming prior works. Finally, we showcase the potential of our approach in corpus-level moment retrieval. Our source code is openly accessible at https://github.com/eanson023/mlp.
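As a concrete reference for the numbers quoted above, the following is a minimal sketch of the standard temporal-IoU recall metric (recall at IoU@0.7); the function names and toy data are illustrative and not taken from the MLP codebase.

```python
# Minimal sketch of temporal IoU and Recall@IoU, the metric reported above.
# Names and toy data are illustrative; this is not the MLP repository code.
from typing import List, Tuple

Moment = Tuple[float, float]  # (start, end) in seconds or frames

def temporal_iou(pred: Moment, gt: Moment) -> float:
    """Intersection-over-union of two 1D temporal segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds: List[Moment], gts: List[Moment], thresh: float = 0.7) -> float:
    """Percentage of queries whose top-1 prediction overlaps the ground
    truth with IoU >= thresh (i.e., R@1 at the given IoU threshold)."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)

# Toy usage: one tight and one rough localization.
preds = [(2.0, 6.1), (10.0, 14.0)]
gts = [(2.2, 6.0), (11.5, 18.0)]
print(recall_at_iou(preds, gts, 0.7))  # -> 50.0
```

Under this metric, only predictions that overlap the ground-truth moment tightly count as hits, which is why the abstract stresses performance at high IoU.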
Related papers
- Exploring the Capabilities of LLMs for IMU-based Fine-grained Human Activity Understanding [1.1228672751176365]
Human activity recognition (HAR) using inertial measurement units (IMUs) increasingly leverages large language models (LLMs).
Our preliminary study indicates that pretrained LLMs fail catastrophically on fine-grained HAR tasks such as air-written letter recognition, achieving only near-random guessing accuracy.
To extend this to 3D, we designed an encoder-based pipeline that maps 3D data into 2D equivalents, preserving the temporal information for robust letter prediction.
Our end-to-end pipeline achieves 78% accuracy on word recognition with up to 5 letters in mid-air writing scenarios, establishing LLMs as viable tools for …
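The summary leaves the 3D-to-2D mapping unspecified; purely as a hedged illustration (an assumption, not the paper's encoder-based pipeline), one simple way to flatten an air-written 3D trajectory into a 2D equivalent while preserving temporal order is to project it onto its two principal axes:

```python
# Hypothetical sketch: flatten a 3D air-writing trajectory to 2D by
# projecting onto its two principal axes. An illustrative stand-in,
# not the paper's encoder-based pipeline.
import numpy as np

def project_trajectory_to_2d(points_3d: np.ndarray) -> np.ndarray:
    """points_3d: (T, 3) array ordered in time; returns a (T, 2) projection."""
    centered = points_3d - points_3d.mean(axis=0)
    # Rows of vt are the principal directions of the centered trajectory.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T  # keep the two dominant axes; order is preserved

# Toy usage: a tilted circle in 3D becomes a plain circle in 2D.
t = np.linspace(0, 2 * np.pi, 100)
trajectory = np.stack([np.cos(t), np.sin(t), 0.3 * np.cos(t)], axis=1)
print(project_trajectory_to_2d(trajectory).shape)  # (100, 2)
```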
arXiv Detail & Related papers (2025-04-02T03:42:58Z)
- ReMP: Reusable Motion Prior for Multi-domain 3D Human Pose Estimation and Motion Inbetweening [10.813269931915364]
We learn a rich motion prior from sequences of complete parametric models of human body shape.
Our prior can estimate poses for missing frames or from noisy measurements.
ReMP consistently outperforms the baseline method on diverse and practical 3D motion data.
arXiv Detail & Related papers (2024-11-13T02:42:07Z)
- Past Movements-Guided Motion Representation Learning for Human Motion Prediction [0.0]
We propose a self-supervised learning framework designed to enhance motion representation.
The framework consists of two stages: in the first, the network is pretrained through self-reconstruction of past sequences and guided reconstruction of future sequences based on past movements.
Our method reduces the average prediction errors by 8.8% across the Human3.6M, 3DPW, and AMASS datasets.
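Read literally, the pretraining stage above combines two reconstruction tasks; a minimal sketch of such a combined objective (the encoder/decoder modules and the weighting are assumptions, not the paper's code) is:

```python
# Sketch of a two-term pretraining loss: self-reconstruction of the past
# plus past-guided prediction of the future. Modules and the 0.5 weight
# are assumptions for illustration only.
import torch
import torch.nn.functional as F

def pretrain_loss(encoder, past_decoder, future_decoder,
                  past: torch.Tensor, future: torch.Tensor,
                  future_weight: float = 0.5) -> torch.Tensor:
    """past, future: (B, T, J, 3) joint sequences."""
    z = encoder(past)                # latent representation of past movements
    past_hat = past_decoder(z)       # task 1: reconstruct the past sequence
    future_hat = future_decoder(z)   # task 2: predict the future from the past
    return F.mse_loss(past_hat, past) + future_weight * F.mse_loss(future_hat, future)
```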
arXiv Detail & Related papers (2024-08-04T17:00:37Z)
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
- Pose Priors from Language Models [74.61186408764559]
Language is often used to describe physical interaction, yet most 3D human pose estimation methods overlook this rich source of information.
We bridge this gap by leveraging large multimodal models (LMMs) as priors for reconstructing contact poses.
arXiv Detail & Related papers (2024-05-06T17:59:36Z)
- HMP: Hand Motion Priors for Pose and Shape Estimation from Video [52.39020275278984]
We develop a generative motion prior specific to hands, trained on the AMASS dataset, which features diverse and high-quality hand motions.
Our integration of a robust motion prior significantly enhances performance, especially in occluded scenarios.
We demonstrate our method's efficacy via qualitative and quantitative evaluations on the HO3D and DexYCB datasets.
arXiv Detail & Related papers (2023-12-27T22:35:33Z)
- Social-Transmotion: Promptable Human Trajectory Prediction [65.80068316170613]
Social-Transmotion is a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior.
Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
arXiv Detail & Related papers (2023-12-26T18:56:49Z)
- Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language [4.86658723641864]
We propose a novel text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural language description.
Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions.
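The two losses are not named in the summary; a common choice in text-to-image/video matching is the symmetric InfoNCE objective, sketched below under that assumption:

```python
# Sketch of a symmetric InfoNCE loss over paired text/motion embeddings,
# a widely used metric-learning objective in cross-modal retrieval.
# An assumed example, not code from the paper.
import torch
import torch.nn.functional as F

def info_nce(text_emb: torch.Tensor, motion_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """text_emb, motion_emb: (B, D) embeddings, matched row-wise."""
    text_emb = F.normalize(text_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    logits = text_emb @ motion_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(text_emb.size(0), device=logits.device)
    # Symmetric over both retrieval directions: text-to-motion and motion-to-text.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```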
arXiv Detail & Related papers (2023-05-25T08:32:41Z)
- MotionBERT: A Unified Perspective on Learning Human Motion Representations [46.67364057245364]
We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale and heterogeneous data resources.
We propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy partial 2D observations.
We implement the motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network.
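As a hedged sketch of what recovering 3D motion from "noisy partial 2D observations" can look like in such a pretraining stage, one can randomly jitter and mask the 2D keypoints before feeding them to the encoder (the corruption parameters here are illustrative, not MotionBERT's published settings):

```python
# Illustrative corruption step for 2D-to-3D pretraining: jitter and drop
# 2D keypoints to form "noisy partial" observations. Parameters are
# assumptions, not MotionBERT's published settings.
import torch

def corrupt_2d(joints_2d: torch.Tensor, mask_prob: float = 0.15,
               noise_std: float = 0.02) -> torch.Tensor:
    """joints_2d: (B, T, J, 2) normalized keypoints."""
    noisy = joints_2d + noise_std * torch.randn_like(joints_2d)
    keep = torch.rand(joints_2d.shape[:-1], device=joints_2d.device) > mask_prob
    return noisy * keep.unsqueeze(-1).to(noisy.dtype)  # zero out dropped joints

# Usage sketch: loss = F.mse_loss(motion_encoder(corrupt_2d(x_2d)), gt_3d)
```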
arXiv Detail & Related papers (2022-10-12T19:46:25Z)
- Live Stream Temporally Embedded 3D Human Body Pose and Shape Estimation [13.40702053084305]
We present a temporally embedded 3D human body pose and shape estimation (TePose) method to improve the accuracy and temporal consistency of pose estimation in live stream videos.
A multi-scale convolutional network is presented as the motion discriminator for adversarial training using datasets without any 3D labeling.
arXiv Detail & Related papers (2022-07-25T21:21:59Z)
- P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation [78.83305967085413]
This paper introduces a novel Pre-trained Spatial Temporal Many-to-One (P-STMO) model for the 2D-to-3D human pose estimation task.
Our method outperforms state-of-the-art methods with fewer parameters and less computational overhead.
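"Many-to-one" here denotes mapping a window of 2D poses to the 3D pose of a single (center) frame; the toy module below only illustrates that interface and is not P-STMO's architecture:

```python
# Schematic many-to-one interface: a window of T 2D poses predicts the 3D
# pose of the center frame. The tiny fully connected model is a placeholder,
# not P-STMO.
import torch
import torch.nn as nn

class ManyToOneLifter(nn.Module):
    def __init__(self, num_joints: int = 17, window: int = 81, hidden: int = 256):
        super().__init__()
        self.num_joints = num_joints
        self.net = nn.Sequential(
            nn.Flatten(),                              # (B, T*J*2)
            nn.Linear(window * num_joints * 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_joints * 3),         # center-frame 3D pose
        )

    def forward(self, seq_2d: torch.Tensor) -> torch.Tensor:
        """seq_2d: (B, T, J, 2) -> (B, J, 3) 3D pose of the center frame."""
        return self.net(seq_2d).view(-1, self.num_joints, 3)

x = torch.randn(4, 81, 17, 2)
print(ManyToOneLifter()(x).shape)  # torch.Size([4, 17, 3])
```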
arXiv Detail & Related papers (2022-03-15T04:00:59Z)
- Improving Robustness and Accuracy via Relative Information Encoding in 3D Human Pose Estimation [59.94032196768748]
We propose a relative information encoding method that yields positional and temporal enhanced representations.
Our method outperforms state-of-the-art methods on two public datasets.
arXiv Detail & Related papers (2021-07-29T14:12:19Z)
- We are More than Our Joints: Predicting how 3D Bodies Move [63.34072043909123]
We train a novel variational autoencoder that generates motions from latent frequencies.
Experiments show that our method produces state-of-the-art results and realistic 3D body animations.
arXiv Detail & Related papers (2020-12-01T16:41:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.