Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language
- URL: http://arxiv.org/abs/2305.15842v2
- Date: Wed, 4 Oct 2023 12:37:00 GMT
- Title: Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language
- Authors: Nicola Messina, Jan Sedmidubsky, Fabrizio Falchi, Tomáš Rebok
- Abstract summary: We propose a novel text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language description.
Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions.
- Score: 4.86658723641864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to recent advances in pose-estimation methods, human motion can be
extracted from a common video in the form of 3D skeleton sequences. Despite
wonderful application opportunities, effective and efficient content-based
access to large volumes of such spatio-temporal skeleton data still remains a
challenging problem. In this paper, we propose a novel content-based
text-to-motion retrieval task, which aims at retrieving relevant motions based
on a specified natural-language textual description. To define baselines for
this uncharted task, we employ the BERT and CLIP language representations to
encode the text modality and successful spatio-temporal models to encode the
motion modality. We additionally introduce our transformer-based approach,
called Motion Transformer (MoT), which employs divided space-time attention to
effectively aggregate the different skeleton joints in space and time. Inspired
by the recent progress in text-to-image/video matching, we experiment with two
widely-adopted metric-learning loss functions. Finally, we set up a common
evaluation protocol by defining qualitative metrics for assessing the quality
of the retrieved motions, targeting the two recently-introduced KIT
Motion-Language and HumanML3D datasets. The code for reproducing our results is
available at https://github.com/mesnico/text-to-motion-retrieval.
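To make the setup concrete, the sketch below shows a minimal PyTorch-style dual-encoder pipeline: a motion encoder built from a divided space-time attention block over skeleton joints and frames, paired with a CLIP-style symmetric InfoNCE loss as one example of a widely adopted metric-learning objective. This is not the authors' code (see the repository above for the actual implementation); the module names, embedding dimension, joint count, and the specific loss are illustrative assumptions, since the abstract does not name the two losses it uses.

```python
# Minimal illustrative sketch (not the authors' code): dual-encoder
# text-to-motion retrieval with divided space-time attention and a
# symmetric InfoNCE loss. Dimensions and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DividedSpaceTimeBlock(nn.Module):
    """One block that attends over joints (space), then over frames (time)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, joints, dim)
        b, t, j, d = x.shape
        # Spatial attention: joints attend to each other within every frame.
        xs = x.reshape(b * t, j, d)
        h = self.norm1(xs)
        xs = xs + self.spatial_attn(h, h, h)[0]
        x = xs.reshape(b, t, j, d)
        # Temporal attention: each joint attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * j, t, d)
        h = self.norm2(xt)
        xt = xt + self.temporal_attn(h, h, h)[0]
        return xt.reshape(b, j, t, d).permute(0, 2, 1, 3)


class MotionEncoder(nn.Module):
    """Projects per-joint 3D coordinates, applies divided attention, mean-pools."""

    def __init__(self, dim: int = 128, depth: int = 2):
        super().__init__()
        self.input_proj = nn.Linear(3, dim)  # (x, y, z) per joint
        self.blocks = nn.ModuleList([DividedSpaceTimeBlock(dim) for _ in range(depth)])
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, skeletons: torch.Tensor) -> torch.Tensor:
        # skeletons: (batch, frames, joints, 3)
        x = self.input_proj(skeletons)
        for block in self.blocks:
            x = block(x)
        return F.normalize(self.out_proj(x.mean(dim=(1, 2))), dim=-1)


def symmetric_info_nce(text_emb: torch.Tensor, motion_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style contrastive loss over matched (text, motion) pairs in a batch."""
    logits = text_emb @ motion_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    motion_encoder = MotionEncoder()
    skeletons = torch.randn(8, 60, 21, 3)                 # 8 clips, 60 frames, 21 joints
    text_emb = F.normalize(torch.randn(8, 128), dim=-1)   # stand-in for a BERT/CLIP text encoder
    loss = symmetric_info_nce(text_emb, motion_encoder(skeletons))
    print(loss.item())
```

In such a setup the text encoder (e.g., BERT or CLIP) and the motion encoder are trained jointly so that matching text-motion pairs end up close in the shared embedding space, and retrieval then reduces to a nearest-neighbour search over motion embeddings.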
Related papers
- Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval [4.454835029368504]
We focus on the recently introduced text-motion retrieval task, which aims to search for motion sequences that are most relevant to a natural-language description.
Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models.
We propose to investigate joint-dataset learning - where we train on multiple text-motion datasets simultaneously.
We also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process sequences of skeleton data.
arXiv Detail & Related papers (2024-07-02T09:43:47Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the state of the art while being nearly 220 times faster in computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Text-controlled Motion Mamba: Text-Instructed Temporal Grounding of Human Motion [21.750804738752105]
We introduce the novel task of text-based human motion grounding (THMG).
TM-Mamba is a unified model that integrates temporal global context, language query control, and spatial graph topology with only linear memory cost.
For evaluation, we introduce BABEL-Grounding, the first text-motion dataset that provides detailed textual descriptions of human actions along with their corresponding temporal segments.
arXiv Detail & Related papers (2024-04-17T13:33:09Z)
- GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts [48.28000728061778]
We propose a method that integrates an open vocabulary scene encoder into the architecture, establishing a robust connection between text and scene.
Our methodology achieves up to a 30% reduction in the goal object distance metric compared to the prior state-of-the-art baseline model.
arXiv Detail & Related papers (2024-04-08T18:24:12Z)
- SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z)
- Text-driven Video Prediction [83.04845684117835]
We propose a new task called Text-driven Video Prediction (TVP).
Taking the first frame and text caption as inputs, this task aims to synthesize the following frames.
To investigate the capability of text in causal inference for progressive motion information, our TVP framework contains a Text Inference Module (TIM).
arXiv Detail & Related papers (2022-10-06T12:43:07Z)
- TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts [20.336481832461168]
Inspired by the strong ties between vision and language, our paper aims to explore the generation of 3D human full-body motions from texts.
We propose the use of motion tokens, a discrete and compact motion representation.
Our approach is flexible and can be used for both text2motion and motion2text tasks.
arXiv Detail & Related papers (2022-07-04T19:52:18Z)
- TEMOS: Generating diverse human motions from textual descriptions [53.85978336198444]
We address the problem of generating diverse 3D human motions from textual descriptions.
We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data.
We show that the TEMOS framework can produce both skeleton-based animations, as in prior work, and more expressive SMPL body motions.
arXiv Detail & Related papers (2022-04-25T14:53:06Z)
- DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using a natural-language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm.
A temporal sub-graph captures the activities within the video through time.
arXiv Detail & Related papers (2020-10-13T09:50:29Z)