TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis
- URL: http://arxiv.org/abs/2305.00976v2
- Date: Fri, 25 Aug 2023 09:35:46 GMT
- Title: TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis
- Authors: Mathis Petrovich, Michael J. Black, Gül Varol
- Abstract summary: We present TMR, a simple yet effective approach for text-to-3D human motion retrieval.
Our method extends the state-of-the-art text-to-motion synthesis model TEMOS.
We show that maintaining the motion generation loss, along with the contrastive training, is crucial to obtain good performance.
- Score: 59.465092047829835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present TMR, a simple yet effective approach for
text-to-3D human motion retrieval. While previous work has only treated retrieval as a
proxy evaluation metric, we tackle it as a standalone task. Our method extends
the state-of-the-art text-to-motion synthesis model TEMOS, and incorporates a
contrastive loss to better structure the cross-modal latent space. We show that
maintaining the motion generation loss, along with the contrastive training, is
crucial to obtain good performance. We introduce a benchmark for evaluation and
provide an in-depth analysis by reporting results on several protocols. Our
extensive experiments on the KIT-ML and HumanML3D datasets show that TMR
outperforms the prior work by a significant margin, for example reducing the
median rank from 54 to 19. Finally, we showcase the potential of our approach
on moment retrieval. Our code and models are publicly available at
https://mathis.petrovich.fr/tmr.
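As a rough illustration of the objective described in the abstract (a hedged sketch, not the authors' implementation), the snippet below pairs a symmetric InfoNCE-style contrastive term between text and motion latent embeddings with a motion reconstruction (generation) term; the function names, the temperature value, and the weighting factor lambda_nce are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce(text_emb, motion_emb, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired text/motion embeddings.
    Matching pairs sit on the diagonal of the similarity matrix."""
    text_emb = F.normalize(text_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    logits = text_emb @ motion_emb.t() / temperature          # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def tmr_style_loss(text_emb, motion_emb, decoded_motion, gt_motion,
                   lambda_nce=0.1):
    """Contrastive alignment plus the motion generation (reconstruction)
    term that the abstract reports as crucial to keep during training."""
    reconstruction = F.smooth_l1_loss(decoded_motion, gt_motion)
    contrastive = info_nce(text_emb, motion_emb)
    return reconstruction + lambda_nce * contrastive
```

In practice the two embeddings would come from a text encoder and a motion encoder projecting into a shared cross-modal latent space, as in TEMOS-style architectures.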
Related papers
- mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval [67.50604814528553]
We first introduce a text encoder enhanced with RoPE and unpadding, pre-trained in a native 8192-token context.
Then we construct a hybrid TRM and a cross-encoder reranker by contrastive learning.
arXiv Detail & Related papers (2024-07-29T03:12:28Z)
- Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval [4.454835029368504]
We focus on the recently introduced text-motion retrieval task, which aims to search for motion sequences that are most relevant to a natural-language description.
Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models.
We propose to investigate joint-dataset learning - where we train on multiple text-motion datasets simultaneously.
We also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process sequences of skeleton data.
arXiv Detail & Related papers (2024-07-02T09:43:47Z)
- Black-box Adversarial Attacks against Dense Retrieval Models: A Multi-view Contrastive Learning Method [115.29382166356478]
We introduce the adversarial retrieval attack (AREA) task.
It is meant to trick DR models into retrieving a target document that is outside the initial set of candidate documents retrieved by the DR model.
We find that the promising results previously reported on attacking NRMs do not generalize to DR models.
We propose to formalize attacks on DR models as a contrastive learning problem in a multi-view representation space.
arXiv Detail & Related papers (2023-08-19T00:24:59Z)
- Cross-Modal Retrieval for Motion and Text via DopTriple Loss [31.206130522960795]
Cross-modal retrieval of image-text and video-text is a prominent research area in computer vision and natural language processing.
We utilize a concise yet effective dual-unimodal transformer encoder for tackling this task.
arXiv Detail & Related papers (2023-05-07T05:40:48Z)
- Re-Evaluating LiDAR Scene Flow for Autonomous Driving [80.37947791534985]
Popular benchmarks for self-supervised LiDAR scene flow have unrealistic rates of dynamic motion, unrealistic correspondences, and unrealistic sampling patterns.
We evaluate a suite of top methods on a suite of real-world datasets.
We show that despite the emphasis placed on learning, most performance gains are caused by pre- and post-processing steps.
arXiv Detail & Related papers (2023-04-04T22:45:50Z)
- T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations [34.61255243742796]
We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations.
Despite its simplicity, our T2M-GPT shows better performance than competitive approaches.
arXiv Detail & Related papers (2023-01-15T09:34:42Z)
- Back to MLP: A Simple Baseline for Human Motion Prediction [59.18776744541904]
This paper tackles the problem of human motion prediction, which consists of forecasting future body poses from historically observed sequences.
We show that the performance of these approaches can be surpassed by a light-weight architecture built purely from multi-layer perceptrons, with only 0.14M parameters.
An exhaustive evaluation on Human3.6M, AMASS and 3DPW datasets shows that our method, which we dub siMLPe, consistently outperforms all other approaches.
arXiv Detail & Related papers (2022-07-04T16:35:58Z)
- TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [103.85002875155551]
We propose a novel generalized distillation method, TeachText, for exploiting large-scale language pretraining.
We extend our method to video side modalities and show that we can effectively reduce the number of used modalities at test time.
Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time.
arXiv Detail & Related papers (2021-04-16T17:55:28Z)
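For context on the retrieval numbers quoted above (e.g., the median rank dropping from 54 to 19), the following is a minimal sketch of the standard text-to-motion retrieval metrics computed from a text-by-motion similarity matrix; the exact evaluation protocols used by the individual papers may differ.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Text-to-motion retrieval metrics from an (N, N) similarity matrix
    where sim[i, i] scores the ground-truth text/motion pair.
    Returns recall@K (in percent) and the median rank (1-indexed)."""
    order = np.argsort(-sim, axis=1)                 # best match first per text
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1) + 1       # rank of the true pair
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    return metrics
```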