Cross-Modal Retrieval for Motion and Text via DropTriple Loss
- URL: http://arxiv.org/abs/2305.04195v3
- Date: Tue, 3 Oct 2023 04:42:09 GMT
- Title: Cross-Modal Retrieval for Motion and Text via DropTriple Loss
- Authors: Sheng Yan, Yang Liu, Haoqiang Wang, Xin Du, Mengyuan Liu, Hong Liu
- Abstract summary: Cross-modal retrieval of image-text and video-text is a prominent research area in computer vision and natural language processing.
We utilize a concise yet effective dual-unimodal transformer encoder for tackling this task.
- Score: 31.206130522960795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal retrieval of image-text and video-text is a prominent research
area in computer vision and natural language processing. However, there has
been insufficient attention given to cross-modal retrieval between human motion
and text, despite its wide-ranging applicability. To address this gap, we
utilize a concise yet effective dual-unimodal transformer encoder for tackling
this task. Recognizing that overlapping atomic actions in different human
motion sequences can lead to semantic conflicts between samples, we explore a
novel triplet loss function called DropTriple Loss. This loss function discards
false negative samples from the negative sample set and focuses on mining the
remaining genuinely hard negative samples for triplet training, thereby
reducing the margin violations that false negatives would otherwise cause. We
evaluate our model and approach on the
HumanML3D and KIT Motion-Language datasets. On the latest HumanML3D dataset, we
achieve a recall of 62.9% for motion retrieval and 71.5% for text retrieval
(both based on R@10). The source code for our approach is publicly available at
https://github.com/eanson023/rehamot.
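The core mechanism of DropTriple Loss is easy to prototype from the description above. The following is a minimal PyTorch sketch, not the authors' implementation (the rehamot repository is the authoritative reference): it assumes L2-normalized embeddings, a VSE-style hinge over in-batch negatives, and a relative similarity threshold `drop_thresh` for flagging false negatives; the threshold rule and hyper-parameter names are illustrative assumptions.

```python
import torch


def drop_triple_loss(motion_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     margin: float = 0.2,
                     drop_thresh: float = 0.9) -> torch.Tensor:
    """Sketch of a DropTriple-style bidirectional triplet loss.

    Rows of `motion_emb` and `text_emb` (shape (B, D), assumed
    L2-normalized) are matched motion-text pairs. `drop_thresh` is an
    assumed hyper-parameter: in-batch negatives scoring above this
    fraction of the positive similarity are treated as false negatives
    and removed before hard-negative mining.
    """
    sim = motion_emb @ text_emb.t()        # sim[i, j] = <motion_i, text_j>
    pos = sim.diag().view(-1, 1)           # positive scores sim[i, i], shape (B, 1)
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)

    # Hinge cost of every candidate triplet, in both retrieval directions.
    cost_m2t = (margin + sim - pos).clamp(min=0)      # motion anchors, text negatives (rows)
    cost_t2m = (margin + sim - pos.t()).clamp(min=0)  # text anchors, motion negatives (cols)

    # Drop positives and suspected false negatives: overlapping atomic
    # actions can make a "negative" nearly as similar as the true match.
    drop_m2t = eye | (sim > drop_thresh * pos)
    drop_t2m = eye | (sim > drop_thresh * pos.t())

    # Mine the hardest genuinely negative sample that survives the drop.
    hardest_m2t = cost_m2t.masked_fill(drop_m2t, 0).max(dim=1).values
    hardest_t2m = cost_t2m.masked_fill(drop_t2m, 0).max(dim=0).values

    return (hardest_m2t + hardest_t2m).mean()
```

Setting `drop_thresh` high enough that no negative is flagged recovers standard hardest-negative triplet mining (as in VSE++); the drop mask is the only change, which is what makes the idea cheap to add to an existing retrieval pipeline.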
Related papers
- Occlusion-Aware 3D Motion Interpretation for Abnormal Behavior Detection [10.782354892545651]
We present OAD2D, which detects motion abnormalities by reconstructing the 3D coordinates of mesh vertices and human joints from monocular videos.
We reformulate abnormal posture estimation by coupling it with a Motion to Text (M2T) model, in which a VQVAE is employed to quantize motion features.
Our approach demonstrates the robustness of abnormal behavior detection against severe occlusions and self-occlusions, as it reconstructs human motion trajectories in global coordinates.
arXiv Detail & Related papers (2024-07-23T18:41:16Z) - AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation [55.179287851188036]
We introduce a novel all-in-one-stage framework, AiOS, for expressive human pose and shape recovery without an additional human detection step.
We first employ a human token to probe a human location in the image and encode global features for each instance.
Then, we introduce a joint-related token to probe the human joints in the image and encode fine-grained local features.
arXiv Detail & Related papers (2024-03-26T17:59:23Z) - RoHM: Robust Human Motion Reconstruction via Diffusion [58.63706638272891]
RoHM is an approach for robust 3D human motion reconstruction from monocular RGB(-D) videos.
Conditioned on noisy and occluded input data, it reconstructs complete, plausible motions in consistent global coordinates.
Our method outperforms state-of-the-art approaches qualitatively and quantitatively, while being faster at test time.
arXiv Detail & Related papers (2024-01-16T18:57:50Z) - Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice.
We investigate whether we can go beyond human data on tasks where we have access to scalar feedback.
We find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data.
arXiv Detail & Related papers (2023-12-11T18:17:43Z) - TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis [59.465092047829835]
We present TMR, a simple yet effective approach for text-to-3D human motion retrieval.
Our method extends the state-of-the-art text-to-motion synthesis model TEMOS.
We show that maintaining the motion generation loss, along with the contrastive training, is crucial to obtain good performance.
arXiv Detail & Related papers (2023-05-02T17:52:41Z) - Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning [25.230786853723203]
We propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages.
We use Machine Translation to construct pseudo-parallel sentence pairs for low-resource languages.
We introduce a multi-view self-distillation method to learn noise-robust target-language representations.
arXiv Detail & Related papers (2022-08-26T09:32:24Z) - Intra-Modal Constraint Loss For Image-Text Retrieval [10.496611712280972]
Cross-modal retrieval has drawn much attention in computer vision and natural language processing domains.
With the development of convolutional and recurrent neural networks, the bottleneck of retrieval across image-text modalities is no longer the extraction of image and text features but the design of an efficient loss function for learning in the embedding space.
This paper proposes a method for learning a joint embedding of images and texts with an intra-modal constraint loss that reduces violations caused by negative pairs from the same modality.
arXiv Detail & Related papers (2022-07-11T17:21:25Z) - Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval [19.161248757493386]
We propose TAiloring neGative Sentences with Discrimination and Correction (TAGS-DC), which automatically generates synthetic sentences as negative samples.
To maintain their difficulty during training, we mutually improve retrieval and generation through parameter sharing.
In experiments, we verify the effectiveness of our model on MS-COCO and Flickr30K compared with current state-of-the-art models.
arXiv Detail & Related papers (2021-11-05T09:36:41Z) - Delving into Localization Errors for Monocular 3D Object Detection [85.77319416168362]
Estimating 3D bounding boxes from monocular images is an essential component in autonomous driving.
In this work, we quantify the impact introduced by each sub-task and find that 'localization error' is the vital factor restricting monocular 3D detection.
arXiv Detail & Related papers (2021-03-30T10:38:01Z) - Weakly Supervised Generative Network for Multiple 3D Human Pose Hypotheses [74.48263583706712]
3D human pose estimation from a single image is an inverse problem due to the inherent ambiguity of the missing depth.
We propose a weakly supervised deep generative network to address the inverse problem.
arXiv Detail & Related papers (2020-08-13T09:26:01Z)