Motion2Language, unsupervised learning of synchronized semantic motion segmentation
- URL: http://arxiv.org/abs/2310.10594v2
- Date: Wed, 13 Dec 2023 17:29:15 GMT
- Title: Motion2Language, unsupervised learning of synchronized semantic motion segmentation
- Authors: Karim Radouane, Andon Tchechmedjiev, Julien Lagarde, Sylvie Ranwez
- Abstract summary: We investigate building a sequence-to-sequence architecture for motion-to-language translation and synchronization.
The aim is to translate motion capture inputs into English natural-language descriptions, such that the descriptions are generated synchronously with the actions performed.
We propose a new recurrent formulation of local attention that is suited for synchronous/live text generation, as well as an improved motion encoder architecture.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate building a sequence-to-sequence architecture
for motion-to-language translation and synchronization. The aim is to translate
motion capture inputs into English natural-language descriptions, such that the
descriptions are generated synchronously with the actions performed, enabling
semantic segmentation as a byproduct, but without requiring synchronized
training data. We propose a new recurrent formulation of local attention that
is suited for synchronous/live text generation, as well as an improved motion
encoder architecture better suited to smaller datasets and to synchronous
generation. We evaluate both contributions in individual experiments, using the
standard BLEU4 metric, as well as a simple semantic equivalence measure, on the
KIT motion language dataset. In a follow-up experiment, we assess the quality
of the synchronization of generated text in our proposed approaches through
multiple evaluation metrics. We find that our contributions to the attention
mechanism and to the encoder architecture additively improve not only the quality
of the generated text (BLEU and semantic equivalence) but also the quality of the
synchronization.
Our code is available at
https://github.com/rd20karim/M2T-Segmentation/tree/main
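
To make the proposed "recurrent formulation of local attention" more concrete, below is a minimal sketch, assuming a Luong-style local attention window whose centre can only advance over the encoded motion frames, so that words are emitted in step with the motion. All names here (local_attention_step, half-width D, centre p_t) are illustrative; the paper's actual formulation is in the repository above.

```python
# Minimal sketch (not the paper's exact formulation): local attention whose
# window centre moves monotonically forward, so each generated word attends
# to a steadily advancing span of motion frames.
import numpy as np

def local_attention_step(enc_states, query, p_prev, D=8, sigma=None):
    """One decoding step of monotone local attention.

    enc_states: (T, d) encoded motion frames
    query:      (d,)   current decoder hidden state
    p_prev:     float  window centre from the previous step
    D:          int    half-width of the attention window
    """
    T, d = enc_states.shape
    sigma = sigma or D / 2.0
    # Predict a centre offset in (0, 1) and scale it; never move backwards.
    # This monotone constraint is what makes generation synchronous.
    offset = 1.0 / (1.0 + np.exp(-query.mean()))        # stand-in predictor
    p_t = min(p_prev + offset * D, T - 1)               # monotone centre
    lo, hi = int(max(0, p_t - D)), int(min(T, p_t + D + 1))
    window = enc_states[lo:hi]                          # (W, d) local window
    scores = window @ query                             # dot-product scores
    # Gaussian prior keeps the weights focused near the centre p_t.
    positions = np.arange(lo, hi)
    gauss = np.exp(-((positions - p_t) ** 2) / (2 * sigma ** 2))
    weights = np.exp(scores - scores.max()) * gauss
    weights /= weights.sum()
    context = weights @ window                          # (d,) context vector
    return context, p_t

# Usage: carry p_t across decoding steps so the window advances with the text.
enc = np.random.randn(120, 64)   # 120 motion frames, 64-dim encodings
query = np.random.randn(64)
ctx, p = local_attention_step(enc, query, p_prev=0.0)
```

Carrying p_t across decoding steps is what makes the formulation recurrent: because the window can only move forward, the emitted words trace out a segmentation of the motion as a byproduct.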
Related papers
- Transformer with Controlled Attention for Synchronous Motion Captioning
This paper addresses a challenging task, synchronous motion captioning, which aims to generate a language description synchronized with human motion sequences.
Our method introduces mechanisms to control self- and cross-attention distributions of the Transformer, allowing interpretability and time-aligned text generation.
We demonstrate the superior performance of our approach through evaluation on the two available benchmark datasets, KIT-ML and HumanML3D.
arXiv Detail & Related papers (2024-09-13T20:30:29Z)
- An Automatic Quality Metric for Evaluating Simultaneous Interpretation
Simultaneous interpretation (SI) starts translation before the original speech has finished.
We propose an automatic evaluation metric for SI and simultaneous machine translation (SiMT) focusing on word order synchronization (a generic word-order proxy is sketched after this entry).
arXiv Detail & Related papers (2024-07-09T08:21:40Z)
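
The paper defines its own metric; purely as an illustration of what "word order synchronization" can mean, the sketch below scores how monotonically an output follows the source order using Kendall's tau over alignment positions. The alignment itself (which output position each source word maps to) is assumed to be given.

```python
# Illustrative only: a generic word-order proxy, Kendall's tau over alignment
# positions, which rewards outputs that emit words in source order.
from itertools import combinations

def kendall_tau(target_positions):
    """target_positions[i] = output position aligned to source word i."""
    pairs = list(combinations(range(len(target_positions)), 2))
    if not pairs:
        return 1.0
    # +1 for each source-ordered pair, -1 for each reordered pair.
    concordant = sum(
        1 if target_positions[j] > target_positions[i] else -1
        for i, j in pairs
    )
    return concordant / len(pairs)

print(kendall_tau([0, 1, 2, 3]))   # 1.0: perfectly monotone output
print(kendall_tau([3, 2, 1, 0]))   # -1.0: fully reordered output
```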
- Sequence Shortening for Context-Aware Machine Translation
We show that a special case of multi-encoder architecture achieves higher accuracy on contrastive datasets.
We introduce two novel methods, Latent Grouping and Latent Selecting, where the network learns to group tokens or to select the tokens to be cached as context.
arXiv Detail & Related papers (2024-02-02T13:55:37Z)
- Synchformer: Efficient Synchronization from Sparse Cues
Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling.
This approach achieves state-of-the-art performance in both dense and sparse settings.
We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
arXiv Detail & Related papers (2024-01-29T18:59:55Z)
- SemanticBoost: Elevating Motion Generation with Augmented Textual Cues
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z)
- Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation
We propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder.
Experiments on it, es, de -> en prove the effectiveness of our approach, enabling the generation of one-to-many joint outputs with a single decoder for the first time.
arXiv Detail & Related papers (2023-10-23T11:00:27Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language
We propose a novel text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language description.
Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions (one common choice is sketched after this entry).
arXiv Detail & Related papers (2023-05-25T08:32:41Z)
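
As an illustration of one widely-adopted metric-learning loss for such retrieval tasks, the sketch below implements a symmetric InfoNCE objective over paired text/motion embeddings; that this matches the paper's exact choice of losses is an assumption.

```python
# Symmetric InfoNCE over a batch of paired (text, motion) embeddings,
# a common metric-learning objective in cross-modal retrieval.
import numpy as np

def info_nce(text_emb, motion_emb, temperature=0.07):
    """text_emb, motion_emb: (B, d) arrays, assumed L2-normalised and paired."""
    logits = text_emb @ motion_emb.T / temperature   # (B, B) similarities
    B = logits.shape[0]

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)      # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        # The matching pairs sit on the diagonal: they are the positives.
        return -log_probs[np.arange(B), np.arange(B)].mean()

    # Average the text->motion and motion->text retrieval directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Usage with random, L2-normalised embeddings:
rng = np.random.default_rng(0)
t = rng.normal(size=(4, 32)); t /= np.linalg.norm(t, axis=1, keepdims=True)
m = rng.normal(size=(4, 32)); m /= np.linalg.norm(m, axis=1, keepdims=True)
print(info_nce(t, m))
```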
- Neural Machine Translation with Contrastive Translation Memories
Retrieval-augmented Neural Machine Translation models have been successful in many translation scenarios.
We propose a new retrieval-augmented NMT to model contrastively retrieved translation memories that are holistically similar to the source sentence.
In the training phase, a Multi-TM contrastive learning objective is introduced to learn the salient features of each TM with respect to the target sentence.
arXiv Detail & Related papers (2022-12-06T17:10:17Z)
- Bilingual Synchronization: Restoring Translational Relationships with Editing Operations
We consider a more general setting, which assumes an initial target sequence that must be transformed into a valid translation of the source.
Our results suggest that one single generic edit-based system, once fine-tuned, can compare with, or even outperform, dedicated systems specifically trained for these tasks.
arXiv Detail & Related papers (2022-10-24T12:25:44Z)
- Bi-Decoder Augmented Network for Neural Machine Translation
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representations of the input text into its corresponding language, jointly training with two target ends gives the shared encoder the potential to produce a language-independent semantic space (a minimal sketch of this setup follows below).
arXiv Detail & Related papers (2020-01-14T02:05:14Z)
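
A minimal, hypothetical sketch of the bi-decoder idea above: one shared encoder feeds two language-specific decoders, so gradients from both target languages shape a single source representation. The GRU-based module below is a simplification for illustration (attention and actual decoding loops are omitted), not the BiDAN architecture itself.

```python
# Hypothetical shared-encoder, two-decoder setup for joint NMT training.
import torch
import torch.nn as nn

class BiDecoderNMT(nn.Module):
    def __init__(self, vocab_src, vocab_a, vocab_b, d=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_src, d)
        self.encoder = nn.GRU(d, d, batch_first=True)   # one shared encoder
        self.dec_a = nn.GRU(d, d, batch_first=True)     # decoder, language A
        self.dec_b = nn.GRU(d, d, batch_first=True)     # decoder, language B
        self.out_a = nn.Linear(d, vocab_a)
        self.out_b = nn.Linear(d, vocab_b)

    def forward(self, src, tgt_a_emb, tgt_b_emb):
        # src: (B, S) token ids; tgt_*_emb: (B, T, d) teacher-forced inputs.
        _, h = self.encoder(self.embed(src))            # shared semantic state
        ya, _ = self.dec_a(tgt_a_emb, h)
        yb, _ = self.dec_b(tgt_b_emb, h)
        return self.out_a(ya), self.out_b(yb)

# Joint training: loss = ce(logits_a, refs_a) + ce(logits_b, refs_b), so
# gradients from both target languages flow into the single encoder.
```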