Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis
- URL: http://arxiv.org/abs/2412.06786v1
- Date: Mon, 09 Dec 2024 18:59:46 GMT
- Title: Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis
- Authors: M. Hamza Mughal, Rishabh Dabral, Merel C. J. Scholman, Vera Demberg, Christian Theobalt
- Abstract summary: RAG-Gesture is a diffusion-based gesture generation approach to produce semantically rich gestures.
We achieve this by using explicit domain knowledge to retrieve motions from a database of co-speech gestures.
We propose a control paradigm for guidance that allows users to modulate the amount of influence each retrieval insertion has over the generated sequence.
- Score: 55.45253486141108
- License:
- Abstract: Non-verbal communication often comprises semantically rich gestures that help convey the meaning of an utterance. Producing such semantic co-speech gestures has been a major challenge for existing neural systems, which can generate rhythmic beat gestures but struggle to produce semantically meaningful ones. Therefore, we present RAG-Gesture, a diffusion-based gesture generation approach that leverages Retrieval Augmented Generation (RAG) to produce natural-looking and semantically rich gestures. Our neuro-explicit gesture generation approach is designed to produce semantic gestures grounded in interpretable linguistic knowledge. We achieve this by using explicit domain knowledge to retrieve exemplar motions from a database of co-speech gestures. Once retrieved, we inject these semantic exemplar gestures into our diffusion-based gesture generation pipeline using DDIM inversion and retrieval guidance at inference time, without any need for training. Further, we propose a control paradigm for guidance that allows users to modulate the amount of influence each retrieval insertion has over the generated sequence. Our comparative evaluations demonstrate the validity of our approach against recent gesture generation approaches. The reader is urged to explore the results on our project page.
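The abstract outlines an inference-time pipeline: retrieved exemplar gestures are DDIM-inverted, and their latents steer the sampling of the diffusion model, with a per-retrieval weight controlling how strongly each insertion is felt. Below is a minimal sketch of how such retrieval guidance could be wired up, assuming a PyTorch denoiser `model(x, t, cond)`, motions shaped `(batch, frames, features)`, and per-exemplar frame masks; all names and the latent-blending rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of inference-time retrieval guidance for a gesture diffusion
# model. Assumes motions of shape (batch, frames, features), a denoiser
# model(x, t, cond), and integer timesteps indexing alphas_cumprod.
# All names and the blending rule are illustrative assumptions.
import torch

@torch.no_grad()
def ddim_invert(model, x0, cond, timesteps, alphas_cumprod):
    """Run DDIM deterministically in reverse (clean motion -> noise) and keep
    the latent of a retrieved exemplar gesture at every noise level."""
    x, latents = x0, [x0]
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = model(x, t_cur, cond)                              # predicted noise
        x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps  # step toward noise
        latents.append(x)
    return latents                                               # latents[i] ~ timesteps[i]

@torch.no_grad()
def retrieval_guided_sample(model, speech_cond, exemplars, weights,
                            timesteps, alphas_cumprod):
    """Sample a gesture sequence while pulling the frames covered by each
    retrieved exemplar toward that exemplar's inverted latents; `weights`
    modulates how strongly each retrieval influences the output."""
    x = torch.randn_like(exemplars[0]["motion"])
    inverted = [ddim_invert(model, ex["motion"], speech_cond,
                            timesteps, alphas_cumprod) for ex in exemplars]
    for i in range(len(timesteps) - 1, -1, -1):
        t = timesteps[i]
        # Retrieval guidance: blend toward each exemplar's latent at the same
        # noise level, only on the frames that retrieval covers.
        for ex, lat, w in zip(exemplars, inverted, weights):
            m = ex["frame_mask"]                                 # boolean mask over frames
            x[:, m] = (1 - w) * x[:, m] + w * lat[i][:, m]
        eps = model(x, t, speech_cond)
        a_t = alphas_cumprod[t]
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        a_prev = alphas_cumprod[timesteps[i - 1]] if i > 0 else torch.tensor(1.0)
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps  # DDIM update
    return x
```

Blending in latent space, rather than copying exemplar poses directly into the output, lets the denoiser reconcile each inserted exemplar with the surrounding motion and the speech conditioning; the per-exemplar weight `w` plays the role of the user-controllable guidance strength described in the abstract.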
Related papers
- Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition [96.62264528407863]
We propose a self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency.
Inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling.
Our method is evaluated with extensive experiments on four public benchmarks, and achieves new state-of-the-art performance with a notable margin.
arXiv Detail & Related papers (2024-06-15T04:50:19Z) - Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis [25.822870767380685]
We present Semantic Gesticulator, a framework designed to synthesize realistic gestures with strong semantic correspondence.
Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit.
Our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.
arXiv Detail & Related papers (2024-05-16T05:09:01Z) - ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z) - DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation [72.85685916829321]
DiffSHEG is a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation of arbitrary length.
By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
arXiv Detail & Related papers (2024-01-09T11:38:18Z) - Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation [18.04996323708772]
This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023.
We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture.
The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model; a sketch of such a contrastive embedding appears after this list.
arXiv Detail & Related papers (2023-09-11T13:51:06Z) - AQ-GT: a Temporally Aligned and Quantized GRU-Transformer for Co-Speech Gesture Synthesis [0.0]
We present an approach to pre-train partial gesture sequences using a generative adversarial network with a quantization pipeline.
By learning a mapping to a latent space representation, rather than mapping directly to a vector representation, this framework facilitates the generation of highly realistic and expressive gestures.
arXiv Detail & Related papers (2023-05-02T07:59:38Z) - GesGPT: Speech Gesture Synthesis With Text Parsing from ChatGPT [8.18076191897917]
Gesture synthesis has gained significant attention as a critical research field.
Deep learning-based approaches have achieved remarkable progress, but they often overlook the rich semantic information present in the text.
We propose GesGPT, a novel approach to gesture generation that leverages the semantic analysis capabilities of large language models.
arXiv Detail & Related papers (2023-03-23T03:30:30Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z) - Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models.
We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z) - Gesticulator: A framework for semantically-aware speech-driven gesture generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.
The resulting gestures can be applied to both virtual agents and humanoid robots.
arXiv Detail & Related papers (2020-01-25T14:42:23Z)
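For the contrastive speech-and-motion pretraining idea referenced in the CSMP entry above, a joint embedding can be learned CLIP-style and then reused as the conditioning signal of a diffusion model. The sketch below is an assumption-laden illustration: the GRU encoders, dimensions, and fixed temperature are placeholders, not that challenge entry's actual architecture.

```python
# Illustrative CLIP-style contrastive embedding for paired speech and gesture
# clips, in the spirit of a CSMP-like module. Encoders, dimensions, and the
# temperature value are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechMotionEmbedder(nn.Module):
    def __init__(self, speech_dim=80, motion_dim=165, embed_dim=256):
        super().__init__()
        # Simple GRU encoders stand in for whatever encoders are actually used.
        self.speech_enc = nn.GRU(speech_dim, embed_dim, batch_first=True)
        self.motion_enc = nn.GRU(motion_dim, embed_dim, batch_first=True)
        self.temperature = 0.07

    def encode(self, speech, motion):
        # Use the final hidden state of each encoder as a clip-level embedding.
        _, hs = self.speech_enc(speech)    # hs: (num_layers, B, D)
        _, hm = self.motion_enc(motion)
        return F.normalize(hs[-1], dim=-1), F.normalize(hm[-1], dim=-1)

    def contrastive_loss(self, speech, motion):
        """Symmetric InfoNCE: matching speech/gesture clips attract, all other
        pairings within the batch repel."""
        zs, zm = self.encode(speech, motion)               # (B, D) each
        logits = zs @ zm.t() / self.temperature            # (B, B) similarities
        targets = torch.arange(zs.size(0), device=zs.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))
```

At synthesis time the speech-side embedding would then be passed to the diffusion denoiser as its conditioning signal, in place of raw audio features.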
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.