Sequence-to-Sequence Predictive Model: From Prosody To Communicative
Gestures
- URL: http://arxiv.org/abs/2008.07643v2
- Date: Fri, 23 Apr 2021 21:03:40 GMT
- Title: Sequence-to-Sequence Predictive Model: From Prosody To Communicative
Gestures
- Authors: Fajrian Yunus, Chloé Clavel, Catherine Pelachaud
- Abstract summary: We develop a model based on a recurrent neural network with an attention mechanism.
We find that the model predicts certain gesture classes better than others.
We also find that a model trained on the data of one speaker also works for the other speaker in the same conversation.
- Score: 2.578242050187029
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Communicative gestures and speech acoustics are tightly linked. Our objective
is to predict the timing of gestures from the acoustics, that is, to predict when a
certain gesture occurs. We develop a model based on a recurrent neural network with an
attention mechanism. The model is trained on a corpus of natural dyadic interaction in
which the speech acoustics and the gesture phases and types have been annotated. The
input of the model is a sequence of speech acoustic features and the output is a
sequence of gesture classes; the classes used for the model output are based on a
combination of gesture phases and gesture types. We use a sequence comparison technique
to evaluate the model performance. We find that the model predicts certain gesture
classes better than others. We also perform ablation studies, which reveal that
fundamental frequency is a relevant feature for the gesture prediction task. In another
sub-experiment, we find that treating eyebrow movements as beat gestures improves the
performance. We also find that a model trained on the data of one speaker also works for
the other speaker in the same conversation. Finally, we perform a subjective experiment
to measure how respondents judge the naturalness, time consistency, and semantic
consistency of the generated gesture timing of a virtual agent. Our respondents rate the
output of our model favorably.
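As a rough illustration of the setup the abstract describes, here is a minimal sequence-to-sequence sketch in PyTorch. It is a hypothetical reconstruction, not the authors' implementation: the acoustic feature set (here just two values per frame, standing in for features such as F0), the number of combined phase/type gesture classes, and all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

N_ACOUSTIC = 2   # e.g. F0 and intensity per frame -- illustrative assumption
N_CLASSES = 5    # combined gesture phase/type classes -- illustrative assumption
HIDDEN = 128

class ProsodyToGesture(nn.Module):
    """Encoder-attention-decoder over acoustic frames, emitting one gesture
    class per output step (a sketch, not the paper's exact architecture)."""
    def __init__(self, n_acoustic=N_ACOUSTIC, n_classes=N_CLASSES, hidden=HIDDEN):
        super().__init__()
        self.hidden = hidden
        self.n_classes = n_classes
        self.encoder = nn.GRU(n_acoustic, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden + hidden, 1)        # additive attention score
        self.decoder = nn.GRUCell(2 * hidden + n_classes, hidden)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, acoustic, out_len):
        # acoustic: (batch, T_in, n_acoustic) -> logits: (batch, out_len, n_classes)
        enc, _ = self.encoder(acoustic)                       # (B, T_in, 2*hidden)
        B, T_in, _ = enc.shape
        h = enc.new_zeros(B, self.hidden)                     # decoder state
        prev = enc.new_zeros(B, self.n_classes)               # previous class distribution
        logits = []
        for _ in range(out_len):
            # score every encoder frame against the current decoder state
            scores = self.attn(torch.cat([enc, h.unsqueeze(1).expand(B, T_in, -1)], dim=-1))
            weights = torch.softmax(scores, dim=1)            # (B, T_in, 1)
            context = (weights * enc).sum(dim=1)              # attention-weighted summary
            h = self.decoder(torch.cat([context, prev], dim=-1), h)
            step_logits = self.out(h)
            prev = torch.softmax(step_logits, dim=-1)
            logits.append(step_logits)
        return torch.stack(logits, dim=1)                     # (B, out_len, n_classes)

# Example usage with random data: 4 utterances, 200 acoustic frames each,
# decoded into 50 gesture-class decisions per utterance.
model = ProsodyToGesture()
pred = model(torch.randn(4, 200, N_ACOUSTIC), out_len=50)
```

The abstract evaluates with a "sequence comparison technique" between predicted and annotated gesture class sequences; one common choice for that kind of comparison is an edit distance, sketched below purely as an example (the paper's exact metric is not specified here).

```python
def levenshtein(pred, ref):
    """Edit distance between two gesture-class label sequences."""
    dp = list(range(len(ref) + 1))
    for i, p in enumerate(pred, 1):
        prev, dp[0] = dp[0], i
        for j, r in enumerate(ref, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (p != r))  # substitution
    return dp[-1]
```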
Related papers
- Counterfactual Generation from Language Models [64.55296662926919]
We show that counterfactual reasoning is conceptually distinct from interventions.
We propose a framework for generating true string counterfactuals.
Our experiments demonstrate that the approach produces meaningful counterfactuals.
arXiv Detail & Related papers (2024-11-11T17:57:30Z) - Iconic Gesture Semantics [87.00251241246136]
We argue that the perceptual classification of instances of visual communication requires a notion of meaning different from Frege/Montague frameworks.
Informational evaluation is spelled out as extended exemplification (extemplification) in terms of the perceptual classification of a gesture's visual iconic model.
An iconic gesture semantics is introduced which covers the full range from gesture representations over model-theoretic evaluation to inferential interpretation in dynamic semantic frameworks.
arXiv Detail & Related papers (2024-04-29T13:58:03Z) - ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z) - Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion
Models [22.000197530493445]
We show that diffusion models are an excellent fit for synthesising human motion that co-occurs with audio.
We adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power.
Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality.
arXiv Detail & Related papers (2022-11-17T17:41:00Z) - Speech Drives Templates: Co-Speech Gesture Synthesis with Learned
Templates [30.32106465591015]
Co-speech gesture generation aims to synthesize a gesture sequence that not only looks real but also matches the input speech audio.
Our method generates the movements of a complete upper body, including arms, hands, and the head.
arXiv Detail & Related papers (2021-08-18T07:53:36Z) - Facetron: Multi-speaker Face-to-Speech Model based on Cross-modal Latent
Representations [22.14238843571225]
We propose an effective method to synthesize speaker-specific speech waveforms by conditioning on videos of an individual's face.
The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images.
We show the superiority of our proposed model over conventional methods in terms of both objective and subjective evaluation results.
arXiv Detail & Related papers (2021-07-26T07:36:02Z) - Multi-level Motion Attention for Human Motion Prediction [132.29963836262394]
We study the use of different types of attention, computed at joint, body part, and full pose levels.
Our experiments on Human3.6M, AMASS and 3DPW validate the benefits of our approach for both periodical and non-periodical actions.
arXiv Detail & Related papers (2021-06-17T08:08:11Z) - Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis [68.76620947298595]
Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text.
We propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody.
arXiv Detail & Related papers (2021-06-15T18:03:48Z) - Speech Gesture Generation from the Trimodal Context of Text, Audio, and
Speaker Identity [21.61168067832304]
We present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures.
Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models.
arXiv Detail & Related papers (2020-09-04T11:42:45Z) - Incremental Text to Speech for Neural Sequence-to-Sequence Models using
Reinforcement Learning [60.20205278845412]
Modern approaches to text-to-speech require the entire input character sequence to be processed before any audio is synthesised.
This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation.
We propose a reinforcement learning based framework to train an agent to decide whether enough of the input has been received to start producing audio.
arXiv Detail & Related papers (2020-08-07T11:48:05Z) - Gesticulator: A framework for semantically-aware speech-driven gesture
generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.
The resulting gestures can be applied to both virtual agents and humanoid robots.
arXiv Detail & Related papers (2020-01-25T14:42:23Z)