Multimodal analysis of the predictability of hand-gesture properties
- URL: http://arxiv.org/abs/2108.05762v1
- Date: Thu, 12 Aug 2021 14:16:00 GMT
- Title: Multimodal analysis of the predictability of hand-gesture properties
- Authors: Taras Kucherenko, Rajmund Nagy, Michael Neff, Hedvig Kjellström, Gustav Eje Henter
- Abstract summary: Embodied conversational agents benefit from being able to accompany their speech with gestures.
We investigate which gesture properties can be predicted from speech text and/or audio using contemporary deep learning.
- Score: 10.332200713176768
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Embodied conversational agents benefit from being able to accompany their
speech with gestures. Although many data-driven approaches to gesture
generation have been proposed in recent years, it is still unclear whether such
systems can consistently generate gestures that convey meaning. We investigate
which gesture properties (phase, category, and semantics) can be predicted from
speech text and/or audio using contemporary deep learning. In extensive
experiments, we show that gesture properties related to gesture meaning
(semantics and category) are predictable from text features (time-aligned BERT
embeddings) alone, but not from prosodic audio features, while rhythm-related
gesture properties (phase) on the other hand can be predicted from either
audio, text (with word-level timing information), or both. These results are
encouraging as they indicate that it is possible to equip an embodied agent
with content-wise meaningful co-speech gestures using a machine-learning model.
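The abstract frames the task as predicting gesture properties (phase, category, semantics) from time-aligned BERT text embeddings and/or prosodic audio features. As a rough illustration only (not the authors' code), the sketch below shows one way such a per-frame property classifier could be set up in PyTorch; the feature dimensions, frame count, number of classes, and the GRU encoder are illustrative assumptions, not details from the paper.

```python
# Minimal sketch (assumptions, not the paper's implementation):
# a per-frame gesture-property classifier conditioned on text and/or audio.
# Assumed dims: 768-dim time-aligned BERT embeddings, 4-dim prosodic
# features (e.g., pitch/energy and deltas), 5 gesture-phase classes.
import torch
import torch.nn as nn

class GesturePropertyPredictor(nn.Module):
    def __init__(self, text_dim=768, audio_dim=4, hidden_dim=128,
                 num_classes=5, use_text=True, use_audio=True):
        super().__init__()
        self.use_text, self.use_audio = use_text, use_audio
        in_dim = (text_dim if use_text else 0) + (audio_dim if use_audio else 0)
        # Small bidirectional recurrent encoder over the frame sequence.
        self.encoder = nn.GRU(in_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, text_feats, audio_feats):
        # text_feats:  (batch, frames, text_dim)  time-aligned BERT embeddings
        # audio_feats: (batch, frames, audio_dim) prosodic audio features
        feats = []
        if self.use_text:
            feats.append(text_feats)
        if self.use_audio:
            feats.append(audio_feats)
        x = torch.cat(feats, dim=-1)
        h, _ = self.encoder(x)
        return self.head(h)  # per-frame class logits

# Usage example: predict gesture phase from both modalities
# for a hypothetical 3-second clip at 20 fps (60 frames).
model = GesturePropertyPredictor(use_text=True, use_audio=True)
logits = model(torch.randn(1, 60, 768), torch.randn(1, 60, 4))
print(logits.shape)  # torch.Size([1, 60, 5])
```

Toggling `use_text` / `use_audio` mirrors the paper's comparison of text-only, audio-only, and combined inputs, though the actual model architecture used in the study may differ.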
Related papers
- Integrating Representational Gestures into Automatically Generated Embodied Explanations and its Effects on Understanding and Interaction Quality [0.0]
This study investigates how different types of gestures influence perceived interaction quality and listener understanding.
Our model combines beat gestures generated by a learned speech-driven module with manually captured iconic gestures.
Findings indicate that neither the use of iconic gestures alone nor their combination with beat gestures outperforms the baseline or beat-only conditions in terms of understanding.
arXiv Detail & Related papers (2024-06-18T12:23:00Z)
- Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis [25.822870767380685]
We present Semantic Gesticulator, a framework designed to synthesize realistic gestures with strong semantic correspondence.
Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit.
Our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.
arXiv Detail & Related papers (2024-05-16T05:09:01Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
The Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT) model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z)
- Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts the listener's response: a sequence of listener facial gestures, quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition [48.56414496900755]
This work uses a neural implementation of convolutive sparse matrix factorization to decompose the articulatory data into interpretable gestures and gestural scores.
Phoneme recognition experiments were additionally performed to show that gestural scores indeed code phonological information successfully.
arXiv Detail & Related papers (2022-04-01T14:25:19Z)
- Speech2Properties2Gestures: Gesture-Property Prediction as a Tool for Generating Representational Gestures from Speech [9.859003149671807]
We propose a new framework for gesture generation, aiming to allow data-driven approaches to produce semantically rich gestures.
Our approach first predicts whether to gesture, followed by a prediction of the gesture properties.
arXiv Detail & Related papers (2021-06-28T14:07:59Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
- Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity [21.61168067832304]
We present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures.
Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models.
arXiv Detail & Related papers (2020-09-04T11:42:45Z)
- Gesticulator: A framework for semantically-aware speech-driven gesture generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.
The resulting gestures can be applied to both virtual agents and humanoid robots.
arXiv Detail & Related papers (2020-01-25T14:42:23Z)