BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for
Conversational Gestures Synthesis
- URL: http://arxiv.org/abs/2203.05297v2
- Date: Fri, 11 Mar 2022 16:19:50 GMT
- Title: BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for
Conversational Gestures Synthesis
- Authors: Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You
Zhou, Elif Bozkurt, Bo Zheng
- Abstract summary: The Body-Expression-Audio-Text (BEAT) dataset has 76 hours of high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages.
BEAT is the largest motion capture dataset for investigating human gestures.
- Score: 9.95713767110021
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Achieving realistic, vivid, and human-like synthesized conversational
gestures conditioned on multi-modal data is still an unsolved problem, due to
the lack of available datasets, models and standard evaluation metrics. To
address this, we build the Body-Expression-Audio-Text dataset (BEAT), which has i)
76 hours of high-quality, multi-modal data captured from 30 speakers talking with
eight different emotions and in four different languages, and ii) 32 million
frame-level emotion and semantic relevance annotations. Our statistical analysis
on BEAT demonstrates the correlation of conversational gestures with facial
expressions, emotions, and semantics, in addition to the known correlation with
audio, text, and speaker identity. Qualitative and quantitative experiments
demonstrate the validity of the metrics, the quality of the ground-truth data, and the
state-of-the-art performance of the baseline. To the best of our knowledge, BEAT is the
largest motion capture dataset for investigating human gestures, and it may
contribute to a number of research fields, including controllable gesture synthesis,
cross-modality analysis, and emotional gesture recognition. The
data, code and model will be released for research.
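To make the data description above concrete, the following is a minimal, hypothetical Python sketch of how one frame-aligned, multi-modal record of this kind could be represented; the class and field names (GestureClip, semantic_relevance, and so on) are illustrative assumptions, not BEAT's actual schema or file format.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical, illustrative record for one frame of a BEAT-style multi-modal
# capture; field names are assumptions, not the real dataset schema.
@dataclass
class GestureFrame:
    timestamp: float                 # seconds from the start of the take
    joint_rotations: List[float]     # motion-capture body pose for this frame
    blendshape_weights: List[float]  # facial expression coefficients
    emotion_label: str               # one of the eight recorded emotions
    semantic_relevance: float        # frame-level gesture/speech semantic score in [0, 1]

@dataclass
class GestureClip:
    speaker_id: int                  # one of the 30 speakers
    language: str                    # one of the four recorded languages
    transcript: str                  # aligned text for the clip
    audio_path: str                  # path to the raw speech audio
    frames: List[GestureFrame]       # the corpus has 32 million frame-level annotations in total
```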
Related papers
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
- Speech and Text-Based Emotion Recognizer [0.9168634432094885]
We build a balanced corpus from publicly available datasets for speech emotion recognition.
Our best system, a multi-modal speech- and text-based model, achieves a combined UA (Unweighted Accuracy) + WA (Weighted Accuracy) score of 157.57, compared to 119.66 for the baseline algorithm.
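As a reference for the reported numbers, here is a small sketch of how the two accuracies are usually computed in speech emotion recognition, assuming the common definitions (UA as the unweighted mean of per-class recalls, WA as overall accuracy); the labels below are made up for illustration.

```python
import numpy as np

def ua_wa(y_true, y_pred):
    """Unweighted accuracy (mean per-class recall) and weighted accuracy
    (overall accuracy), both in percent, as commonly defined in SER."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    per_class_recall = [np.mean(y_pred[y_true == c] == c) for c in classes]
    ua = 100.0 * float(np.mean(per_class_recall))
    wa = 100.0 * float(np.mean(y_pred == y_true))
    return ua, wa

# Toy example with made-up labels (0=neutral, 1=happy, 2=angry).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
ua, wa = ua_wa(y_true, y_pred)
print(f"UA={ua:.2f}  WA={wa:.2f}  UA+WA={ua + wa:.2f}")
```

Under these definitions each term is a percentage, so the maximum attainable UA + WA is 200, which puts the reported 157.57 versus 119.66 in context.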
arXiv Detail & Related papers (2023-12-10T05:17:39Z)
- Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications [20.842799581850617]
We consider the task of animating 3D facial geometry from a speech signal.
Existing works are primarily deterministic, focusing on learning a one-to-one mapping from the speech signal to 3D face meshes on small datasets with a limited number of speakers.
arXiv Detail & Related papers (2023-11-30T01:14:43Z)
- Towards Generalizable SER: Soft Labeling and Data Augmentation for Modeling Temporal Emotion Shifts in Large-Scale Multilingual Speech [3.86122440373248]
We propose a soft labeling system to capture gradational emotional intensities.
Using the Whisper encoder and data augmentation methods inspired by contrastive learning, our method emphasizes the temporal dynamics of emotions.
We publish our open source model weights and initial promising results after fine-tuning on Hume-Prosody.
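The summary does not specify the loss, so the following is a minimal, assumed sketch of training against soft (gradational) emotion targets instead of one-hot labels; the class set, intensity values, and classifier stand-in are invented for illustration and do not reproduce the paper's setup.

```python
import torch
import torch.nn.functional as F

# Assumed soft-label setup: instead of a one-hot target, each utterance gets a
# probability distribution over emotion classes that encodes gradations of intensity.
num_classes = 4                                           # e.g. neutral, happy, sad, angry (illustrative)
logits = torch.randn(2, num_classes, requires_grad=True)  # stand-in for classifier outputs

# A strongly happy clip vs. a mildly happy, mostly neutral clip (made-up values).
soft_targets = torch.tensor([
    [0.05, 0.90, 0.03, 0.02],
    [0.60, 0.35, 0.03, 0.02],
])

# Cross-entropy against the soft targets (equivalent to KL divergence up to a constant).
loss = -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
loss.backward()
```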
arXiv Detail & Related papers (2023-11-15T00:09:21Z)
- Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model [2.827070255699381]
diffmotion-v2 is a speech-conditioned, diffusion-based generative model built on the pre-trained WavLM model.
It can produce individual and stylized full-body co-speech gestures using only raw speech audio.
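Since the summary only names the overall design, here is a heavily simplified, assumed sketch of one training step of a speech-conditioned diffusion model for gestures; audio_feats stands in for frame-aligned WavLM features, and the shapes, noise schedule, and denoiser architecture are illustrative placeholders rather than diffmotion-v2's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GestureDenoiser(nn.Module):
    """Toy denoiser: predicts the noise added to a gesture sequence,
    conditioned on speech features (e.g. WavLM embeddings) and a noise level."""
    def __init__(self, pose_dim=48, audio_dim=768, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + audio_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, noisy_pose, audio_feats, t):
        # Broadcast the noise level over the time dimension and concatenate.
        t_feat = t.view(-1, 1, 1).expand(-1, noisy_pose.size(1), 1)
        return self.net(torch.cat([noisy_pose, audio_feats, t_feat], dim=-1))

# One illustrative DDPM-style training step with made-up shapes.
B, T, pose_dim, audio_dim = 4, 120, 48, 768
pose = torch.randn(B, T, pose_dim)          # ground-truth gesture sequence
audio_feats = torch.randn(B, T, audio_dim)  # stand-in for frame-aligned WavLM features
t = torch.rand(B)                           # noise level in [0, 1)
alpha = (1.0 - t).view(-1, 1, 1)            # simplistic noise schedule for illustration
noise = torch.randn_like(pose)
noisy_pose = alpha.sqrt() * pose + (1.0 - alpha).sqrt() * noise

model = GestureDenoiser(pose_dim, audio_dim)
loss = F.mse_loss(model(noisy_pose, audio_feats, t), noise)  # noise-prediction objective
loss.backward()
```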
arXiv Detail & Related papers (2023-08-11T08:03:28Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A Survey [71.43956423427397]
We aim to identify the nonverbal cues and computational methodologies resulting in effective performance.
This survey differs from its counterparts by involving the widest spectrum of social phenomena and interaction settings.
Some major observations are: the most frequently used nonverbal cue, computational method, interaction environment, and sensing approach are, respectively, speaking activity, support vector machines, meetings of 3-4 persons, and microphones and cameras.
arXiv Detail & Related papers (2022-07-20T13:37:57Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation [46.8780140220063]
We present a joint audio-text model to capture contextual information for expressive speech-driven 3D facial animation.
Our hypothesis is that the text features can disambiguate the variations in upper face expressions, which are not strongly correlated with the audio.
We show that the combined acoustic and textual modalities can synthesize realistic facial expressions while maintaining audio-lip synchronization.
arXiv Detail & Related papers (2021-12-04T01:37:22Z)
- Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning [63.06044724907101]
We present a generative adversarial network to synthesize 3D pose sequences of co-speech upper-body gestures with appropriate affective expressions.
Our network consists of two components: a generator to synthesize gestures from a joint embedding space of features encoded from the input speech and the seed poses, and a discriminator to distinguish between the synthesized pose sequences and real 3D pose sequences.
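As a rough illustration of the generator/discriminator split described above, here is a minimal, assumed PyTorch sketch; the recurrent modules, feature dimensions, and seed-pose handling are placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class GestureGenerator(nn.Module):
    """Maps a joint speech + seed-pose embedding to a 3D pose sequence."""
    def __init__(self, speech_dim=128, pose_dim=63, hidden=256):
        super().__init__()
        self.gru = nn.GRU(speech_dim + pose_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, speech_feats, seed_pose):
        # Repeat the seed pose along time and fuse it with the speech features.
        seed = seed_pose.unsqueeze(1).expand(-1, speech_feats.size(1), -1)
        h, _ = self.gru(torch.cat([speech_feats, seed], dim=-1))
        return self.out(h)

class GestureDiscriminator(nn.Module):
    """Scores whether a pose sequence looks real or synthesized."""
    def __init__(self, pose_dim=63, hidden=256):
        super().__init__()
        self.gru = nn.GRU(pose_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, poses):
        _, h = self.gru(poses)
        return self.score(h[-1])

# Illustrative shapes: batch of 8, 90 frames, 21 joints x 3 coordinates.
speech_feats = torch.randn(8, 90, 128)
seed_pose = torch.randn(8, 63)
fake = GestureGenerator()(speech_feats, seed_pose)   # (8, 90, 63) synthesized poses
realness = GestureDiscriminator()(fake)              # (8, 1) real/fake scores
```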
arXiv Detail & Related papers (2021-07-31T15:13:39Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotional speech dataset of 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that require additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiments, we first validate the effectiveness of our dataset with an emotion classification task, and then train our model on the proposed dataset and conduct a series of subjective evaluations.
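The summary does not give the model details, so below is a minimal, assumed sketch of the described idea: predict an emotion label from the input text alone and condition the acoustic decoder on the corresponding emotion embedding; every module and dimension here is a placeholder rather than the EMOVIE authors' model.

```python
import torch
import torch.nn as nn

class EmotionConditionedTTS(nn.Module):
    """Toy sketch: an emotion classifier on text features selects an emotion
    embedding that conditions a (here, dummy) acoustic decoder."""
    def __init__(self, vocab=4000, text_dim=256, num_emotions=5, mel_dim=80):
        super().__init__()
        self.text_encoder = nn.Embedding(vocab, text_dim)
        self.emotion_classifier = nn.Linear(text_dim, num_emotions)
        self.emotion_embedding = nn.Embedding(num_emotions, text_dim)
        self.decoder = nn.Linear(text_dim, mel_dim)  # stand-in for a real acoustic decoder

    def forward(self, token_ids):
        text = self.text_encoder(token_ids)                       # (B, T, text_dim)
        emo_logits = self.emotion_classifier(text.mean(dim=1))    # emotion predicted from text only
        emo = self.emotion_embedding(emo_logits.argmax(dim=-1))   # (B, text_dim)
        mel = self.decoder(text + emo.unsqueeze(1))               # condition decoding on the emotion
        return mel, emo_logits

tokens = torch.randint(0, 4000, (2, 17))           # a batch of two tokenized sentences
mel, emo_logits = EmotionConditionedTTS()(tokens)  # (2, 17, 80) mel frames, (2, 5) emotion logits
```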
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.