The DiffuseStyleGesture+ entry to the GENEA Challenge 2023
- URL: http://arxiv.org/abs/2308.13879v1
- Date: Sat, 26 Aug 2023 13:34:17 GMT
- Title: The DiffuseStyleGesture+ entry to the GENEA Challenge 2023
- Authors: Sicheng Yang, Haiwei Xue, Zhensong Zhang, Minglei Li, Zhiyong Wu,
Xiaofei Wu, Songcen Xu, Zonghong Dai
- Abstract summary: We introduce the DiffuseStyleGesture+, our solution for the Generation and Evaluation of Non-verbal Behavior for Embodied Agents (GENEA) Challenge 2023.
Our proposed model, DiffuseStyleGesture+, leverages a diffusion model to generate gestures automatically.
It incorporates a variety of modalities, including audio, text, speaker ID, and seed gestures.
- Score: 16.297790031478634
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce the DiffuseStyleGesture+, our solution for the
Generation and Evaluation of Non-verbal Behavior for Embodied Agents (GENEA)
Challenge 2023, which aims to foster the development of realistic, automated
systems for generating conversational gestures. Participants are provided with
a pre-processed dataset and their systems are evaluated through crowdsourced
scoring. Our proposed model, DiffuseStyleGesture+, leverages a diffusion model
to generate gestures automatically. It incorporates a variety of modalities,
including audio, text, speaker ID, and seed gestures. These diverse modalities
are mapped to a hidden space and processed by a modified diffusion model to
produce the corresponding gesture for a given speech input. Upon evaluation,
the DiffuseStyleGesture+ demonstrated performance on par with the top-tier
models in the challenge, showing no significant differences from those models
in human-likeness or appropriateness for the interlocutor, and achieving
competitive performance with the best model on appropriateness for agent
speech. This indicates that our model is competitive and effective in
generating realistic and appropriate gestures for given speech. The code,
pre-trained models, and demos are available at
https://github.com/YoungSeng/DiffuseStyleGesture/tree/DiffuseStyleGesturePlus/BEAT-TWH-main.
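For a concrete picture of the pipeline sketched in the abstract, below is a minimal, hypothetical PyTorch sketch (not the authors' implementation; all module names, dimensions, and the noise schedule are illustrative assumptions) of how audio, text, speaker ID, and seed gestures could be projected into a shared hidden space and used to condition a transformer denoiser that predicts the clean gesture sequence in one diffusion training step.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the real feature extractors, dimensions, and
# denoiser architecture are defined in the authors' repository.
AUDIO_DIM, TEXT_DIM, HIDDEN, N_SPEAKERS = 1024, 300, 256, 17
GESTURE_DIM, SEQ_LEN, SEED_FRAMES = 204, 88, 8

class MultimodalConditioner(nn.Module):
    """Projects each modality into a shared hidden space and fuses them."""
    def __init__(self):
        super().__init__()
        self.audio_proj = nn.Linear(AUDIO_DIM, HIDDEN)
        self.text_proj = nn.Linear(TEXT_DIM, HIDDEN)
        self.speaker_emb = nn.Embedding(N_SPEAKERS, HIDDEN)
        self.seed_proj = nn.Linear(SEED_FRAMES * GESTURE_DIM, HIDDEN)

    def forward(self, audio, text, speaker_id, seed_gesture):
        # audio/text: (B, T, feat); speaker_id: (B,); seed: (B, SEED_FRAMES, GESTURE_DIM)
        cond = self.audio_proj(audio) + self.text_proj(text)           # (B, T, H)
        cond = cond + self.speaker_emb(speaker_id)[:, None, :]         # broadcast over time
        cond = cond + self.seed_proj(seed_gesture.flatten(1))[:, None, :]
        return cond

class GestureDenoiser(nn.Module):
    """Predicts the clean gesture sequence from a noisy input and the condition."""
    def __init__(self):
        super().__init__()
        self.in_proj = nn.Linear(GESTURE_DIM, HIDDEN)
        self.time_emb = nn.Embedding(1000, HIDDEN)
        layer = nn.TransformerEncoderLayer(HIDDEN, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.out_proj = nn.Linear(HIDDEN, GESTURE_DIM)

    def forward(self, noisy_gesture, t, cond):
        h = self.in_proj(noisy_gesture) + cond + self.time_emb(t)[:, None, :]
        return self.out_proj(self.backbone(h))

# One denoising training step on random stand-in data.
B = 2
conditioner, denoiser = MultimodalConditioner(), GestureDenoiser()
audio = torch.randn(B, SEQ_LEN, AUDIO_DIM)
text = torch.randn(B, SEQ_LEN, TEXT_DIM)
speaker_id = torch.randint(0, N_SPEAKERS, (B,))
seed = torch.randn(B, SEED_FRAMES, GESTURE_DIM)
x0 = torch.randn(B, SEQ_LEN, GESTURE_DIM)            # ground-truth gesture features

t = torch.randint(0, 1000, (B,))
alpha_bar = torch.linspace(0.9999, 0.01, 1000)[t].view(B, 1, 1)  # toy noise schedule
noise = torch.randn_like(x0)
x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise     # forward diffusion

cond = conditioner(audio, text, speaker_id, seed)
x0_pred = denoiser(x_t, t, cond)
loss = nn.functional.mse_loss(x0_pred, x0)
loss.backward()
print(loss.item())
```

At inference time one would iterate such denoising steps starting from pure noise, conditioned on the speech input; the seed-gesture input is commonly taken from the tail of the previously generated segment to keep long sequences continuous. The actual architecture, features, and training details are available in the repository linked above.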
Related papers
- Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens [53.99177152562075]
Scaling up autoregressive models in vision has not proven as beneficial as in large language models.
We focus on two critical factors: whether models use discrete or continuous tokens, and whether tokens are generated in a random or fixed order using BERT- or GPT-like transformer architectures.
Our results show that while all models scale effectively in terms of validation loss, their evaluation performance -- measured by FID, GenEval score, and visual quality -- follows different trends.
arXiv Detail & Related papers (2024-10-17T17:59:59Z)
- From Modular to End-to-End Speaker Diarization [3.079020586262228]
We describe a system based on a Bayesian hidden Markov model used to cluster x-vectors (speaker embeddings obtained with a neural network), known as VBx.
We describe an approach for generating synthetic data which resembles real conversations in terms of speaker turns and overlaps.
We show how this method of generating "simulated conversations" allows for better performance than a previously proposed method for creating "simulated mixtures" when training the popular EEND.
arXiv Detail & Related papers (2024-06-27T15:09:39Z)
- Speech-driven Personalized Gesture Synthetics: Harnessing Automatic Fuzzy Feature Inference [5.711221299998126]
Persona-Gestor is a novel end-to-end generative model designed to generate highly personalized 3D full-body gestures.
The model combines a fuzzy feature extractor and a non-autoregressive Adaptive Layer Normalization (AdaLN) transformer diffusion architecture.
Persona-Gestor improves the system's usability and generalization capabilities.
arXiv Detail & Related papers (2024-03-16T04:40:10Z)
- Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors [56.82596340418697]
We propose a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors.
Comprehensive investigations unveil potential characteristics of Vermouth, such as varying granularity of perception concealed in latent variables at distinct time steps and various U-net stages.
The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.
arXiv Detail & Related papers (2024-01-29T10:36:57Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation [72.85685916829321]
DiffSHEG is a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation with arbitrary length.
By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
arXiv Detail & Related papers (2024-01-09T11:38:18Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Generative Pre-training for Speech with Flow Matching [81.59952572752248]
We pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions.
Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis.
arXiv Detail & Related papers (2023-10-25T03:40:50Z)
- Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation [18.04996323708772]
This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023.
We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture.
The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model.
arXiv Detail & Related papers (2023-09-11T13:51:06Z)
- Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling [13.956691231452336]
FaST-VGS is a Transformer-based model that learns to associate raw speech waveforms with semantically related images.
FaST-VGS+ is learned in a multi-task fashion with a masked language modeling objective.
We show that our models perform competitively on the ABX task, outperform all other concurrent submissions on the Syntactic and Semantic tasks, and nearly match the best system on the Lexical task.
arXiv Detail & Related papers (2022-02-07T22:09:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.