GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents
- URL: http://arxiv.org/abs/2303.14613v4
- Date: Mon, 16 Oct 2023 05:37:06 GMT
- Title: GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents
- Authors: Tenglong Ao, Zeyi Zhang, Libin Liu
- Abstract summary: GestureDiffuCLIP is a neural network framework for synthesizing realistic, stylized co-speech gestures with flexible style control.
Our system learns a latent diffusion model to generate high-quality gestures and infuses the CLIP representations of style into the generator.
Our system can be extended to allow fine-grained style control of individual body parts.
- Score: 3.229105662984031
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The automatic generation of stylized co-speech gestures has recently received
increasing attention. Previous systems typically allow style control via
predefined text labels or example motion clips, which are often not flexible
enough to convey user intent accurately. In this work, we present
GestureDiffuCLIP, a neural network framework for synthesizing realistic,
stylized co-speech gestures with flexible style control. We leverage the power
of the large-scale Contrastive-Language-Image-Pre-training (CLIP) model and
present a novel CLIP-guided mechanism that extracts efficient style
representations from multiple input modalities, such as a piece of text, an
example motion clip, or a video. Our system learns a latent diffusion model to
generate high-quality gestures and infuses the CLIP representations of style
into the generator via an adaptive instance normalization (AdaIN) layer. We
further devise a gesture-transcript alignment mechanism that ensures a
semantically correct gesture generation based on contrastive learning. Our
system can also be extended to allow fine-grained style control of individual
body parts. We demonstrate an extensive set of examples showing the flexibility
and generalizability of our model to a variety of style descriptions. In a user
study, we show that our system outperforms the state-of-the-art approaches
regarding human likeness, appropriateness, and style correctness.
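As a rough illustration of the AdaIN-based style injection described in the abstract (a minimal sketch with hypothetical module and variable names, not the authors' implementation), the snippet below shows how a CLIP-derived style vector could modulate the features of a latent gesture denoiser: the style embedding predicts per-channel scale and shift parameters that replace the normalized feature statistics.

```python
# Minimal sketch (assumed names and shapes, not the paper's code): injecting a
# CLIP-style embedding into a gesture generator via adaptive instance normalization.
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Replace per-channel feature statistics with style-predicted scale/shift."""
    def __init__(self, feat_dim: int, style_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm1d(feat_dim, affine=False)
        self.to_scale_shift = nn.Linear(style_dim, 2 * feat_dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim, frames) gesture features; style: (batch, style_dim)
        scale, shift = self.to_scale_shift(style).chunk(2, dim=-1)
        x = self.norm(x)
        return x * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)

class DenoiserBlock(nn.Module):
    """One residual block of a hypothetical latent-diffusion gesture denoiser."""
    def __init__(self, feat_dim: int = 256, style_dim: int = 512):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.adain = AdaIN(feat_dim, style_dim)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.adain(self.conv(x), style))

if __name__ == "__main__":
    feats = torch.randn(2, 256, 90)  # (batch, channels, frames) of latent gesture features
    style = torch.randn(2, 512)      # stand-in for a CLIP text/image/video style embedding
    out = DenoiserBlock()(feats, style)
    print(out.shape)                 # torch.Size([2, 256, 90])
```

In the actual system the style vector would come from CLIP's text, image, or video encoders and the block would sit inside the latent diffusion denoiser; here a random vector stands in for that embedding.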
Related papers
- StyleInject: Parameter Efficient Tuning of Text-to-Image Diffusion Models [35.732715025002705]
StyleInject is a specialized fine-tuning approach tailored for text-to-image models.
It adapts to varying styles by adjusting the variance of visual features based on the characteristics of the input signal.
It proves particularly effective in learning from and enhancing a range of advanced, community-fine-tuned generative models.
arXiv Detail & Related papers (2024-01-25T04:53:03Z)
- UnifiedGesture: A Unified Gesture Synthesis Model for Multiple Skeletons [16.52004713662265]
We present a novel diffusion model-based speech-driven gesture synthesis approach, trained on multiple gesture datasets with different skeletons.
We then capture the correlation between speech and gestures based on a diffusion model architecture using cross-local attention and self-attention.
Experiments show that UnifiedGesture outperforms recent approaches on speech-driven gesture generation in terms of CCA, FGD, and human-likeness.
arXiv Detail & Related papers (2023-09-13T16:07:25Z)
- CALM: Conditional Adversarial Latent Models for Directable Virtual Characters [71.66218592749448]
We present Conditional Adversarial Latent Models (CALM), an approach for generating diverse and directable behaviors for user-controlled interactive virtual characters.
Using imitation learning, CALM learns a representation of movement that captures the complexity of human motion, and enables direct control over character movements.
arXiv Detail & Related papers (2023-05-02T09:01:44Z)
- StylerDALLE: Language-Guided Style Transfer Using a Vector-Quantized Tokenizer of a Large-Scale Generative Model [64.26721402514957]
We propose StylerDALLE, a style transfer method that uses natural language to describe abstract art styles.
Specifically, we formulate the language-guided style transfer task as a non-autoregressive token sequence translation.
To incorporate style information, we propose a Reinforcement Learning strategy with CLIP-based language supervision.
arXiv Detail & Related papers (2023-03-16T12:44:44Z)
- A Unified Arbitrary Style Transfer Framework via Adaptive Contrastive Learning [84.8813842101747]
Unified Contrastive Arbitrary Style Transfer (UCAST) is a novel style representation learning and transfer framework.
We present an adaptive contrastive learning scheme for style transfer by introducing an input-dependent temperature.
Our framework consists of three key components, i.e., a parallel contrastive learning scheme for style representation and style transfer, a domain enhancement module for effective learning of style distribution, and a generative network for style transfer.
arXiv Detail & Related papers (2023-03-09T04:35:00Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech [6.8527462303619195]
We present ZeroEGGS, a neural network framework for speech-driven gesture generation with zero-shot style control by example.
Our model uses a Variational framework to learn a style embedding, making it easy to modify style through latent space manipulation or blending and scaling of style embeddings.
In a user study, we show that our model outperforms previous state-of-the-art techniques in naturalness of motion, appropriateness for speech, and style portrayal.
arXiv Detail & Related papers (2022-09-15T18:34:30Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
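As a rough, hypothetical sketch of the feature-adapter idea summarized above (assumed dimensions and names, not the released CLIP-Adapter code), the snippet below applies a small bottleneck MLP to frozen CLIP features and blends the result back through a residual ratio, so only the adapter's parameters need to be fine-tuned.

```python
# Minimal sketch (assumed dimensions/names): a CLIP-Adapter-style bottleneck MLP
# applied to frozen CLIP features, mixed back via a residual ratio.
import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    def __init__(self, feat_dim: int = 512, bottleneck: int = 128, alpha: float = 0.2):
        super().__init__()
        self.alpha = alpha  # mixing ratio between adapted and original features
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, bottleneck), nn.ReLU(inplace=True),
            nn.Linear(bottleneck, feat_dim), nn.ReLU(inplace=True),
        )

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, feat_dim) from a frozen CLIP visual (or text) encoder
        adapted = self.mlp(clip_feats)
        return self.alpha * adapted + (1 - self.alpha) * clip_feats

if __name__ == "__main__":
    feats = torch.randn(4, 512)   # stand-in for frozen CLIP embeddings
    refined = FeatureAdapter()(feats)  # only the adapter's parameters are trained
    print(refined.shape)          # torch.Size([4, 512])
```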
- Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech.
During training, the multi-scale style model can be jointly trained with the speech synthesis model in an end-to-end fashion.
arXiv Detail & Related papers (2021-04-08T05:50:09Z)
- Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach [46.50460811211031]
The key challenge is to learn a model that generates gestures for a speaking agent 'A' in the gesturing style of a target speaker 'B'.
We propose Mix-StAGE, which trains a single model for multiple speakers while learning unique style embeddings for each speaker's gestures.
As Mix-StAGE disentangles style and content of gestures, gesturing styles for the same input speech can be altered by simply switching the style embeddings.
arXiv Detail & Related papers (2020-07-24T15:01:02Z)
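As a toy illustration of the style/content disentanglement described for Mix-StAGE above, and of the style-embedding blending mentioned for ZeroEGGS, the sketch below (hypothetical names and dimensions, not either paper's code) conditions a gesture decoder on a learned per-speaker style embedding; restyling the same speech then amounts to swapping or interpolating embeddings.

```python
# Minimal sketch (hypothetical names): swapping or blending learned style
# embeddings to restyle gestures generated from the same speech content.
import torch
import torch.nn as nn

class StyledGestureGenerator(nn.Module):
    def __init__(self, num_speakers: int, audio_dim: int = 128,
                 style_dim: int = 64, pose_dim: int = 48):
        super().__init__()
        # One learned style embedding per speaker, as in Mix-StAGE-like setups.
        self.style_table = nn.Embedding(num_speakers, style_dim)
        self.decoder = nn.GRU(audio_dim + style_dim, 256, batch_first=True)
        self.to_pose = nn.Linear(256, pose_dim)

    def forward(self, audio_feats: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim); style: (batch, style_dim)
        style_seq = style.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
        h, _ = self.decoder(torch.cat([audio_feats, style_seq], dim=-1))
        return self.to_pose(h)  # (batch, frames, pose_dim)

if __name__ == "__main__":
    gen = StyledGestureGenerator(num_speakers=4)
    speech = torch.randn(1, 120, 128)                    # same speech content throughout
    style_a = gen.style_table(torch.tensor([0]))         # speaker A's learned style
    style_b = gen.style_table(torch.tensor([1]))         # speaker B's learned style
    gestures_as_a = gen(speech, style_a)                 # A's gesturing style
    gestures_as_b = gen(speech, style_b)                 # restyled by swapping embeddings
    blended = gen(speech, 0.5 * style_a + 0.5 * style_b) # ZeroEGGS-style blending
    print(gestures_as_a.shape, gestures_as_b.shape, blended.shape)
```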
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.