Say Anything with Any Style
- URL: http://arxiv.org/abs/2403.06363v2
- Date: Wed, 13 Mar 2024 01:37:12 GMT
- Title: Say Anything with Any Style
- Authors: Shuai Tan and Bin Ji and Yu Ding and Ye Pan
- Abstract summary: Say Anything with Any Style (SAAS) queries the discrete style representation via a generative model with a learned style codebook.
Our approach surpasses state-of-the-art methods in terms of both lip-synchronization and stylized expression.
- Score: 9.50806457742173
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating stylized talking heads with diverse head motions is crucial for
achieving natural-looking videos but remains challenging. Previous works
either adopt a regressive method to capture the speaking style, resulting in a
coarse style that is averaged across all training data, or employ a universal
network to synthesize videos with different styles which causes suboptimal
performance. To address these, we propose a novel dynamic-weight method, namely
Say Anything with Any Style (SAAS), which queries the discrete style
representation via a generative model with a learned style codebook.
Specifically, we develop a multi-task VQ-VAE that incorporates three closely
related tasks to learn a style codebook as a prior for style extraction. This
discrete prior, along with the generative model, enhances the precision and
robustness when extracting the speaking styles of the given style clips. By
utilizing the extracted style, a residual architecture comprising a canonical
branch and style-specific branch is employed to predict the mouth shapes
conditioned on any driving audio while transferring the speaking style from the
source to any desired one. To adapt to different speaking styles, we steer
clear of employing a universal network by exploring an elaborate HyperStyle to
produce the style-specific weights offset for the style branch. Furthermore, we
construct a pose generator and a pose codebook to store the quantized pose
representation, allowing us to sample diverse head motions aligned with the
audio and the extracted style. Experiments demonstrate that our approach
surpasses state-of-the-art methods in terms of both lip-synchronization and
stylized expression. Besides, we extend our SAAS to the video-driven style
editing field and achieve satisfactory performance.
Related papers
- SVP: Style-Enhanced Vivid Portrait Talking Head Diffusion Model [66.34929233269409]
Talking Head Generation (THG) is an important task with broad application prospects in various fields such as digital humans, film production, and virtual reality.
We propose a novel framework named Style-Enhanced Vivid Portrait (SVP) which fully leverages style-related information in THG.
Our model generates diverse, vivid, and high-quality videos with flexible control over intrinsic styles, outperforming existing state-of-the-art methods.
arXiv Detail & Related papers (2024-09-05T06:27:32Z)
- StyleShot: A Snapshot on Any Style [20.41380860802149]
We show that a good style representation is crucial and sufficient for generalized style transfer without test-time tuning.
We achieve this through constructing a style-aware encoder and a well-organized style dataset called StyleGallery.
We highlight that our approach, named StyleShot, is simple yet effective in mimicking various desired styles without test-time tuning.
arXiv Detail & Related papers (2024-07-01T16:05:18Z)
- StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter [78.75422651890776]
StyleCrafter is a generic method that enhances pre-trained T2V models with a style control adapter.
To promote content-style disentanglement, we remove style descriptions from the text prompt and extract style information solely from the reference image.
StyleCrafter efficiently generates high-quality stylized videos that align with the content of the texts and resemble the style of the reference images.
arXiv Detail & Related papers (2023-12-01T03:53:21Z)
- DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models [24.401443462720135]
We propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder.
In particular, our style includes the generation of head poses, thereby enhancing user perception.
We address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset.
arXiv Detail & Related papers (2023-09-30T17:01:18Z)
- Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences [49.66987347397398]
Few-Shot Stylized Visual Captioning aims to generate captions in any desired style, using only a few examples as guidance during inference.
We propose a framework called FS-StyleCap for this task, which utilizes a conditional encoder-decoder language model and a visual projection module.
arXiv Detail & Related papers (2023-07-31T04:26:01Z)
- StylerDALLE: Language-Guided Style Transfer Using a Vector-Quantized Tokenizer of a Large-Scale Generative Model [64.26721402514957]
We propose StylerDALLE, a style transfer method that uses natural language to describe abstract art styles.
Specifically, we formulate the language-guided style transfer task as a non-autoregressive token sequence translation.
To incorporate style information, we propose a Reinforcement Learning strategy with CLIP-based language supervision.
arXiv Detail & Related papers (2023-03-16T12:44:44Z)
- StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles [43.12918949398099]
We propose a one-shot style-controllable talking face generation framework.
We aim to attain a speaking style from an arbitrary reference speaking video.
We then drive the one-shot portrait to speak with the reference speaking style and another piece of audio.
arXiv Detail & Related papers (2023-01-03T13:16:24Z)
- Domain Enhanced Arbitrary Image Style Transfer via Contrastive Learning [84.8813842101747]
Contrastive Arbitrary Style Transfer (CAST) is a new style representation learning and style transfer method via contrastive learning.
Our framework consists of three key components, i.e., a multi-layer style projector for style code encoding, a domain enhancement module for effective learning of style distribution, and a generative network for image style transfer.
arXiv Detail & Related papers (2022-05-19T13:11:24Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.