MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation
- URL: http://arxiv.org/abs/2408.03312v1
- Date: Tue, 6 Aug 2024 17:29:01 GMT
- Title: MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation
- Authors: Xiaofeng Mao, Zhengkai Jiang, Qilin Wang, Chencan Fu, Jiangning Zhang, Jiafu Wu, Yabiao Wang, Chengjie Wang, Wei Li, Mingmin Chi
- Abstract summary: We introduce a novel Masked Diffusion Transformer for co-speech gesture generation, referred to as MDT-A2G.
This model employs a mask modeling scheme specifically designed to strengthen temporal relation learning across gesture sequences.
Experimental results demonstrate that MDT-A2G excels in gesture generation, boasting a learning speed that is over 6$\times$ faster than traditional diffusion transformers.
- Score: 44.74056930805525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in Diffusion Transformers have substantially improved the generation of high-quality 2D images, 3D videos, and 3D shapes. However, the effectiveness of the Transformer architecture for co-speech gesture generation remains relatively unexplored, as prior methods have predominantly employed Convolutional Neural Networks (CNNs) or only a few simple Transformer layers. To bridge this research gap, we introduce a novel Masked Diffusion Transformer for co-speech gesture generation, referred to as MDT-A2G, which performs the denoising process directly on gesture sequences. To enhance contextual reasoning over temporally aligned, speech-driven gestures, the model employs a mask modeling scheme specifically designed to strengthen temporal relation learning across gesture sequences, thereby expediting learning and producing coherent, realistic motions. Beyond audio, our MDT-A2G model also integrates multi-modal information encompassing text, emotion, and identity. Furthermore, we propose an efficient inference strategy that reduces denoising computation by reusing previously calculated results, achieving a speedup with negligible performance degradation. Experimental results demonstrate that MDT-A2G excels in gesture generation, with a learning speed over 6$\times$ faster than traditional diffusion transformers and an inference speed 5.7$\times$ faster than the standard diffusion model.
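To make the mask modeling scheme described in the abstract concrete, below is a minimal, hedged sketch of one training step of a masked diffusion transformer over gesture sequences, assuming a PyTorch-style setup. The module name `GestureDenoiser`, the pose and feature dimensions, the fused multi-modal conditioning tensor, the toy linear noise schedule, and the 30% frame-masking ratio are illustrative assumptions, not details taken from the paper or its released code.

```python
# Hedged sketch: masked diffusion training over gesture sequences (not the authors' code).
import torch
import torch.nn as nn


class GestureDenoiser(nn.Module):
    """Transformer that denoises a gesture sequence conditioned on fused speech features."""

    def __init__(self, pose_dim=141, d_model=512, n_heads=8, n_layers=8, cond_dim=512):
        super().__init__()
        self.in_proj = nn.Linear(pose_dim, d_model)
        self.cond_proj = nn.Linear(cond_dim, d_model)
        self.time_embed = nn.Embedding(1000, d_model)          # diffusion timestep embedding
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, pose_dim)

    def forward(self, noisy_gestures, t, cond, frame_mask=None):
        # noisy_gestures: (B, T, pose_dim); cond: (B, T, cond_dim) fused audio/text/emotion/identity features
        x = self.in_proj(noisy_gestures)
        if frame_mask is not None:
            # Replace masked frames with a learnable token so the model must infer
            # them from temporal context (the mask modeling idea).
            x = torch.where(frame_mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = x + self.cond_proj(cond) + self.time_embed(t).unsqueeze(1)
        return self.out_proj(self.encoder(x))


def training_step(model, gestures, cond, mask_ratio=0.3):
    """One diffusion training step with random temporal frame masking."""
    b, t, _ = gestures.shape
    timesteps = torch.randint(0, 1000, (b,))
    noise = torch.randn_like(gestures)
    alpha = (1.0 - timesteps.float() / 1000).view(b, 1, 1)     # toy linear schedule
    noisy = alpha.sqrt() * gestures + (1 - alpha).sqrt() * noise
    frame_mask = torch.rand(b, t) < mask_ratio                 # mask a random subset of frames
    pred = model(noisy, timesteps, cond, frame_mask)
    return nn.functional.mse_loss(pred, noise)                 # standard epsilon-prediction loss
```

The paper's efficient inference strategy (reusing previously calculated results across denoising steps) is not shown here; in a sketch like this it would amount to caching intermediate features between adjacent timesteps and skipping their recomputation.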
Related papers
- DiM-Gestor: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 [6.6954598568836925]
DiM-Gestor is an end-to-end generative model leveraging the Mamba-2 architecture.
A fuzzy feature extractor and a speech-to-gesture mapping module are built on the Mamba-2 architecture.
Our approach delivers competitive results, reduces memory usage by approximately 2.4 times, and improves inference speed by 2 to 4 times.
arXiv Detail & Related papers (2024-11-23T08:02:03Z) - Re-Parameterization of Lightweight Transformer for On-Device Speech Emotion Recognition [10.302458835329539]
We introduce a new method, namely Transformer Re-parameterization, to boost the performance of lightweight Transformer models.
Experimental results show that our proposed method consistently improves the performance of lightweight Transformers, even making them comparable to large models.
arXiv Detail & Related papers (2024-11-14T10:36:19Z) - Dynamic Diffusion Transformer [67.13876021157887]
Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs.
We propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation.
With 3% additional fine-tuning, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73$\times$, and achieves a competitive FID score of 2.07 on ImageNet; a hedged sketch of this dynamic-computation idea appears after the related papers list.
arXiv Detail & Related papers (2024-10-04T14:14:28Z) - DiffSpeaker: Speech-Driven 3D Facial Animation with Diffusion Transformer [110.32147183360843]
Speech-driven 3D facial animation is important for many multimedia applications.
Recent work has shown promise in using either Diffusion models or Transformer architectures for this task.
We present DiffSpeaker, a Transformer-based network equipped with novel biased conditional attention modules.
arXiv Detail & Related papers (2024-02-08T14:39:16Z) - MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers [30.924202893340087]
State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks.
This paper breaks down the text-based video editing task into two stages.
First, we leverage a pre-trained text-to-image diffusion model to simultaneously edit a few key frames in a zero-shot way.
Second, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers.
arXiv Detail & Related papers (2023-12-19T07:05:39Z) - VDT: General-purpose Video Diffusion Transformers via Mask Modeling [62.71878864360634]
Video Diffusion Transformer (VDT) pioneers the use of transformers in diffusion-based video generation.
We propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios.
arXiv Detail & Related papers (2023-05-22T17:59:45Z) - MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer [158.06850125920923]
Diffusion probabilistic models (DPMs) often lack the contextual reasoning ability to learn the relations among object parts in an image.
We propose a mask latent modeling scheme to explicitly enhance the DPMs' ability to learn contextual relations among the semantic parts of objects in an image.
Experimental results show that MDTv2 achieves superior image synthesis performance, e.g., a new SOTA FID score of 1.58 on the ImageNet dataset, and has more than 10x faster learning speed than the previous SOTA DiT.
arXiv Detail & Related papers (2023-03-25T07:47:21Z) - MODETR: Moving Object Detection with Transformers [2.4366811507669124]
Moving Object Detection (MOD) is a crucial task for the Autonomous Driving pipeline.
In this paper, we tackle this problem through multi-head attention mechanisms, both across the spatial and motion streams.
We propose MODETR, a Moving Object DEtection TRansformer network composed of multi-stream transformers for both spatial and motion modalities.
arXiv Detail & Related papers (2021-06-21T21:56:46Z) - Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
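As referenced in the Dynamic Diffusion Transformer (DyDiT) entry above, the following is a toy, hedged illustration of adjusting computation along both the timestep and spatial dimensions: shrinking a block's active width as the denoising timestep decreases and skipping less important tokens spatially. The width heuristic, the importance-based token selection, and all names and shapes are assumptions made for illustration, not DyDiT's actual mechanism.

```python
# Hedged illustration of timestep- and token-wise computation reduction (not DyDiT's implementation).
import torch
import torch.nn as nn


class DynamicWidthBlock(nn.Module):
    """MLP block whose active hidden width depends on the diffusion timestep."""

    def __init__(self, dim=384, hidden=1536):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, t, max_t=1000):
        # Use fewer hidden channels at low-noise (small t) steps -- a hand-written
        # stand-in for a learned, timestep-wise width router.
        keep = max(1, int(self.fc1.out_features * (0.25 + 0.75 * t / max_t)))
        h = torch.relu(self.fc1(x)[..., :keep])
        w = self.fc2.weight[:, :keep]                          # matching slice of fc2's weight
        return x + nn.functional.linear(h, w, self.fc2.bias)   # residual connection


def spatial_token_skip(tokens, importance, keep_ratio=0.5):
    """Keep only the most 'important' tokens; the rest bypass the block entirely."""
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = importance.topk(k, dim=1).indices                    # (B, k) indices of kept tokens
    kept = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
    return kept, idx
```

In practice such routing would be learned and trained jointly with the backbone; the hand-written heuristics here only convey the shape of the computation saving.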