Multivariate Diffusion Transformer with Decoupled Attention for High-Fidelity Mask-Text Collaborative Facial Generation
- URL: http://arxiv.org/abs/2511.12631v1
- Date: Sun, 16 Nov 2025 14:52:54 GMT
- Title: Multivariate Diffusion Transformer with Decoupled Attention for High-Fidelity Mask-Text Collaborative Facial Generation
- Authors: Yushe Cao, Dianxi Shi, Xing Fu, Xuechao Zou, Haikuo Peng, Xueqi Li, Chun Yu, Junliang Xing
- Abstract summary: MDiTFace is a customized diffusion transformer framework that employs a unified tokenization strategy to process semantic mask and text inputs. Extensive experiments demonstrate that MDiTFace significantly outperforms other competing methods in terms of both facial fidelity and conditional consistency.
- Score: 33.45651294176388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While significant progress has been achieved in multimodal facial generation using semantic masks and textual descriptions, conventional feature fusion approaches often fail to enable effective cross-modal interactions, thereby leading to suboptimal generation outcomes. To address this challenge, we introduce MDiTFace, a customized diffusion transformer framework that employs a unified tokenization strategy to process semantic mask and text inputs, eliminating discrepancies between heterogeneous modality representations. The framework facilitates comprehensive multimodal feature interaction through stacked, newly designed multivariate transformer blocks that process all conditions synchronously. Additionally, we design a novel decoupled attention mechanism by dissociating implicit dependencies between mask tokens and temporal embeddings. This mechanism segregates internal computations into dynamic and static pathways, enabling caching and reuse of features computed in static pathways after the initial calculation, thereby reducing the additional computational overhead introduced by the mask condition by over 94% while maintaining performance. Extensive experiments demonstrate that MDiTFace significantly outperforms other competing methods in terms of both facial fidelity and conditional consistency.
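The static/dynamic split lends itself to a simple caching pattern. The sketch below is a minimal PyTorch reading of that idea, assuming the semantic mask is fixed across the whole sampling trajectory; the module name, the linear projections, and the timestep modulation are illustrative stand-ins, not the paper's implementation.

```python
# Hypothetical sketch of the decoupled-attention idea: mask-token features that do
# not depend on the timestep embedding (static pathway) are computed once and
# cached, while the timestep-conditioned (dynamic) pathway is recomputed at every
# denoising step. All module and variable names are illustrative.
import torch
import torch.nn as nn


class DecoupledMaskAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.static_proj = nn.Linear(dim, dim)   # timestep-independent path
        self.dynamic_mod = nn.Linear(dim, dim)   # timestep-dependent modulation
        self._static_cache = None                # reused across timesteps

    def forward(self, image_tokens, mask_tokens, t_emb, reuse_static: bool = True):
        # Static pathway: keys/values derived from mask tokens only. Since the
        # semantic mask does not change during sampling, this result can be
        # cached after the first denoising step and reused thereafter.
        if self._static_cache is None or not reuse_static:
            self._static_cache = self.static_proj(mask_tokens)
        kv = self._static_cache
        # Dynamic pathway: the query side is modulated by the timestep embedding,
        # so it must be recomputed at every step.
        q = image_tokens * (1.0 + self.dynamic_mod(t_emb)).unsqueeze(1)
        out, _ = self.attn(q, kv, kv)
        return out

    def reset_cache(self):
        self._static_cache = None  # call when the mask condition changes
```

Because only the dynamic pathway must be recomputed at each denoising step, most of the mask-conditioning cost is paid once, which is consistent with the over-94% overhead reduction the abstract reports.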
Related papers
- Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion [60.186310080523135]
Bifurcation of generative modeling into autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images) hinders the development of truly unified multimodal systems. We propose CoM-DAD, a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual-process. Our method demonstrates superior stability over standard masked modeling, establishing a new paradigm for scalable, unified text-image generation.
arXiv Detail & Related papers (2026-01-07T16:21:19Z)
- Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach [99.80480649258557]
DiTFuse is an instruction-driven framework that performs semantics-aware fusion within a single model. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention.
arXiv Detail & Related papers (2025-12-08T05:04:54Z)
- InfMasking: Unleashing Synergistic Information by Contrastive Multimodal Interactions [66.45467539731288]
In multimodal representation, synergistic interactions between modalities provide complementary information and create unique outcomes. Existing methods may struggle to capture the full spectrum of synergistic information, leading to suboptimal performance in tasks where such interactions are critical. We introduce InfMasking, a contrastive synergistic information extraction method designed to enhance synergistic information through an Infinite Masking strategy.
arXiv Detail & Related papers (2025-09-28T09:31:59Z)
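The summary names the mechanism but not its details, so the following is only a loose sketch of the general recipe it gestures at: pair a fresh random feature mask with a standard InfoNCE contrastive loss, so the representation cannot lean on any fixed subset of feature dimensions. The function name, masking scheme, and hyperparameters are assumptions, not InfMasking's actual design.

```python
# Loose sketch only: random feature masking combined with a standard InfoNCE
# contrastive loss. Masking ratio and fusion details are illustrative choices.
import torch
import torch.nn.functional as F


def masked_contrastive_loss(za, zb, mask_ratio=0.5, temperature=0.1):
    """za, zb: (batch, dim) embeddings of two views/modalities of the same samples."""
    keep = (torch.rand_like(za) > mask_ratio).float()   # fresh random mask per call
    za = F.normalize(za * keep, dim=-1)
    zb = F.normalize(zb, dim=-1)
    logits = za @ zb.t() / temperature                  # (batch, batch) similarities
    targets = torch.arange(za.size(0), device=za.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)
```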
- MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing [53.98607267063729]
MotionVerse is a framework to comprehend, generate, and edit human motion in both single-person and multi-person scenarios. We employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens. We also introduce a Delay Parallel Modeling strategy, which temporally staggers the encoding of residual token streams.
arXiv Detail & Related papers (2025-09-28T04:20:56Z)
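Residual quantization is concrete enough to sketch: each stage quantizes what the previous stage left unexplained, producing one discrete token stream per stage. A minimal sketch, assuming a PyTorch-style tokenizer with illustrative codebook sizes, not MotionVerse's implementation:

```python
# Minimal residual-quantization sketch: each codebook stage encodes the residual
# left by the previous stage, yielding multiple parallel streams of discrete
# tokens per motion sequence. Names and sizes are illustrative.
import torch
import torch.nn as nn


class ResidualQuantizer(nn.Module):
    def __init__(self, num_stages: int = 4, codebook_size: int = 512, dim: int = 256):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_stages)
        )

    def forward(self, x):
        # x: (batch, frames, dim) continuous motion features
        residual, token_streams = x, []
        for codebook in self.codebooks:
            # Nearest-codeword lookup for the current residual.
            dists = (residual.unsqueeze(-2) - codebook.weight).pow(2).sum(-1)  # (B, T, K)
            ids = dists.argmin(dim=-1)                                         # (B, T)
            token_streams.append(ids)
            residual = residual - codebook(ids)        # quantize-and-subtract
        return token_streams  # num_stages discrete streams per sequence
```

The Delay Parallel Modeling strategy mentioned in the summary would then stagger these per-stage streams in time during encoding, rather than predicting every stage of a frame simultaneously.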
- JCo-MVTON: Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-on [15.59886380067986]
JCo-MVTON is a novel framework that overcomes the limitations of existing approaches by integrating diffusion-based image generation with multi-modal conditional fusion. It achieves state-of-the-art performance on public benchmarks including DressCode, significantly outperforming existing methods in both quantitative metrics and human evaluations.
arXiv Detail & Related papers (2025-08-25T02:43:57Z)
- DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers [86.5541501589166]
DiffMoE introduces a batch-level global token pool that enables experts to access global token distributions during training. It achieves state-of-the-art performance among diffusion models on the ImageNet benchmark. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation.
arXiv Detail & Related papers (2025-03-18T17:57:07Z)
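The batch-level pool is easy to illustrate: flatten all tokens in the batch before routing, so each expert's top-k selection sees the global token distribution instead of one sample at a time. A minimal sketch; the router, capacity rule, and shapes are assumptions rather than DiffMoE's actual code:

```python
# Illustrative batch-level routing: tokens from the whole batch are flattened into
# one pool so the router's top-k selection sees the global token distribution
# rather than each sample in isolation.
import torch
import torch.nn as nn


def batch_level_route(tokens: torch.Tensor, router: nn.Linear, capacity: int):
    """tokens: (batch, seq, dim) -> per-expert token assignments over the pooled batch."""
    b, s, d = tokens.shape
    pool = tokens.reshape(b * s, d)              # global token pool
    logits = router(pool)                        # (b*s, num_experts)
    # Each expert picks its `capacity` highest-affinity tokens from the whole pool,
    # so token allocation can vary per sample instead of being fixed per sequence.
    scores, token_ids = logits.t().topk(capacity, dim=-1)   # (num_experts, capacity)
    return scores, token_ids


# Usage: route a pooled batch of 2x16 tokens among 4 experts.
router = nn.Linear(64, 4)
scores, token_ids = batch_level_route(torch.randn(2, 16, 64), router, capacity=8)
```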
- Soften the Mask: Adaptive Temporal Soft Mask for Efficient Dynamic Facial Expression Recognition [4.151073288078749]
Dynamic Facial Expression Recognition (DFER) facilitates the understanding of psychological intentions through non-verbal communication. Existing methods struggle to manage irrelevant information, such as background noise and redundant semantics, which impacts both efficiency and effectiveness. We propose a novel supervised temporal soft masked autoencoder network for DFER, namely AdaTosk, which integrates a parallel supervised classification branch with the self-supervised reconstruction branch.
arXiv Detail & Related papers (2025-02-28T12:45:08Z)
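The "soft" in the temporal soft mask can be read as replacing a binary keep/drop decision with a continuous per-frame weight. A small illustrative sketch under that reading; the module and its scorer are hypothetical, not AdaTosk's design:

```python
# Minimal temporal *soft* mask illustration: instead of dropping frames with a
# binary mask, each frame gets a learned continuous weight, letting the model
# down-weight background noise and redundant frames. Names are hypothetical.
import torch
import torch.nn as nn


class TemporalSoftMask(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # per-frame importance score

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim) per-frame features
        weights = torch.sigmoid(self.scorer(frames))   # (batch, time, 1) in [0, 1]
        return frames * weights                        # softly masked features
```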
- Cross-Task Multi-Branch Vision Transformer for Facial Expression and Mask Wearing Classification [13.995453649985732]
We propose a unified multi-branch vision transformer for facial expression recognition and mask wearing classification tasks.
Our approach extracts shared features for both tasks using a dual-branch architecture.
Our proposed framework reduces the overall complexity compared with using separate networks for both tasks.
arXiv Detail & Related papers (2024-04-22T22:02:19Z)
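The shared-feature, dual-branch idea reduces to one encoder feeding two task-specific heads. A schematic sketch; the backbone, head sizes, and class counts are placeholders rather than the paper's configuration:

```python
# Schematic of the shared-backbone, dual-head idea: one encoder feeds two
# task-specific classification heads, so the two tasks share most parameters.
import torch
import torch.nn as nn


class DualTaskHead(nn.Module):
    def __init__(self, encoder: nn.Module, dim: int, n_expr: int = 7, n_mask: int = 2):
        super().__init__()
        self.encoder = encoder                   # shared feature extractor
        self.expr_head = nn.Linear(dim, n_expr)  # facial expression classes
        self.mask_head = nn.Linear(dim, n_mask)  # mask wearing yes/no

    def forward(self, x):
        feats = self.encoder(x)                  # shared features for both tasks
        return self.expr_head(feats), self.mask_head(feats)
```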
- Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition [48.84506301960988]
Cued Speech (CS) is a pure visual coding method used by hearing-impaired people.
Automatic CS recognition (ACSR) seeks to transcribe visual cues of speech into text.
arXiv Detail & Related papers (2024-01-31T05:20:29Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [55.12082817901671]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT). MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies. Our results demonstrate that MaPeT achieves competitive performance on ImageNet, compared to baselines and competitors under the same model setting.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
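Permuted autoregressive prediction can be sketched as drawing a random factorization order over patch tokens and masking attention so each token sees only its predecessors in that order. The helper below is an illustrative reading of that XLNet-style idea, not MaPeT's code:

```python
# Sketch of permuted autoregressive prediction over patch tokens: sample a random
# factorization order, then build an attention mask so each position may only
# attend to tokens earlier in that order.
import torch


def permuted_causal_mask(num_tokens: int) -> torch.Tensor:
    order = torch.randperm(num_tokens)          # random prediction order
    rank = torch.empty_like(order)
    rank[order] = torch.arange(num_tokens)      # rank[i] = position of token i in order
    # allowed[i, j] is True when token i may attend to token j (j precedes i in
    # the sampled order). Invert with ~allowed for APIs where True means "masked
    # out", as in PyTorch's boolean attn_mask convention.
    allowed = rank.unsqueeze(1) > rank.unsqueeze(0)
    return allowed
```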
- FER-former: Multi-modal Transformer for Facial Expression Recognition [14.219492977523682]
A novel multifarious supervision-steering Transformer for Facial Expression Recognition is proposed in this paper.
Our approach features multi-granularity embedding integration, hybrid self-attention scheme, and heterogeneous domain-steering supervision.
Experiments on popular benchmarks demonstrate the superiority of the proposed FER-former over existing state-of-the-art methods.
arXiv Detail & Related papers (2023-03-23T02:29:53Z)