Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation
- URL: http://arxiv.org/abs/2509.00428v1
- Date: Sat, 30 Aug 2025 09:21:07 GMT
- Title: Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation
- Authors: Xuechao Zou, Shun Zhang, Xing Fu, Yue Li, Kai Li, Yushe Cao, Congyan Lang, Pin Tao, Junliang Xing,
- Abstract summary: Face-MoGLE is a novel framework for semantic-decoupled latent modeling. It provides high-quality, controllable face generation with strong potential in generative modeling and security applications.
- Score: 37.40162325131809
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Controllable face generation poses critical challenges in generative modeling due to the intricate balance required between semantic controllability and photorealism. While existing approaches struggle with disentangling semantic controls from generation pipelines, we revisit the architectural potential of Diffusion Transformers (DiTs) through the lens of expert specialization. This paper introduces Face-MoGLE, a novel framework featuring: (1) Semantic-decoupled latent modeling through mask-conditioned space factorization, enabling precise attribute manipulation; (2) A mixture of global and local experts that captures holistic structure and region-level semantics for fine-grained controllability; (3) A dynamic gating network producing time-dependent coefficients that evolve with diffusion steps and spatial locations. Face-MoGLE provides a powerful and flexible solution for high-quality, controllable face generation, with strong potential in generative modeling and security applications. Extensive experiments demonstrate its effectiveness in multimodal and monomodal face generation settings and its robust zero-shot generalization capability. Project page is available at https://github.com/XavierJiezou/Face-MoGLE.
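The core mechanism in points (2) and (3), a mixture of global and local experts blended by gating coefficients that depend on both the diffusion timestep and the spatial location, can be sketched in a toy form. The snippet below is a minimal illustration only: the linear "experts", the gating weights `W_gate`, and all dimensions are assumptions for demonstration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative; not the paper's actual sizes).
D = 8   # latent feature dimension per token
T = 4   # number of spatial tokens
K = 3   # one "global" expert plus two "local" (region) experts

# Each expert is a plain linear map here; in the paper these are
# transformer branches capturing holistic vs. region-level semantics.
experts = [rng.standard_normal((D, D)) * 0.1 for _ in range(K)]
W_gate = rng.standard_normal((D, K)) * 0.1  # gating weights (assumed)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gate(tokens, t):
    # Gating logits depend on each token's features (spatial location)
    # and on the diffusion timestep t, so the mixture coefficients
    # evolve over both space and time, as the abstract describes.
    logits = tokens @ W_gate + t
    return softmax(logits, axis=-1)  # shape (T, K); each row sums to 1

def moe_forward(tokens, t):
    coeffs = gate(tokens, t)                                  # (T, K)
    outs = np.stack([tokens @ E for E in experts], axis=-1)   # (T, D, K)
    return (outs * coeffs[:, None, :]).sum(axis=-1)           # (T, D)

tokens = rng.standard_normal((T, D))
out = moe_forward(tokens, t=0.5)
print(out.shape)  # (4, 8)
```

In this sketch the same gating network is queried at every diffusion step, so early steps can lean on the global expert for holistic structure while later steps shift weight toward local experts for region detail.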
Related papers
- IdGlow: Dynamic Identity Modulation for Multi-Subject Generation [23.20674988897558]
We present IdGlow, a mask-free, progressive two-stage framework built upon Flow Matching diffusion models. In the supervised fine-tuning (SFT) stage, we introduce task-adaptive timestep scheduling aligned with diffusion generative dynamics. In the second stage, we design a Fine-Grained Group-Level Direct Preference Optimization (DPO) with a weighted margin formulation to simultaneously eliminate multi-subject artifacts.
arXiv Detail & Related papers (2026-02-28T11:56:34Z)
- StdGEN++: A Comprehensive System for Semantic-Decomposed 3D Character Generation [57.06461272772509]
StdGEN++ is a novel and comprehensive system for generating high-fidelity, semantically decomposed 3D characters from diverse inputs. It achieves state-of-the-art performance, significantly outperforming existing methods in geometric accuracy and semantic disentanglement. The resulting structural independence unlocks advanced downstream capabilities, including non-destructive editing, physics-compliant animation, and gaze tracking.
arXiv Detail & Related papers (2026-01-12T15:41:27Z)
- JCo-MVTON: Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-on [15.59886380067986]
JCo-MVTON is a novel framework that overcomes limitations by integrating diffusion-based image generation with multi-modal conditional fusion. It achieves state-of-the-art performance on public benchmarks including DressCode, significantly outperforming existing methods in both quantitative metrics and human evaluations.
arXiv Detail & Related papers (2025-08-25T02:43:57Z)
- ExpertGen: Training-Free Expert Guidance for Controllable Text-to-Face Generation [49.294779074232686]
ExpertGen is a training-free framework that leverages pre-trained expert models to guide generation with fine control. We show qualitatively and quantitatively that expert models can guide the generation process with high precision.
arXiv Detail & Related papers (2025-05-22T20:09:21Z)
- LAMM-ViT: AI Face Detection via Layer-Aware Modulation of Region-Guided Attention [4.0810988694972385]
We introduce Layer-aware Mask Modulation Vision Transformer (LAMM-ViT), a Vision Transformer designed for robust facial forgery detection. LAMM-ViT integrates Region-Guided Multi-Head Attention (RG-MHA) and Layer-aware Mask Modulation (LAMM) components within each layer. In cross-model generalization tests, LAMM-ViT demonstrates superior performance, achieving 94.09% mean ACC and 98.62% mean AP.
arXiv Detail & Related papers (2025-05-12T16:42:19Z)
- DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers [86.5541501589166]
DiffMoE introduces a batch-level global token pool that enables experts to access global token distributions during training. It achieves state-of-the-art performance among diffusion models on the ImageNet benchmark. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation.
arXiv Detail & Related papers (2025-03-18T17:57:07Z)
- Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models [106.94827590977337]
We propose a novel world model for Multi-Agent RL (MARL) that learns decentralized local dynamics for scalability.
We also introduce a Perceiver Transformer as an effective solution to enable centralized representation aggregation.
Results on Starcraft Multi-Agent Challenge (SMAC) show that it outperforms strong model-free approaches and existing model-based methods in both sample efficiency and overall performance.
arXiv Detail & Related papers (2024-06-22T12:40:03Z)
- Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control [59.954322727683746]
Face-Adapter is designed for high-precision and high-fidelity face editing for pre-trained diffusion models.
Face-Adapter achieves comparable or even superior performance in terms of motion control precision, ID retention capability, and generation quality.
arXiv Detail & Related papers (2024-05-21T17:50:12Z)
- Controllable Face Synthesis with Semantic Latent Diffusion Models [6.438244172631555]
We propose a SIS framework based on a novel Latent Diffusion Model architecture for human face generation and editing.
The proposed system utilizes both SPADE normalization and cross-attention layers to merge shape and style information and, by doing so, allows for precise control over each semantic part of the human face.
arXiv Detail & Related papers (2024-03-19T14:02:13Z)
- Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator [29.58245990622227]
Multimodal-driven talking face generation refers to animating a portrait with the given pose, expression, and gaze transferred from the driving image and video, or estimated from the text and audio.
Existing methods ignore the potential of the text modality, and their generators mainly follow the source-oriented feature paradigm coupled with unstable GAN frameworks.
We derive a novel paradigm free of unstable seesaw-style optimization, resulting in simple, stable, and effective training and inference schemes.
arXiv Detail & Related papers (2023-05-04T07:01:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.