UniF$^2$ace: A Unified Fine-grained Face Understanding and Generation Model
- URL: http://arxiv.org/abs/2503.08120v4
- Date: Mon, 29 Sep 2025 17:12:48 GMT
- Title: UniF$^2$ace: A Unified Fine-grained Face Understanding and Generation Model
- Authors: Junzhe Li, Sifan Zhou, Liya Guo, Xuerui Qiu, Linrui Xu, Delin Qu, Tingting Long, Chun Fan, Ming Li, Hehe Fan, Jun Liu, Shuicheng Yan,
- Abstract summary: We introduce a novel theoretical framework with a Dual Discrete Diffusion (D3Diff) loss, unifying masked generative models with discrete score matching diffusion. This D3Diff significantly enhances the model's ability to synthesize high-fidelity facial details aligned with text input. We construct UniF$^2$aceD-1M, a large-scale dataset comprising 130K fine-grained image-caption pairs and 1M visual question-answering pairs.
- Score: 62.66515621965686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unified multimodal models (UMMs) have emerged as a powerful paradigm in fundamental cross-modality research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain faces two main challenges: $\textbf{(1)}$ $\textbf{fragmented development}$, with existing methods failing to unify understanding and generation in a single model, hindering progress toward artificial general intelligence; $\textbf{(2) lack of fine-grained facial attributes}$, which are crucial for high-fidelity applications. To address these issues, we propose $\textbf{UniF$^2$ace}$, $\textit{the first UMM specifically tailored for fine-grained face understanding and generation}$. $\textbf{First}$, we introduce a novel theoretical framework with a Dual Discrete Diffusion (D3Diff) loss, unifying masked generative models with discrete score matching diffusion and leading to a more precise approximation of the negative log-likelihood. Moreover, D3Diff significantly enhances the model's ability to synthesize high-fidelity facial details aligned with text input. $\textbf{Second}$, we propose a multi-level grouped Mixture-of-Experts architecture, adaptively incorporating semantic and identity facial embeddings to counteract the attribute-forgetting phenomenon as representations evolve. $\textbf{Finally}$, we construct UniF$^2$aceD-1M, a large-scale dataset comprising 130K fine-grained image-caption pairs and 1M visual question-answering pairs, spanning a much wider range of facial attributes than existing datasets. Extensive experiments demonstrate that UniF$^2$ace outperforms existing models of similar scale in both understanding and generation tasks, with 7.1\% higher Desc-GPT and 6.6\% higher VQA-score, respectively.
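The D3Diff objective above pairs a masked-generative cross-entropy with a discrete score-matching bound on the negative log-likelihood. A minimal numpy sketch of such a dual objective follows; the function name, the $1/t$ reweighting of the diffusion term, and the mixing weight `lam` are illustrative assumptions, not the paper's exact loss:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dual_discrete_diffusion_loss(logits, targets, mask, t, lam=0.5):
    """Toy dual objective on masked tokens: a plain masked cross-entropy
    plus a diffusion-ELBO-style term reweighted by 1/t, where t is the
    masking ratio in (0, 1]. Hypothetical sketch, not the published loss."""
    probs = softmax(logits)                  # (B, L, V) token distributions
    b, l = np.nonzero(mask)                  # masked positions only
    nll = -np.log(probs[b, l, targets[b, l]] + 1e-12)
    l_masked = nll.mean()                    # masked-generative (BERT-style) term
    l_diff = (nll / t).mean()                # NLL-bound term, reweighted by 1/t
    return lam * l_masked + (1.0 - lam) * l_diff
```

With `t = 1` the two terms coincide and the loss reduces to the mean masked-token NLL; smaller `t` upweights lightly-masked steps, mirroring how discrete-diffusion NLL bounds weight the denoising loss.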
Related papers
- Reinforcing Multimodal Understanding and Generation with Dual Self-rewards [56.08202047680044]
Large language models (LLMs) unify cross-modal understanding and generation into a single framework. Current solutions require external supervision (e.g., human feedback or reward models) and only address unidirectional tasks. We introduce a self-supervised dual reward mechanism to reinforce the understanding and generation capabilities of LMMs.
arXiv Detail & Related papers (2025-06-09T17:38:45Z) - Bound by semanticity: universal laws governing the generalization-identification tradeoff [8.437463955457423]
We show that finite-resolution similarity is a fundamental emergent informational constraint, not merely a toy-model artifact. These results provide an exact theory of the generalization-identification trade-off and clarify how semantic resolution shapes the representational capacity of deep networks and brains alike.
arXiv Detail & Related papers (2025-06-01T15:56:26Z) - Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction [16.855296683335308]
Uni-Instruct is motivated by our proposed diffusion expansion theory of the $f$-divergence family. On the CIFAR10 generation benchmark, Uni-Instruct achieves a record-breaking Fréchet Inception Distance (FID) of $\textbf{1.46}$ for unconditional generation. On the ImageNet-$64\times 64$ generation benchmark, Uni-Instruct achieves a new SoTA one-step generation FID of $\textbf{1.02}$, outperforming its 79-step teacher diffusion model by a significant margin of 1.33.
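The FID values quoted above measure the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. A simplified sketch for the diagonal-covariance case is below; the real metric uses full covariance matrices and a matrix square root of their product, so this is only an illustrative special case:

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """Fréchet distance between N(mu1, diag(var1)) and N(mu2, diag(var2)):
    ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2)).
    Diagonal-covariance simplification of the full FID formula."""
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    mu2, var2 = np.asarray(mu2, float), np.asarray(var2, float)
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return float(mean_term + cov_term)
```

Identical feature statistics give a distance of 0; lower FID therefore means generated-feature statistics closer to the real data's.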
arXiv Detail & Related papers (2025-05-27T05:55:45Z) - FLARE: Robot Learning with Implicit World Modeling [87.81846091038676]
$\textbf{FLARE}$ integrates predictive latent world modeling into robot policy learning. $\textbf{FLARE}$ achieves state-of-the-art performance, outperforming prior policy learning baselines by up to 26%. Our results establish $\textbf{FLARE}$ as a general and scalable approach for combining implicit world modeling with high-frequency robotic control.
arXiv Detail & Related papers (2025-05-21T15:33:27Z) - H$^3$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning [25.65324419553667]
We introduce $\textbf{Triply-Hierarchical Diffusion Policy (H$^3$DP)}$, a novel visuomotor learning framework that explicitly incorporates hierarchical structures to strengthen the integration between visual features and action generation. Extensive experiments demonstrate that H$^3$DP yields a $\mathbf{+27.5\%}$ average relative improvement over baselines across $\mathbf{44}$ simulation tasks and achieves superior performance in $\mathbf{4}$ challenging bimanual real-world manipulation tasks.
arXiv Detail & Related papers (2025-05-12T17:59:43Z) - X-Fusion: Introducing New Modality to Frozen Large Language Models [82.3508830643655]
We propose X-Fusion, a framework that extends pretrained Large Language Models for multimodal tasks.
X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM's parameters frozen while integrating vision-specific information for both understanding and generation.
Our experiments demonstrate that X-Fusion consistently outperforms alternative architectures on both image-to-text and text-to-image tasks.
arXiv Detail & Related papers (2025-04-29T17:59:45Z) - Multimodal Representation Learning Techniques for Comprehensive Facial State Analysis [5.795431510723275]
We present a comprehensive pipeline for multimodal facial state analysis.
We introduce a novel Multilevel Multimodal Face Foundation model (MF2) tailored for Action Unit (AU) and emotion recognition.
Experiments show superior performance on AU and emotion detection tasks.
arXiv Detail & Related papers (2025-04-14T16:00:57Z) - UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding [84.87802580670579]
We introduce UniToken, an auto-regressive generation model that encodes visual inputs through a combination of discrete and continuous representations.
Our unified visual encoding framework captures both high-level semantics and low-level details, delivering multidimensional information.
arXiv Detail & Related papers (2025-04-06T09:20:49Z) - Adaptive Multi-Order Graph Regularized NMF with Dual Sparsity for Hyperspectral Unmixing [14.732511023726715]
We propose a novel adaptive multi-order graph regularized NMF method (MOGNMF) with three key features. Experiments on simulated and real hyperspectral data indicate that the proposed method delivers better unmixing results.
arXiv Detail & Related papers (2025-03-25T01:44:02Z) - WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation [26.61175134316007]
Text-to-image (T2I) models are capable of generating high-quality artistic creations and visual content.
We propose $\textbf{WISE}$, the first benchmark specifically designed for $\textbf{W}$orld Knowledge-$\textbf{I}$nformed $\textbf{S}$emantic $\textbf{E}$valuation.
arXiv Detail & Related papers (2025-03-10T12:47:53Z) - Face-MakeUp: Multimodal Facial Prompts for Text-to-Image Generation [0.0]
We build a dataset of 4 million high-quality face image-text pairs (FaceCaptionHQ-4M) based on LAION-Face. We extract/learn multi-scale content features and pose features for the facial image, integrating these into the diffusion model to enhance the preservation of facial identity features.
arXiv Detail & Related papers (2025-01-05T12:46:31Z) - MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection [64.29452783056253]
The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia.
Although existing approaches mainly capture face forgery patterns using image modality, other modalities like fine-grained noises and texts are not fully explored.
We propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities.
arXiv Detail & Related papers (2024-09-15T13:08:59Z) - PLUTUS: A Well Pre-trained Large Unified Transformer can Unveil Financial Time Series Regularities [0.848210898747543]
Financial time series modeling is crucial for understanding and predicting market behaviors.
Traditional models struggle to capture complex patterns due to non-linearity, non-stationarity, and high noise levels.
Inspired by the success of large language models in NLP, we introduce $\textbf{PLUTUS}$, a $\textbf{P}$re-trained $\textbf{L}$arge $\textbf{U}$nified $\textbf{T}$ransformer for financial time series.
PLUTUS is the first open-source, large-scale, pre-trained financial time series model with over one billion parameters.
arXiv Detail & Related papers (2024-08-19T15:59:46Z) - Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition [57.74076383449153]
We propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++.
It models two common event representations simultaneously, i.e., event images and event voxels.
We achieve new state-of-the-art performance on the Bullying10k dataset, i.e., $90.51\%$, which exceeds the second place by $+2.21\%$.
arXiv Detail & Related papers (2024-06-27T02:32:46Z) - M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation [78.77004913030285]
M$^3$GPT is an advanced $\textbf{M}$ultimodal, $\textbf{M}$ultitask framework for motion comprehension and generation.
We employ discrete vector quantization for multimodal conditional signals, such as text, music and motion/dance, enabling seamless integration into a large language model.
M$^3$GPT learns to model the connections and synergies among various motion-relevant tasks.
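The discrete vector quantization mentioned above maps each continuous conditional signal (text, music, motion) to its nearest entry in a learned codebook, yielding token ids a language model can consume. A minimal sketch, with the codebook and inputs chosen purely for illustration:

```python
import numpy as np

def vector_quantize(x, codebook):
    """Map each row of x (N, D) to its nearest codebook vector (K, D) by
    squared Euclidean distance; return discrete ids and quantized embeddings.
    Illustrative nearest-neighbor quantizer, not the paper's trained VQ."""
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K) distances
    ids = d.argmin(axis=1)                                     # token id per row
    return ids, codebook[ids]                                  # ids + embeddings
```

The returned `ids` are what make "seamless integration into a large language model" possible: every modality becomes a sequence over a shared discrete vocabulary.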
arXiv Detail & Related papers (2024-05-25T15:21:59Z) - Do Diffusion Models Learn Semantically Meaningful and Efficient Representations? [15.470940905898757]
We perform experiments on conditional DDPMs learning to generate 2D spherical Gaussian bumps centered at specified $x$- and $y$-positions.
Our results show that the emergence of semantically meaningful latent representations is key to achieving high performance.
arXiv Detail & Related papers (2024-02-05T18:58:38Z) - One-Shot Face Video Re-enactment using Hybrid Latent Spaces of StyleGAN2 [0.7614628596146599]
We propose an end-to-end framework for simultaneously supporting face edits, facial motions and deformations, and facial identity control for video generation.
We employ the StyleGAN2 generator to achieve high-fidelity face video re-enactment at $1024^2$ resolution.
arXiv Detail & Related papers (2023-02-15T18:34:15Z) - Dynamic Prototype Mask for Occluded Person Re-Identification [88.7782299372656]
Existing methods mainly address this issue by employing body clues provided by an extra network to distinguish the visible part.
We propose a novel Dynamic Prototype Mask (DPM) based on two self-evident prior knowledge.
Under this condition, the occluded representation could be well aligned in a selected subspace spontaneously.
arXiv Detail & Related papers (2022-07-19T03:31:13Z) - Unsupervised Semantic Segmentation by Distilling Feature Correspondences [94.73675308961944]
Unsupervised semantic segmentation aims to discover and localize semantically meaningful categories within image corpora without any form of annotation.
We present STEGO, a novel framework that distills unsupervised features into high-quality discrete semantic labels.
STEGO yields a significant improvement over the prior state of the art, on both the CocoStuff and Cityscapes challenges.
arXiv Detail & Related papers (2022-03-16T06:08:47Z) - Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular
Vision-Language Pre-training [120.91411454661741]
We present a pre-trainable Universal-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception and generation.
Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: object and sentence encoders that separately learn the representations of each modality.
arXiv Detail & Related papers (2022-01-11T16:15:07Z) - Improving Robustness and Generality of NLP Models Using Disentangled
Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
arXiv Detail & Related papers (2020-09-21T02:48:46Z) - On the Difference Between the Information Bottleneck and the Deep
Information Bottleneck [81.89141311906552]
We revisit the Deep Variational Information Bottleneck and the assumptions needed for its derivation.
We show how to circumvent this limitation by optimising a lower bound for $I(T;Y)$ for which only the latter Markov chain has to be satisfied.
arXiv Detail & Related papers (2019-12-31T18:31:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.