Generative Pre-training for Subjective Tasks: A Diffusion Transformer-Based Framework for Facial Beauty Prediction
- URL: http://arxiv.org/abs/2507.20363v1
- Date: Sun, 27 Jul 2025 17:33:51 GMT
- Title: Generative Pre-training for Subjective Tasks: A Diffusion Transformer-Based Framework for Facial Beauty Prediction
- Authors: Djamel Eddine Boukhari, Ali Chemsa
- Abstract summary: Facial Beauty Prediction (FBP) is a challenging computer vision task due to its subjective nature and the subtle, holistic features that influence human perception. In this paper, we propose a novel two-stage framework that leverages the power of generative models to create a superior, domain-specific feature extractor. Our method, termed Diff-FBP, sets a new state-of-the-art on the FBP5500 benchmark, achieving a Pearson Correlation Coefficient (PCC) of 0.932, significantly outperforming prior art based on general-purpose pre-training.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Facial Beauty Prediction (FBP) is a challenging computer vision task due to its subjective nature and the subtle, holistic features that influence human perception. Prevailing methods, often based on deep convolutional networks or standard Vision Transformers pre-trained on generic object classification (e.g., ImageNet), struggle to learn feature representations that are truly aligned with high-level aesthetic assessment. In this paper, we propose a novel two-stage framework that leverages the power of generative models to create a superior, domain-specific feature extractor. In the first stage, we pre-train a Diffusion Transformer on a large-scale, unlabeled facial dataset (FFHQ) through a self-supervised denoising task. This process forces the model to learn the fundamental data distribution of human faces, capturing nuanced details and structural priors essential for aesthetic evaluation. In the second stage, the pre-trained and frozen encoder of our Diffusion Transformer is used as a backbone feature extractor, with only a lightweight regression head being fine-tuned on the target FBP dataset (FBP5500). Our method, termed Diff-FBP, sets a new state-of-the-art on the FBP5500 benchmark, achieving a Pearson Correlation Coefficient (PCC) of 0.932, significantly outperforming prior art based on general-purpose pre-training. Extensive ablation studies validate that our generative pre-training strategy is the key contributor to this performance leap, creating feature representations that are more semantically potent for subjective visual tasks.
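For readers who want the shape of the second stage, here is a minimal PyTorch sketch of the frozen-backbone fine-tuning the abstract describes. The encoder below is only a stand-in for the pre-trained Diffusion Transformer encoder (whose architecture and weights are not given here), and all layer sizes and names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DiffFBPRegressor(nn.Module):
    """Stage-2 sketch: frozen generative encoder + lightweight regression head.
    `encoder` stands in for the pre-trained Diffusion Transformer encoder."""
    def __init__(self, encoder: nn.Module, feat_dim: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # freeze the generative backbone
            p.requires_grad = False
        self.head = nn.Sequential(            # only this head is fine-tuned
            nn.LayerNorm(feat_dim),
            nn.Linear(feat_dim, 256),
            nn.GELU(),
            nn.Linear(256, 1),                # scalar beauty score
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                 # no gradients into the backbone
            feats = self.encoder(x)           # assumed to return (B, feat_dim)
        return self.head(feats).squeeze(-1)

# Toy usage with a placeholder encoder; swap in the pre-trained DiT encoder.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 512))
model = DiffFBPRegressor(encoder, feat_dim=512)
scores = model(torch.randn(4, 3, 224, 224))  # -> shape (4,)
```

Freezing the backbone keeps the structure learned during denoising pre-training intact; only the small head adapts to the FBP5500 labels.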
Related papers
- Towards Scalable Pre-training of Visual Tokenizers for Generation [41.785568766118594]
We present a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods.
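As a rough illustration of "joint optimization of image-text contrastive, self-supervised, and reconstruction losses", a hedged sketch of such a combined objective follows; the individual loss choices (InfoNCE, feature matching, MSE) and the weights are our assumptions, not the paper's.

```python
import torch
import torch.nn.functional as F

def tokenizer_pretrain_loss(img_emb, txt_emb, student, teacher, recon, target,
                            w_clip=1.0, w_ssl=1.0, w_rec=1.0, temp=0.07):
    """Illustrative joint objective: contrastive + self-supervised + reconstruction."""
    # Image-text contrastive term (symmetric InfoNCE over the batch).
    logits = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).T / temp
    labels = torch.arange(logits.size(0), device=logits.device)
    l_clip = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
    # Self-supervised term: match features of a frozen teacher view (e.g., EMA teacher).
    l_ssl = F.mse_loss(student, teacher.detach())
    # Pixel reconstruction term from the tokenizer decoder.
    l_rec = F.mse_loss(recon, target)
    return w_clip * l_clip + w_ssl * l_ssl + w_rec * l_rec
```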
arXiv Detail & Related papers (2025-12-15T18:59:54Z)
- FairViT-GAN: A Hybrid Vision Transformer with Adversarial Debiasing for Fair and Explainable Facial Beauty Prediction [0.0]
We propose FairViT-GAN, a novel hybrid framework for facial beauty prediction. We show that FairViT-GAN sets a new state-of-the-art in predictive accuracy, achieving a Pearson Correlation of 0.9230 and reducing RMSE to 0.2650. Our analysis reveals a remarkable 82.9% reduction in the performance gap between ethnic subgroups, with the adversary's classification accuracy dropping to near-random chance (52.1%).
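The near-chance adversary accuracy quoted above is the usual goal of adversarial debiasing, commonly implemented with a gradient-reversal layer between the shared features and the subgroup classifier. The sketch below shows that generic mechanism; it is not FairViT-GAN's exact architecture, and all module sizes are placeholders.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips (and scales) gradients on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Illustrative heads over shared features of size (B, D): the beauty regressor
# trains normally, while the subgroup adversary sees reversed gradients,
# pushing the shared features to become group-invariant.
D = 512
regressor = nn.Linear(D, 1)        # predicts the beauty score
adversary = nn.Linear(D, 4)        # predicts a protected subgroup label

feats = torch.randn(8, D, requires_grad=True)
beauty = regressor(feats)
group_logits = adversary(grad_reverse(feats, lam=0.5))
```

Training both heads jointly drives the features toward being predictive of beauty but uninformative about the protected attribute.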
arXiv Detail & Related papers (2025-09-28T12:55:31Z)
- SynergyNet: Fusing Generative Priors and State-Space Models for Facial Beauty Prediction [0.0]
This paper introduces the Mamba-Diffusion Network (MD-Net), a novel dual-stream architecture for predicting facial beauty. MD-Net sets a new state-of-the-art, achieving a Pearson Correlation of 0.9235 and demonstrating the significant potential of hybrid architectures.
arXiv Detail & Related papers (2025-09-21T17:36:42Z)
- Scale-interaction transformer: a hybrid cnn-transformer model for facial beauty prediction [0.0]
We introduce the Scale-Interaction Transformer (SIT), a novel hybrid deep learning architecture that synergizes the feature extraction power of CNNs with the relational modeling capabilities of Transformers. We conduct extensive experiments on the widely-used SCUT-FBP5500 benchmark dataset, where the proposed SIT model establishes a new state-of-the-art. Our findings demonstrate that explicitly modeling the interplay between multi-scale visual cues is crucial for high-performance FBP.
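To make "multi-scale visual cues interacting via a Transformer" concrete, here is a minimal hybrid sketch under our own assumptions (ResNet-18 backbone, three pooling scales, two encoder layers); the actual SIT design may differ.

```python
import torch
import torch.nn as nn
import torchvision.models as tvm

class ScaleInteractionSketch(nn.Module):
    """Multi-scale CNN features as tokens for a small Transformer encoder."""
    def __init__(self, d_model=256):
        super().__init__()
        backbone = tvm.resnet18(weights=None)
        self.stem = nn.Sequential(*list(backbone.children())[:-2])  # conv features
        self.pools = nn.ModuleList(
            [nn.AdaptiveAvgPool2d(s) for s in (1, 2, 4)]            # three scales
        )
        self.proj = nn.Linear(512, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)                           # beauty score

    def forward(self, x):
        f = self.stem(x)                                  # (B, 512, H', W')
        tokens = torch.cat(
            [p(f).flatten(2).transpose(1, 2) for p in self.pools], dim=1
        )                                                 # (B, 1+4+16, 512)
        z = self.encoder(self.proj(tokens))               # scales interact here
        return self.head(z.mean(dim=1)).squeeze(-1)
```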
arXiv Detail & Related papers (2025-09-05T13:16:55Z)
- IN45023 Neural Network Design Patterns in Computer Vision Seminar Report, Summer 2025 [0.0]
This report analyzes the evolution of key design patterns in computer vision by examining six influential papers. We review ResNet, which introduced residual connections to overcome the vanishing gradient problem. We examine the Vision Transformer (ViT), which established a new paradigm by applying the Transformer architecture to sequences of image patches.
arXiv Detail & Related papers (2025-07-31T09:08:11Z)
- Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation [66.73899356886652]
We build an image tokenizer directly atop pre-trained vision foundation models. Our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality. It further boosts autoregressive (AR) generation, achieving a gFID of 2.07 on ImageNet benchmarks.
arXiv Detail & Related papers (2025-07-11T09:32:45Z)
- Efficient Generative Model Training via Embedded Representation Warmup [12.485320863366411]
Generative models face a fundamental challenge: they must simultaneously learn high-level semantic concepts and low-level synthesis details. We propose Embedded Representation Warmup, a principled two-phase training framework. Our framework achieves an 11.5× speedup in 350 epochs to reach FID=1.41 compared to single-phase methods like REPA.
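A hedged sketch of a two-phase schedule consistent with that summary: a warmup phase that aligns the generative model's features with a frozen pretrained representation, then ordinary generative training. The `features` and `generative_loss` methods and the MSE alignment target are hypothetical stand-ins, not ERW's actual recipe.

```python
import torch
import torch.nn.functional as F

def train_two_phase(model, rep_encoder, loader, opt, warmup_steps=1000):
    """Phase 1: align model features to a frozen pretrained representation.
    Phase 2: the plain generative objective. Both losses are placeholders."""
    step = 0
    for x in loader:
        opt.zero_grad()
        if step < warmup_steps:
            # Warmup: pull the model's intermediate features toward rep_encoder(x).
            feats = model.features(x)           # assumed hook/method on the model
            with torch.no_grad():
                target = rep_encoder(x)         # frozen representation target
            loss = F.mse_loss(feats, target)
        else:
            # Main phase: the usual denoising/generative loss (assumed method).
            loss = model.generative_loss(x)
        loss.backward()
        opt.step()
        step += 1
```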
arXiv Detail & Related papers (2025-04-14T12:43:17Z)
- SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation [81.36747103102459]
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Current state-of-the-art methods focus on training innovative architectural designs on confined datasets. We investigate the impact of scaling up EHPS towards a family of generalist foundation models.
arXiv Detail & Related papers (2025-01-16T18:59:46Z)
- Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
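A tiny sketch of the operation the VFPT summary names: pushing learnable prompt embeddings through a Fast Fourier Transform so the frozen model sees both spatial- and frequency-domain information. How the transformed prompts are combined with the originals (concatenation here) is our assumption.

```python
import torch
import torch.nn as nn

class FourierPrompt(nn.Module):
    """Learnable prompts augmented with their 2D FFT (real part): a sketch of
    the spatial+frequency idea; the exact combination rule is assumed."""
    def __init__(self, num_prompts=10, dim=768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, batch_size: int) -> torch.Tensor:
        fourier = torch.fft.fft2(self.prompts).real        # frequency-domain view
        mixed = torch.cat([self.prompts, fourier], dim=0)  # (2*num_prompts, dim)
        return mixed.unsqueeze(0).expand(batch_size, -1, -1)

# Prepend to the patch tokens of a frozen ViT: tokens = cat([prompts, patches], 1)
prompts = FourierPrompt()(batch_size=4)                    # (4, 20, 768)
```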
arXiv Detail & Related papers (2024-11-02T18:18:35Z)
- Research on Personalized Compression Algorithm for Pre-trained Models Based on Homomorphic Entropy Increase [2.6513322539118582]
We explore the challenges and evolution of two key technologies in current AI: the Vision Transformer model and the Large Language Model (LLM).
The Vision Transformer captures global information by splitting images into small patches, but its high parameter count and compute overhead limit deployment on mobile devices.
LLMs have revolutionized natural language processing, but they also face huge deployment challenges.
arXiv Detail & Related papers (2024-08-16T11:56:49Z)
- UniForensics: Face Forgery Detection via General Facial Representation [60.5421627990707]
High-level semantic features are less susceptible to perturbations and are not limited to forgery-specific artifacts, and thus generalize more strongly.
We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video network, with a meta-functional face classification for enriched facial representation.
arXiv Detail & Related papers (2024-07-26T20:51:54Z)
- Generalized Nested Latent Variable Models for Lossy Coding applied to Wind Turbine Scenarios [14.48369551534582]
A learning-based approach seeks to minimize the compromise between compression rate and reconstructed image quality.
A successful technique consists in introducing a deep hyperprior that operates within a 2-level nested latent variable model.
This paper extends this concept by designing a generalized L-level nested generative model with a Markov chain structure.
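Under a Markov-chain reading of that description, the generalized L-level nested latent variable model factorizes as below (our notation; for L = 2 this reduces to the standard hyperprior setup mentioned above):

```latex
% L-level nested latent variable model with Markov chain structure:
% each latent level conditions only on the level directly above it.
p(x, z_1, \dots, z_L) = p(x \mid z_1)
  \left[ \prod_{l=1}^{L-1} p(z_l \mid z_{l+1}) \right] p(z_L)
```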
arXiv Detail & Related papers (2024-06-10T11:00:26Z)
- Heterogeneous Federated Learning with Splited Language Model [22.65325348176366]
Federated Split Learning (FSL) is a promising distributed learning paradigm in practice.
In this paper, we harness Pre-trained Image Transformers (PITs) as the initial model, coined FedV, to accelerate the training process and improve model robustness.
We are the first to provide a systematic evaluation of FSL methods with PITs in real-world datasets, different partial device participations, and heterogeneous data splits.
arXiv Detail & Related papers (2024-03-24T07:33:08Z)
- Spatio-Temporal Encoding of Brain Dynamics with Surface Masked Autoencoders [10.097983222759884]
This work introduces the surface Masked AutoEncoder (sMAE) and its video extension, the video surface Masked AutoEncoder (vsMAE).
These models are trained to reconstruct cortical feature maps from masked versions of the input by learning strong latent representations of cortical development and structure-function.
Results show that (v)sMAE pre-trained models improve phenotyping prediction performance on multiple tasks by ≥ 26%, and offer faster convergence relative to models trained from scratch.
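A generic masked-autoencoding sketch of the training signal described above: reconstruct cortical feature maps from masked inputs and score the loss on masked positions only. The token layout, masking ratio, and the `encoder`/`decoder` interfaces are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(encoder, decoder, x, mask_ratio=0.75):
    """x: (B, N, C) tokens (e.g., cortical vertices/patches with C features).
    Mask a random subset, encode only the visible tokens, reconstruct all
    positions, and compute the loss on masked positions, MAE-style."""
    B, N, C = x.shape
    n_keep = int(N * (1 - mask_ratio))
    perm = torch.rand(B, N, device=x.device).argsort(dim=1)
    keep = perm[:, :n_keep]                              # visible token indices
    visible = torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, C))
    recon = decoder(encoder(visible), N)                 # assumed: returns (B, N, C)
    mask = torch.ones(B, N, dtype=torch.bool, device=x.device)
    mask.scatter_(1, keep, False)                        # True where masked
    return F.mse_loss(recon[mask], x[mask])
```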
arXiv Detail & Related papers (2023-08-10T10:01:56Z)
- VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation [33.41226268323332]
Egocentric action anticipation is a challenging task that aims to make advanced predictions of future actions in the first-person view.
Most existing methods focus on improving the model architecture and loss function based on the visual input and recurrent neural network.
We propose a novel action anticipation framework based on a Transformer-GRU architecture enhanced by visual-semantic fusion.
arXiv Detail & Related papers (2023-07-08T06:49:54Z)
- Towards Robust Blind Face Restoration with Codebook Lookup Transformer [94.48731935629066]
Blind face restoration is a highly ill-posed problem that often requires auxiliary guidance.
We show that a learned discrete codebook prior in a small proxy space casts blind face restoration as a code prediction task.
We propose a Transformer-based prediction network, named CodeFormer, to model global composition and context of the low-quality faces.
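A compact sketch of "restoration as code prediction": a Transformer maps degraded-face features to discrete codebook indices, and restoration proceeds from the looked-up high-quality entries. Dimensions, depth, and the greedy argmax decoding are placeholder choices, not CodeFormer's exact ones.

```python
import torch
import torch.nn as nn

class CodePredictorSketch(nn.Module):
    """Predict codebook indices from low-quality features, then decode from
    the looked-up entries; a sketch of restoration as code prediction."""
    def __init__(self, n_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)       # learned HQ prior
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.to_logits = nn.Linear(dim, n_codes)

    def forward(self, lq_tokens: torch.Tensor) -> torch.Tensor:
        """lq_tokens: (B, L, dim) features of the degraded face."""
        logits = self.to_logits(self.transformer(lq_tokens))  # (B, L, n_codes)
        idx = logits.argmax(dim=-1)                      # predicted code indices
        return self.codebook(idx)                        # (B, L, dim) HQ codes

# The returned codes would then feed a fixed decoder to render the restored face.
codes = CodePredictorSketch()(torch.randn(2, 256, 256))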
arXiv Detail & Related papers (2022-06-22T17:58:01Z)
- Blind Face Restoration: Benchmark Datasets and a Baseline Model [63.053331687284064]
Blind Face Restoration (BFR) aims to construct a high-quality (HQ) face image from its corresponding low-quality (LQ) input.
We first synthesize two blind face restoration benchmark datasets called EDFace-Celeb-1M (BFR128) and EDFace-Celeb-150K (BFR512).
State-of-the-art methods are benchmarked on them under five settings: blur, noise, low resolution, JPEG compression artifacts, and the combination of them (full degradation).
arXiv Detail & Related papers (2022-06-08T06:34:24Z)
- ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers [31.908276711898548]
Methods for data-efficient recognition from body poses increasingly leverage skeleton sequences structured as image-like arrays.
We look at this paradigm from the perspective of transformer networks, for the first time exploring visual transformers as data-efficient encoders of skeleton movement.
In our pipeline, body pose sequences cast as image-like representations are converted into patch embeddings and then passed to a visual transformer backbone optimized with deep metric learning.
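A small sketch of the input pipeline described above: a skeleton sequence treated as an image-like array and split into ViT-style patch embeddings. Shapes and the patching rule are illustrative; a real module would hold the projection as a persistent layer rather than building it per call.

```python
import torch
import torch.nn as nn

def pose_to_patch_embeddings(seq: torch.Tensor, patch=8, dim=256) -> torch.Tensor:
    """seq: (B, T, J, C) skeleton sequence (frames x joints x coordinates).
    Treat it as a C-channel image of size T x J and patchify, ViT-style."""
    B, T, J, C = seq.shape
    img = seq.permute(0, 3, 1, 2)                              # (B, C, T, J) "image"
    conv = nn.Conv2d(C, dim, kernel_size=patch, stride=patch)  # patch embedding
    tokens = conv(img).flatten(2).transpose(1, 2)              # (B, num_patches, dim)
    return tokens

tokens = pose_to_patch_embeddings(torch.randn(2, 64, 24, 3))   # -> (2, 24, 256)
```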
arXiv Detail & Related papers (2022-02-23T11:11:54Z)