SynergyNet: Fusing Generative Priors and State-Space Models for Facial Beauty Prediction
- URL: http://arxiv.org/abs/2509.17172v1
- Date: Sun, 21 Sep 2025 17:36:42 GMT
- Title: SynergyNet: Fusing Generative Priors and State-Space Models for Facial Beauty Prediction
- Authors: Djamel Eddine Boukhari,
- Abstract summary: This paper introduces the Mamba-Diffusion Network (MD-Net), a novel dual-stream architecture for predicting facial beauty. MD-Net sets a new state-of-the-art, achieving a Pearson Correlation of 0.9235 and demonstrating the significant potential of hybrid architectures.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The automated prediction of facial beauty is a benchmark task in affective computing that requires a sophisticated understanding of both local aesthetic details (e.g., skin texture) and global facial harmony (e.g., symmetry, proportions). Existing models, based on either Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs), exhibit inherent architectural biases that limit their performance; CNNs excel at local feature extraction but struggle with long-range dependencies, while ViTs model global relationships at a significant computational cost. This paper introduces the Mamba-Diffusion Network (MD-Net), a novel dual-stream architecture that resolves this trade-off by delegating specialized roles to state-of-the-art models. The first stream leverages a frozen U-Net encoder from a pre-trained latent diffusion model, providing a powerful generative prior for fine-grained aesthetic qualities. The second stream employs a Vision Mamba (Vim), a modern state-space model, to efficiently capture global facial structure with linear-time complexity. By synergistically integrating these complementary representations through a cross-attention mechanism, MD-Net creates a holistic and nuanced feature space for prediction. Evaluated on the SCUT-FBP5500 benchmark, MD-Net sets a new state-of-the-art, achieving a Pearson Correlation of 0.9235 and demonstrating the significant potential of hybrid architectures that fuse generative and sequential modeling paradigms for complex visual assessment tasks.
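The abstract does not come with code; as a rough illustration of the cross-attention fusion step it describes, a minimal single-head sketch follows. All shapes, token counts, and variable names here are illustrative assumptions, and the learned query/key/value projections of a real attention layer are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Tokens from one stream attend over tokens from the other
    # stream (single head, projections omitted for brevity).
    d_k = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each query row sums to 1
    return weights @ context

rng = np.random.default_rng(0)
diffusion_tokens = rng.standard_normal((49, 64))   # stand-in for frozen U-Net features
mamba_tokens = rng.standard_normal((196, 64))      # stand-in for Vim global features
fused = cross_attention(diffusion_tokens, mamba_tokens)
print(fused.shape)
```

In the actual MD-Net, one would expect learned multi-head projections and a subsequent regression head on the fused features; this sketch only shows how one stream's representation can be re-expressed as a weighted combination of the other's.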
Related papers
- Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders [74.72147962028265]
Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet. We investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation.
arXiv Detail & Related papers (2026-01-22T18:58:16Z) - Future Optical Flow Prediction Improves Robot Control & Video Generation [100.87884718953099]
We introduce FOFPred, a novel optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred.
arXiv Detail & Related papers (2026-01-15T18:49:48Z) - Semantics and Content Matter: Towards Multi-Prior Hierarchical Mamba for Image Deraining [95.00432497331583]
We propose the Multi-Prior Hierarchical Mamba (MPHM) network for image deraining. MPHM integrates macro-semantic textual priors (CLIP) for task-level semantic guidance and micro-structural visual priors (DINOv2) for scene-aware structural information. Experiments demonstrate MPHM's state-of-the-art performance, achieving a 0.57 dB PSNR gain on the Rain200H dataset.
arXiv Detail & Related papers (2025-11-17T08:08:59Z) - VM-BeautyNet: A Synergistic Ensemble of Vision Transformer and Mamba for Facial Beauty Prediction [0.0]
This paper introduces a novel, heterogeneous ensemble architecture, VM-BeautyNet, that fuses the complementary strengths of a Vision Transformer and a Mamba-based Vision model. Our proposed VM-BeautyNet achieves state-of-the-art performance, with a Pearson Correlation (PC) of 0.9212, a Mean Absolute Error (MAE) of 0.2085, and a Root Mean Square Error (RMSE) of 0.2698.
arXiv Detail & Related papers (2025-10-17T21:10:46Z) - FairViT-GAN: A Hybrid Vision Transformer with Adversarial Debiasing for Fair and Explainable Facial Beauty Prediction [0.0]
We propose FairViT-GAN, a novel hybrid framework for facial beauty prediction. We show that FairViT-GAN sets a new state-of-the-art in predictive accuracy, achieving a Pearson Correlation of 0.9230 and reducing RMSE to 0.2650. Our analysis reveals a remarkable 82.9% reduction in the performance gap between ethnic subgroups, with the adversary's classification accuracy dropping to near-random chance (52.1%).
arXiv Detail & Related papers (2025-09-28T12:55:31Z) - Scale-interaction transformer: a hybrid cnn-transformer model for facial beauty prediction [0.0]
We introduce the Scale-Interaction Transformer (SIT), a novel hybrid deep learning architecture that synergizes the feature extraction power of CNNs with the relational modeling capabilities of Transformers. We conduct extensive experiments on the widely-used SCUT-FBP5500 benchmark dataset, where the proposed SIT model establishes a new state-of-the-art. Our findings demonstrate that explicitly modeling the interplay between multi-scale visual cues is crucial for high-performance FBP.
arXiv Detail & Related papers (2025-09-05T13:16:55Z) - Towards Efficient General Feature Prediction in Masked Skeleton Modeling [59.46799426434277]
We propose a novel General Feature Prediction framework (GFP) for efficient masked skeleton modeling. Our key innovation is replacing conventional low-level reconstruction with high-level feature prediction that spans from local motion patterns to global semantic representations.
arXiv Detail & Related papers (2025-09-03T18:05:02Z) - Mamba-CNN: A Hybrid Architecture for Efficient and Accurate Facial Beauty Prediction [0.0]
We propose Mamba-CNN, a novel and efficient hybrid architecture. Mamba-CNN integrates a lightweight, Mamba-inspired State Space Model (SSM) gating mechanism into a hierarchical convolutional backbone. Our findings validate the synergistic potential of combining CNNs with selective SSMs and present a powerful new architectural paradigm for nuanced visual understanding tasks.
arXiv Detail & Related papers (2025-09-01T12:42:04Z) - Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation [52.261584726401686]
We present a novel direction to build an image tokenizer directly on top of a frozen vision foundation model. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality.
arXiv Detail & Related papers (2025-07-11T09:32:45Z) - RD-UIE: Relation-Driven State Space Modeling for Underwater Image Enhancement [59.364418120895]
Underwater image enhancement (UIE) is a critical preprocessing step for marine vision applications. We develop a novel relation-driven Mamba framework for effective UIE (RD-UIE). Experiments on underwater enhancement benchmarks demonstrate that RD-UIE outperforms the state-of-the-art approach WMamba.
arXiv Detail & Related papers (2025-05-02T12:21:44Z) - HFMF: Hierarchical Fusion Meets Multi-Stream Models for Deepfake Detection [4.908389661988192]
HFMF is a comprehensive two-stage deepfake detection framework. It integrates vision Transformers and convolutional nets through a hierarchical feature fusion mechanism. We demonstrate that our architecture achieves superior performance across diverse dataset benchmarks.
arXiv Detail & Related papers (2025-01-10T00:20:29Z) - MambaVT: Spatio-Temporal Contextual Modeling for robust RGB-T Tracking [51.28485682954006]
We propose a pure Mamba-based framework (MambaVT) to fully exploit spatio-temporal contextual modeling for robust visible-thermal tracking.
Specifically, we devise the long-range cross-frame integration component to globally adapt to target appearance variations.
Experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks.
arXiv Detail & Related papers (2024-08-15T02:29:00Z) - DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework [2.187990941788468]
DiM-Gesture is a generative model crafted to create highly personalized 3D full-body gestures solely from raw speech audio.
The model integrates a Mamba-based fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture.
arXiv Detail & Related papers (2024-08-01T08:22:47Z) - Deep Tensor Network [9.910562011343009]
We introduce the Deep Tensor Network, a new architectural framework that reformulates attention by unifying the expressive power of tensor algebra with neural network design. Our approach moves beyond both conventional dot-product attention and subsequent linear-time approximations to capture higher-order statistical dependencies.
arXiv Detail & Related papers (2023-11-18T14:41:33Z) - TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition [63.93802691275012]
We propose a lightweight Dual Dynamic Token Mixer (D-Mixer) to simultaneously learn global and local dynamics. We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network. On ImageNet-1K classification, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half the computational cost.
arXiv Detail & Related papers (2023-10-30T09:35:56Z) - Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z) - Deep Autoencoding Topic Model with Scalable Hybrid Bayesian Inference [55.35176938713946]
We develop deep autoencoding topic model (DATM) that uses a hierarchy of gamma distributions to construct its multi-stochastic-layer generative network.
We propose a Weibull upward-downward variational encoder that deterministically propagates information upward via a deep neural network, followed by a downward generative model.
The efficacy and scalability of our models are demonstrated on both unsupervised and supervised learning tasks on big corpora.
arXiv Detail & Related papers (2020-06-15T22:22:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.