VM-BeautyNet: A Synergistic Ensemble of Vision Transformer and Mamba for Facial Beauty Prediction
- URL: http://arxiv.org/abs/2510.16220v1
- Date: Fri, 17 Oct 2025 21:10:46 GMT
- Title: VM-BeautyNet: A Synergistic Ensemble of Vision Transformer and Mamba for Facial Beauty Prediction
- Authors: Djamel Eddine Boukhari
- Abstract summary: This paper introduces a novel, heterogeneous ensemble architecture, VM-BeautyNet, that fuses the complementary strengths of a Vision Transformer and a Mamba-based vision model.
Our proposed VM-BeautyNet achieves state-of-the-art performance, with a Pearson Correlation (PC) of 0.9212, a Mean Absolute Error (MAE) of 0.2085, and a Root Mean Square Error (RMSE) of 0.2698.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Facial Beauty Prediction (FBP) is a complex and challenging computer vision task, aiming to model the subjective and intricate nature of human aesthetic perception. While deep learning models, particularly Convolutional Neural Networks (CNNs), have made significant strides, they often struggle to capture the global, holistic facial features that are critical to human judgment. Vision Transformers (ViT) address this by effectively modeling long-range spatial relationships, but their quadratic complexity can be a bottleneck. This paper introduces a novel, heterogeneous ensemble architecture, VM-BeautyNet, that synergistically fuses the complementary strengths of a Vision Transformer and a Mamba-based vision model, a recent advancement in State-Space Models (SSMs). The ViT backbone excels at capturing global facial structure and symmetry, while the Mamba backbone efficiently models long-range dependencies with linear complexity, focusing on sequential features and textures. We evaluate our approach on the benchmark SCUT-FBP5500 dataset. Our proposed VM-BeautyNet achieves state-of-the-art performance, with a Pearson Correlation (PC) of 0.9212, a Mean Absolute Error (MAE) of 0.2085, and a Root Mean Square Error (RMSE) of 0.2698. Furthermore, through Grad-CAM visualizations, we provide interpretability analysis that confirms the complementary feature extraction of the two backbones, offering new insights into the model's decision-making process and presenting a powerful new architectural paradigm for computational aesthetics.
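To make the described fusion concrete, here is a minimal PyTorch sketch of the ensemble, assuming late fusion by feature concatenation followed by a small regression head. The Mamba branch is a stand-in (patchify plus a recurrent scan), since the abstract does not specify backbone configurations; the names `VMBeautyNetSketch` and `MambaBranchStub` are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

class MambaBranchStub(nn.Module):
    """Placeholder for the Mamba-based vision backbone (hypothetical)."""
    def __init__(self, dim=768):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.scan = nn.GRU(dim, dim, batch_first=True)  # stand-in for a selective scan

    def forward(self, x):
        tokens = self.patchify(x).flatten(2).transpose(1, 2)  # (B, 196, dim)
        out, _ = self.scan(tokens)
        return out.mean(dim=1)  # pooled global feature

class VMBeautyNetSketch(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.vit = vit_b_16(weights=None)
        self.vit.heads = nn.Identity()  # expose the 768-d CLS feature
        self.mamba = MambaBranchStub(dim)
        self.head = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x):
        fused = torch.cat([self.vit(x), self.mamba(x)], dim=-1)  # late fusion
        return self.head(fused).squeeze(-1)  # scalar beauty score per image

model = VMBeautyNetSketch()
scores = model(torch.randn(2, 3, 224, 224))  # shape (2,)
```

The same pattern (two frozen or jointly trained backbones, concatenated features, shallow head) also covers the CNN+ViT hybrids listed among the related papers below.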
Related papers
- RISE-Video: Can Video Generators Decode Implicit World Rules? [71.92434352963427]
We present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis.
RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories.
We propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment.
arXiv Detail & Related papers (2026-02-05T18:36:10Z)
- Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory [101.2076718776139]
We propose a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments.
We introduce a hierarchical Pose-free Memory (HPMC) that distills historical latents into a fixed-budget geometric representation.
We also propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic.
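The tri-state discretization can be illustrated with a toy sketch; the dead-zone threshold and function name below are assumptions, not details from the paper, and the uncertainty-aware component is omitted.

```python
# Hypothetical tri-state action labeling: continuous per-frame motion is
# mapped to {-1, 0, +1} via a dead-zone threshold (threshold value assumed).
import torch

def tri_state_labels(motion: torch.Tensor, dead_zone: float = 0.05) -> torch.Tensor:
    labels = torch.zeros_like(motion, dtype=torch.long)
    labels[motion > dead_zone] = 1
    labels[motion < -dead_zone] = -1
    return labels

print(tri_state_labels(torch.tensor([0.3, -0.01, -0.4])))  # tensor([ 1,  0, -1])
```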
arXiv Detail & Related papers (2026-02-02T17:52:56Z)
- UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass [83.7071371474926]
UniSH is a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction.
Our framework bridges strong, disparate priors from scene reconstruction and human mesh recovery (HMR).
Our model achieves state-of-the-art performance on human-centric scene reconstruction.
arXiv Detail & Related papers (2026-01-03T16:06:27Z)
- Integrating ConvNeXt and Vision Transformers for Enhancing Facial Age Estimation [10.995974662579124]
We present a novel hybrid architecture that combines convolutional neural networks (CNNs) with Vision Transformers (ViT).
Our proposed ConvNeXt-ViT hybrid solution was thoroughly evaluated on benchmark age estimation datasets.
The results show that the ConvNeXt-ViT hybrid not only outperforms traditional methods, but also provides a robust foundation for future advances in age estimation.
arXiv Detail & Related papers (2025-10-31T09:36:28Z)
- FairViT-GAN: A Hybrid Vision Transformer with Adversarial Debiasing for Fair and Explainable Facial Beauty Prediction [0.0]
We propose FairViT-GAN, a novel hybrid framework for facial beauty prediction.
We show that FairViT-GAN sets a new state-of-the-art in predictive accuracy, achieving a Pearson Correlation of 0.9230 and reducing RMSE to 0.2650.
Our analysis reveals a remarkable 82.9% reduction in the performance gap between ethnic subgroups, with the adversary's classification accuracy dropping to near-random chance (52.1%).
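Adversarial debiasing of this kind is commonly implemented with a gradient reversal layer (GRL); the summary does not say FairViT-GAN uses one, so the sketch below only illustrates how an adversary's subgroup-classification gradient can be reversed into the shared backbone.

```python
# Hedged sketch of adversarial debiasing via gradient reversal (an
# assumption, not the confirmed FairViT-GAN mechanism).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None  # flip the adversary's gradient

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 256), nn.ReLU())
beauty_head = nn.Linear(256, 1)   # regression target
adversary = nn.Linear(256, 2)     # tries to predict the protected subgroup

x, score, group = torch.randn(4, 3, 224, 224), torch.rand(4), torch.randint(0, 2, (4,))
feat = backbone(x)
loss = nn.functional.mse_loss(beauty_head(feat).squeeze(-1), score) \
     + nn.functional.cross_entropy(adversary(GradReverse.apply(feat, 1.0)), group)
loss.backward()  # backbone receives reversed gradients from the adversary
```

Training the adversary to succeed while the backbone is pushed to make it fail is what drives its accuracy toward the near-random 52.1% reported above.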
arXiv Detail & Related papers (2025-09-28T12:55:31Z)
- SynergyNet: Fusing Generative Priors and State-Space Models for Facial Beauty Prediction [0.0]
This paper introduces the Mamba-Diffusion Network (MD-Net), a novel dual-stream architecture for predicting facial beauty.
MD-Net sets a new state-of-the-art, achieving a Pearson Correlation of 0.9235 and demonstrating the significant potential of hybrid architectures.
arXiv Detail & Related papers (2025-09-21T17:36:42Z)
- Scale-Interaction Transformer: A Hybrid CNN-Transformer Model for Facial Beauty Prediction [0.0]
We introduce the Scale-Interaction Transformer (SIT), a novel hybrid deep learning architecture that synergizes the feature extraction power of CNNs with the relational modeling capabilities of Transformers.
We conduct extensive experiments on the widely-used SCUT-FBP5500 benchmark dataset, where the proposed SIT model establishes a new state-of-the-art.
Our findings demonstrate that explicitly modeling the interplay between multi-scale visual cues is crucial for high-performance FBP.
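One plausible reading of "scale interaction" is to pool CNN features from several stages into a short token sequence and mix them with a Transformer encoder, as in this sketch; the ResNet-18 stages, dimensions, and class name are assumptions rather than the published SIT configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ScaleInteractionSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        r = resnet18(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stages = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        self.proj = nn.ModuleList([nn.Linear(c, dim) for c in (64, 128, 256, 512)])
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.mixer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)

    def forward(self, x):
        x, tokens = self.stem(x), []
        for stage, proj in zip(self.stages, self.proj):
            x = stage(x)
            tokens.append(proj(x.mean(dim=(2, 3))))  # one pooled token per scale
        mixed = self.mixer(torch.stack(tokens, dim=1))  # attention across scales
        return self.head(mixed.mean(dim=1)).squeeze(-1)
```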
arXiv Detail & Related papers (2025-09-05T13:16:55Z)
- Mamba-CNN: A Hybrid Architecture for Efficient and Accurate Facial Beauty Prediction [0.0]
We propose Mamba-CNN, a novel and efficient hybrid architecture.
Mamba-CNN integrates a lightweight, Mamba-inspired State Space Model (SSM) gating mechanism into a hierarchical convolutional backbone.
Our findings validate the synergistic potential of combining CNNs with selective SSMs and present a powerful new architectural paradigm for nuanced visual understanding tasks.
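A rough sketch of what an SSM-inspired gate over a convolutional feature map could look like, assuming a fixed linear recurrence (an exponential moving average over raster-scanned positions) as the scan; the real Mamba-CNN block is selective and certainly differs in detail.

```python
import torch
import torch.nn as nn

def ema_scan(x, alpha=0.9):
    """Fixed linear state-space recurrence h_t = alpha*h_{t-1} + (1-alpha)*x_t."""
    h, out = torch.zeros_like(x[:, 0]), []
    for t in range(x.size(1)):
        h = alpha * h + (1 - alpha) * x[:, t]
        out.append(h)
    return torch.stack(out, dim=1)

class SSMGatedConvBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_gate = nn.Linear(channels, channels)
        self.norm = nn.BatchNorm2d(channels)

    def forward(self, x):
        y = self.conv(x)
        b, c, h, w = y.shape
        seq = y.flatten(2).transpose(1, 2)            # (B, H*W, C), raster order
        gate = torch.sigmoid(self.to_gate(ema_scan(seq)))
        y = y * gate.transpose(1, 2).reshape(b, c, h, w)
        return torch.relu(self.norm(y)) + x           # gated residual update

out = SSMGatedConvBlock(32)(torch.randn(2, 32, 16, 16))  # shape preserved
```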
arXiv Detail & Related papers (2025-09-01T12:42:04Z)
- DAMamba: Vision State Space Model with Dynamic Adaptive Scan [51.81060691414399]
State space models (SSMs) have recently garnered significant attention in computer vision.
We propose Dynamic Adaptive Scan (DAS), a data-driven method that adaptively allocates scanning orders and regions.
Based on DAS, we propose the vision backbone DAMamba, which significantly outperforms current state-of-the-art vision Mamba models in vision tasks.
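The data-driven scan-order idea can be sketched by ranking tokens with a tiny scorer, scanning them in the learned order, and scattering results back to their original positions. Note that argsort is not differentiable with respect to the scores, so this omits however DAS actually learns the ordering.

```python
import torch
import torch.nn as nn

def adaptive_scan(tokens: torch.Tensor, scorer: nn.Module, scan: nn.Module):
    scores = scorer(tokens).squeeze(-1)              # (B, N) token priorities
    order = scores.argsort(dim=1, descending=True)   # learned scan order
    idx = order.unsqueeze(-1).expand_as(tokens)
    scanned, _ = scan(tokens.gather(1, idx))         # process in ranked order
    out = torch.empty_like(scanned)
    out.scatter_(1, idx, scanned)                    # restore original layout
    return out

tokens = torch.randn(2, 196, 64)
out = adaptive_scan(tokens, nn.Linear(64, 1), nn.GRU(64, 64, batch_first=True))
```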
arXiv Detail & Related papers (2025-02-18T08:12:47Z)
- MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes [35.16430027877207]
MOVIS aims to enhance the structural awareness of the view-conditioned diffusion model for multi-object NVS.
We introduce an auxiliary task requiring the model to simultaneously predict novel view object masks.
Our method exhibits strong generalization capabilities and produces consistent novel view synthesis.
arXiv Detail & Related papers (2024-12-16T05:23:45Z)
- Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation [60.80423207808076]
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation.
We propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations.
We build HRVMamba, a novel model for efficient high-resolution representation learning.
arXiv Detail & Related papers (2024-10-04T06:19:29Z)
- MambaVT: Spatio-Temporal Contextual Modeling for robust RGB-T Tracking [51.28485682954006]
We propose a pure Mamba-based framework (MambaVT) to fully exploit spatio-temporal contextual modeling for robust visible-thermal tracking.
Specifically, we devise the long-range cross-frame integration component to globally adapt to target appearance variations.
Experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks.
arXiv Detail & Related papers (2024-08-15T02:29:00Z)
- Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms [91.19304518033144]
We aim to align vision models with human aesthetic standards in a retrieval system.
We propose a preference-based reinforcement learning method that fine-tunes vision models to better align them with human aesthetics.
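Preference-based fine-tuning is often driven by a Bradley-Terry pairwise objective; the paper's exact RL procedure is not given here, so the sketch below only shows how a preference pair produces a training signal for a toy aesthetic scorer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

scorer = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1))  # toy aesthetic scorer
opt = torch.optim.Adam(scorer.parameters(), lr=1e-4)

preferred, rejected = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
margin = scorer(preferred) - scorer(rejected)
loss = -F.logsigmoid(margin).mean()  # maximize P(preferred ranked above rejected)
opt.zero_grad(); loss.backward(); opt.step()
```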
arXiv Detail & Related papers (2024-06-13T17:59:20Z)
- VMamba: Visual State Space Model [98.0517369083152]
We adapt Mamba, a state-space language model, into VMamba, a vision backbone with linear time complexity.
At the core of VMamba is a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module.
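The SS2D cross-scan can be approximated by unfolding the feature map along four routes (row-major and column-major, each forward and reversed), running a 1D sequence model over each, and merging the results; the scan here is a placeholder (a GRU), not the actual selective-scan kernel.

```python
import torch
import torch.nn as nn

def ss2d_sketch(x: torch.Tensor, scan: nn.Module) -> torch.Tensor:
    b, c, h, w = x.shape
    routes = [
        x.flatten(2),                           # row-major
        x.flatten(2).flip(-1),                  # row-major, reversed
        x.transpose(2, 3).flatten(2),           # column-major
        x.transpose(2, 3).flatten(2).flip(-1),  # column-major, reversed
    ]
    outs = []
    for i, r in enumerate(routes):
        y, _ = scan(r.transpose(1, 2))          # 1D scan over (B, H*W, C)
        y = y.transpose(1, 2)
        if i % 2 == 1:
            y = y.flip(-1)                      # undo the reversal
        if i >= 2:
            y = y.reshape(b, c, w, h).transpose(2, 3).flatten(2)  # undo transpose
        outs.append(y)
    return sum(outs).reshape(b, c, h, w)        # merge the four routes

out = ss2d_sketch(torch.randn(2, 16, 8, 8), nn.GRU(16, 16, batch_first=True))
```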
arXiv Detail & Related papers (2024-01-18T17:55:39Z)
- Lightweight Vision Transformer with Bidirectional Interaction [59.39874544410419]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers to model local and global information.
Based on FASA, we develop a family of lightweight vision backbones, the Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z)