Related papers: PaCo-FR: Patch-Pixel Aligned End-to-End Codebook Learning for Facial Representation Pre-training

URL: http://arxiv.org/abs/2508.09691v1
Date: Wed, 13 Aug 2025 10:37:41 GMT
Title: PaCo-FR: Patch-Pixel Aligned End-to-End Codebook Learning for Facial Representation Pre-training
Authors: Yin Xie, Zhichao Chen, Xiaoze Yu, Yongle Zhao, Xiang An, Kaicheng Yang, Zimin Ran, Jia Guo, Ziyong Feng, Jiankang Deng
Abstract summary: PaCo-FR is an unsupervised framework that combines masked image modeling with patch-pixel alignment. PaCo-FR achieves state-of-the-art performance across several facial analysis tasks with just 2 million unlabeled images for pre-training.
Score: 32.52750192639004
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Facial representation pre-training is crucial for tasks like facial recognition, expression analysis, and virtual reality. However, existing methods face three key challenges: (1) failing to capture distinct facial features and fine-grained semantics, (2) ignoring the spatial structure inherent to facial anatomy, and (3) inefficiently utilizing limited labeled data. To overcome these, we introduce PaCo-FR, an unsupervised framework that combines masked image modeling with patch-pixel alignment. Our approach integrates three innovative components: (1) a structured masking strategy that preserves spatial coherence by aligning with semantically meaningful facial regions, (2) a novel patch-based codebook that enhances feature discrimination with multiple candidate tokens, and (3) spatial consistency constraints that preserve geometric relationships between facial components. PaCo-FR achieves state-of-the-art performance across several facial analysis tasks with just 2 million unlabeled images for pre-training. Our method demonstrates significant improvements, particularly in scenarios with varying poses, occlusions, and lighting conditions. We believe this work advances facial representation learning and offers a scalable, efficient solution that reduces reliance on expensive annotated datasets, driving more effective facial analysis systems.

Related papers

Face-MakeUpV2: Facial Consistency Learning for Controllable Text-to-Image Generation [4.383815913901858]
Face-MakeUpV2 is a facial image generation model that aims to maintain consistency of face ID and physical characteristics with the reference image. In experiments, Face-MakeUpV2 achieves the best overall performance in terms of preserving face ID and maintaining physical consistency with the reference images.
arXiv Detail & Related papers (2025-10-17T09:31:08Z)
Discrete Facial Encoding: A Framework for Data-driven Facial Display Discovery [6.096726247356906]
We introduce Discrete Facial Encoding, an unsupervised, data-driven alternative in the form of a compact and interpretable dictionary of facial expressions. Our system consistently outperforms both FACS-based pipelines and strong image and video representation learning models. Our representation covers a wider variety of facial displays, highlighting its potential as a scalable and effective alternative to FACS for psychological and affective computing applications.
arXiv Detail & Related papers (2025-10-02T04:44:45Z)
From Large Angles to Consistent Faces: Identity-Preserving Video Generation via Mixture of Facial Experts [69.44297222099175]
We introduce a Mixture of Facial Experts (MoFE) that captures distinct but mutually reinforcing aspects of facial attributes. To mitigate dataset limitations, we have tailored a data processing pipeline centered on two key aspects: Face Constraints and Identity Consistency. We have curated and refined a Large Face Angles (LFA) dataset from existing open-source human video datasets.
arXiv Detail & Related papers (2025-08-13T04:10:16Z)
SketchYourSeg: Mask-Free Subjective Image Segmentation via Freehand Sketches [116.1810651297801]
SketchYourSeg establishes freehand sketches as a powerful query modality for subjective image segmentation. Our evaluations demonstrate superior performance over existing approaches across diverse benchmarks.
arXiv Detail & Related papers (2025-01-27T13:07:51Z)
Self-Supervised Facial Representation Learning with Facial Region Awareness [13.06996608324306]
Self-supervised pre-training has been proven to be effective in learning transferable representations that benefit various visual tasks. Recent efforts toward this goal are limited to treating each face image as a whole. We propose a novel self-supervised facial representation learning framework to learn consistent global and local facial representations.
arXiv Detail & Related papers (2024-03-04T15:48:56Z)
Text-Guided Face Recognition using Multi-Granularity Cross-Modal Contrastive Learning [0.0]
We introduce text-guided face recognition (TGFR) to analyze the impact of integrating facial attributes in the form of natural language descriptions. TGFR demonstrates remarkable improvements, particularly on low-quality images, over existing face recognition models.
arXiv Detail & Related papers (2023-12-14T22:04:22Z)
Effective Adapter for Face Recognition in the Wild [72.75516495170199]
We tackle the challenge of face recognition in the wild, where images often suffer from low quality and real-world distortions. Traditional approaches, which either train models directly on degraded images or on their enhanced counterparts produced by face restoration techniques, have proven ineffective. We propose an effective adapter for augmenting existing face recognition models trained on high-quality facial datasets.
arXiv Detail & Related papers (2023-12-04T08:55:46Z)
Improving Face Recognition from Caption Supervision with Multi-Granular Contextual Feature Aggregation [0.0]
We introduce caption-guided face recognition (CGFR) as a new framework to improve the performance of commercial-off-the-shelf (COTS) face recognition systems. We implement the proposed CGFR framework on two face recognition models (ArcFace and AdaFace) and evaluate its performance on the Multi-Modal CelebA-HQ dataset.
arXiv Detail & Related papers (2023-08-13T23:52:15Z)
Self-supervised Contrastive Learning of Multi-view Facial Expressions [9.949781365631557]
Facial expression recognition (FER) has emerged as an important component of human-computer interaction systems. We propose Contrastive Learning of Multi-view facial Expressions (CL-MEx) to exploit facial images captured simultaneously from different angles towards FER.
arXiv Detail & Related papers (2021-08-15T11:23:34Z)
Learning to Aggregate and Personalize 3D Face from In-the-Wild Photo Collection [65.92058628082322]
Non-parametric face modeling aims to reconstruct 3D faces from images alone, without shape assumptions. This paper presents a novel Learning to Aggregate and Personalize framework for unsupervised robust 3D face modeling.
arXiv Detail & Related papers (2021-06-15T03:10:17Z)
Towards NIR-VIS Masked Face Recognition [47.00916333095693]
Near-infrared to visible (NIR-VIS) face recognition is the most common case in heterogeneous face recognition. We propose a novel training method to maximize the mutual information shared by the face representations of the two domains. In addition, a 3D face reconstruction based approach is employed to synthesize masked faces from existing NIR images.
arXiv Detail & Related papers (2021-04-14T10:40:09Z)
Learning Oracle Attention for High-fidelity Face Completion [121.72704525675047]
We design a comprehensive framework for face completion based on the U-Net structure. We propose a dual spatial attention module to efficiently learn the correlations between facial textures at multiple scales. We take the location of the facial components as prior knowledge and impose a multi-discriminator on these regions.
arXiv Detail & Related papers (2020-03-31T01:37:10Z)
Dual-Attention GAN for Large-Pose Face Frontalization [59.689836951934694]
We present a novel Dual-Attention Generative Adversarial Network (DA-GAN) for photo-realistic face frontalization. Specifically, a self-attention-based generator is introduced to integrate local features with their long-range dependencies. A novel face-attention-based discriminator is applied to emphasize local features of face regions.
arXiv Detail & Related papers (2020-02-17T20:00:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.