DisFaceRep: Representation Disentanglement for Co-occurring Facial Components in Weakly Supervised Face Parsing
- URL: http://arxiv.org/abs/2508.01250v1
- Date: Sat, 02 Aug 2025 08:02:06 GMT
- Title: DisFaceRep: Representation Disentanglement for Co-occurring Facial Components in Weakly Supervised Face Parsing
- Authors: Xiaoqin Wang, Xianxu Hou, Meidan Ding, Junliang Chen, Kaijun Deng, Jinheng Xie, Linlin Shen
- Abstract summary: We present Weakly Supervised Face Parsing (WSFP), a new task setting that performs dense facial component segmentation using only weak supervision. WSFP introduces unique challenges due to the high co-occurrence and visual similarity of facial components. We propose DisFaceRep, a representation disentanglement framework designed to separate co-occurring facial components.
- Score: 40.41814863928577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Face parsing aims to segment facial images into key components such as eyes, lips, and eyebrows. While existing methods rely on dense pixel-level annotations, such annotations are expensive and labor-intensive to obtain. To reduce annotation cost, we introduce Weakly Supervised Face Parsing (WSFP), a new task setting that performs dense facial component segmentation using only weak supervision, such as image-level labels and natural language descriptions. WSFP introduces unique challenges due to the high co-occurrence and visual similarity of facial components, which lead to ambiguous activations and degraded parsing performance. To address this, we propose DisFaceRep, a representation disentanglement framework designed to separate co-occurring facial components through both explicit and implicit mechanisms. Specifically, we introduce a co-occurring component disentanglement strategy to explicitly reduce dataset-level bias, and a text-guided component disentanglement loss to guide component separation using language supervision implicitly. Extensive experiments on CelebAMask-HQ, LaPa, and Helen demonstrate the difficulty of WSFP and the effectiveness of DisFaceRep, which significantly outperforms existing weakly supervised semantic segmentation methods. The code will be released at https://github.com/CVI-SZU/DisFaceRep.
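The abstract describes a text-guided component disentanglement loss that aligns component features with language supervision, but gives no formula. As a rough illustration only (the names `text_guided_disentangle_loss`, `component_feats`, and `text_embeds`, and the InfoNCE-style contrastive form, are assumptions, not the paper's actual loss), such an objective might pull each component's visual feature toward its own text embedding while pushing it away from the embeddings of co-occurring components:

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def text_guided_disentangle_loss(component_feats, text_embeds, tau=0.1):
    """Hypothetical InfoNCE-style disentanglement loss: each facial-component
    feature should be most similar to its own text embedding and dissimilar
    to the embeddings of other (co-occurring) components."""
    loss = 0.0
    for i, feat in enumerate(component_feats):
        sims = [cosine(feat, t) / tau for t in text_embeds]
        # Numerically stable log-sum-exp over all text embeddings.
        m = max(sims)
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
        loss += log_denom - sims[i]  # -log softmax at the matching index
    return loss / len(component_feats)
```

When component features line up with their matching text embeddings the loss is near zero; mismatched pairings are penalized, which is one plausible way language supervision could implicitly separate visually similar components.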
Related papers
- Uncertainty-Guided Face Matting for Occlusion-Aware Face Transformation [1.4195677954898822]
Face filters have become a key element of short-form video content, enabling a wide array of visual effects such as stylization and face swapping. We introduce the novel task of face matting, which estimates fine-grained alpha mattes to separate occluding elements from facial regions. We present FaceMat, a trimap-free, uncertainty-aware framework that predicts high-quality alpha mattes under complex occlusions.
arXiv Detail & Related papers (2025-08-05T04:00:14Z)
- Vision-Aware Text Features in Referring Image Segmentation: From Object Understanding to Context Understanding [26.768147543628096]
We propose a novel framework that emphasizes object and context comprehension inspired by human cognitive processes.
Our method achieves significant performance improvements on three benchmark datasets.
arXiv Detail & Related papers (2024-04-12T16:38:48Z)
- Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision [87.15580604023555]
Unpair-Seg is a novel weakly-supervised open-vocabulary segmentation framework.
It learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected.
It achieves 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets.
arXiv Detail & Related papers (2024-02-14T06:01:44Z)
- Mask Grounding for Referring Image Segmentation [42.69973300692365]
Referring Image Segmentation (RIS) is a challenging task that requires an algorithm to segment objects referred to by free-form language expressions.
Most state-of-the-art (SOTA) methods still suffer from a considerable language-image modality gap at the pixel and word level.
We introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features.
arXiv Detail & Related papers (2023-12-19T14:34:36Z)
- Improving Face Recognition from Caption Supervision with Multi-Granular Contextual Feature Aggregation [0.0]
We introduce caption-guided face recognition (CGFR) as a new framework to improve the performance of commercial-off-the-shelf (COTS) face recognition systems.
We implement the proposed CGFR framework on two face recognition models (ArcFace and AdaFace) and evaluate its performance on the Multi-Modal CelebA-HQ dataset.
arXiv Detail & Related papers (2023-08-13T23:52:15Z)
- UIA-ViT: Unsupervised Inconsistency-Aware Method based on Vision Transformer for Face Forgery Detection [52.91782218300844]
We propose a novel Unsupervised Inconsistency-Aware method based on Vision Transformer, called UIA-ViT.
Due to the self-attention mechanism, the attention map among patch embeddings naturally represents the consistency relation, making the Vision Transformer suitable for consistency representation learning.
arXiv Detail & Related papers (2022-10-23T15:24:47Z)
- Emotion Separation and Recognition from a Facial Expression by Generating the Poker Face with Vision Transformers [57.1091606948826]
We propose a novel FER model, named Poker Face Vision Transformer or PF-ViT, to address these challenges.
PF-ViT aims to separate and recognize the disturbance-agnostic emotion from a static facial image via generating its corresponding poker face.
PF-ViT utilizes vanilla Vision Transformers, and its components are pre-trained as Masked Autoencoders on a large facial expression dataset.
arXiv Detail & Related papers (2022-07-22T13:39:06Z)
- Weakly-supervised segmentation of referring expressions [81.73850439141374]
Text grounded semantic SEGmentation (TSEG) learns segmentation masks directly from image-level referring expressions without pixel-level annotations.
Our approach demonstrates promising results for weakly-supervised referring expression segmentation on the PhraseCut and RefCOCO datasets.
arXiv Detail & Related papers (2022-05-10T07:52:24Z)
- IA-FaceS: A Bidirectional Method for Semantic Face Editing [8.19063619210761]
This paper proposes a bidirectional method for disentangled face attribute manipulation as well as flexible, controllable component editing.
To our knowledge, IA-FaceS is the first such method developed without any input visual guidance, such as segmentation masks or sketches.
Both quantitative and qualitative results indicate that the proposed method outperforms the other techniques in reconstruction, face attribute manipulation, and component transfer.
arXiv Detail & Related papers (2022-03-24T14:44:56Z)
- Reference-guided Face Component Editing [51.29105560090321]
We propose a novel framework termed r-FACE (Reference-guided FAce Component Editing) for diverse and controllable face component editing.
Specifically, r-FACE takes an image inpainting model as the backbone, utilizing reference images as conditions for controlling the shape of face components.
To encourage the framework to concentrate on the target face components, an example-guided attention module is designed to fuse attention features with the target face component features extracted from the reference image.
arXiv Detail & Related papers (2020-06-03T05:34:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.