S$^3$POT: Contrast-Driven Face Occlusion Segmentation via Self-Supervised Prompt Learning
- URL: http://arxiv.org/abs/2602.00635v1
- Date: Sat, 31 Jan 2026 10:05:13 GMT
- Title: S$^3$POT: Contrast-Driven Face Occlusion Segmentation via Self-Supervised Prompt Learning
- Authors: Lingsong Wang, Mancheng Meng, Ziyan Wu, Terrence Chen, Fan Yang, Dinggang Shen
- Abstract summary: We present S$^3$POT, a contrast-driven framework synergizing face generation with self-supervised spatial prompting. S$^3$POT consists of three modules: Reference Generation, Feature Enhancement, and Prompt Selection. Experiments on a dedicated dataset demonstrate S$^3$POT's superior performance and the effectiveness of each module.
- Score: 46.05577414378133
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing face parsing methods usually misclassify occlusions as facial components. This is because occlusion is a high-level concept: it does not refer to a concrete category of object. Constructing a real-world face dataset covering every category of occluding object is therefore almost impossible, and accurate mask annotation is labor-intensive. To address these problems, we present S$^3$POT, a contrast-driven framework that synergizes face generation with self-supervised spatial prompting to achieve occlusion segmentation. The framework is inspired by two insights: 1) modern face generators can realistically reconstruct occluded regions, creating an image that preserves facial geometry while eliminating occlusion, and 2) foundation segmentation models (e.g., SAM) can extract precise masks when provided with appropriate prompts. In particular, S$^3$POT consists of three modules: Reference Generation (RF), Feature Enhancement (FE), and Prompt Selection (PS). First, RF produces a reference image using structural guidance from the parsed mask. Second, FE contrasts tokens between the raw and reference images to obtain an initial prompt, then modifies the image features with this prompt via cross-attention. Third, based on the enhanced features, PS constructs a set of positive and negative prompts and screens them with a self-attention network for a mask decoder. The network is trained under the guidance of three novel and complementary objective functions, without any occlusion ground-truth masks. Extensive experiments on a dedicated dataset demonstrate S$^3$POT's superior performance and the effectiveness of each module.
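The abstract's token-contrast idea (compare raw and occlusion-free reference features, then derive positive/negative point prompts for a SAM-style decoder) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the cosine-distance scoring, and the simple top-k selection are all assumptions standing in for the learned FE and PS modules.

```python
import numpy as np

def contrast_prompts(raw_feats, ref_feats, grid, k_pos=3, k_neg=3):
    """Rank patch locations by raw-vs-reference feature contrast.

    raw_feats, ref_feats: (N, D) patch token features for the raw
    (possibly occluded) image and the generated occlusion-free reference.
    grid: (N, 2) pixel coordinates of each patch centre.
    Returns candidate positive prompts (high contrast -> likely occluder),
    negative prompts (low contrast -> likely clean face), and the scores.
    """
    # Cosine distance per patch: large where the occluder changed the content.
    raw_n = raw_feats / np.linalg.norm(raw_feats, axis=1, keepdims=True)
    ref_n = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    contrast = 1.0 - np.sum(raw_n * ref_n, axis=1)  # shape (N,)

    order = np.argsort(contrast)
    pos = grid[order[-k_pos:]]   # most-changed patches -> positive prompts
    neg = grid[order[:k_neg]]    # least-changed patches -> negative prompts
    return pos, neg, contrast
```

In a full pipeline, `pos` and `neg` would be passed as point prompts with labels 1 and 0 to a promptable mask decoder; in S$^3$POT the selection is learned by a self-attention network rather than hard-coded top-k as here.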
Related papers
- Semantic-Guided Two-Stage GAN for Face Inpainting with Hybrid Perceptual Encoding [1.7761223012399532]
Facial image inpainting aims to restore missing or corrupted regions in face images while preserving identity, structural consistency, and image quality. Existing methods struggle with large irregular masks, often producing blurry textures at the edges of the masked region. We propose a novel architecture that addresses these challenges through semantic-guided hierarchical synthesis.
arXiv Detail & Related papers (2025-12-04T17:56:08Z) - GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation [81.0871900167463]
We introduce GeoSAM2, a prompt-controllable framework for 3D part segmentation. Given a textureless object, we render normal and point maps from predefined viewpoints. We accept simple 2D prompts - clicks or boxes - to guide part selection. The predicted masks are back-projected to the object and aggregated across views.
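The back-projection and cross-view aggregation step described above can be sketched with a simple majority vote over the 3D points visible in each view. This is a hedged illustration, not GeoSAM2's actual code: the function name, the pixel-to-point index maps, and the voting threshold are assumptions.

```python
import numpy as np

def aggregate_view_masks(point_ids_per_view, masks_per_view, n_points, thresh=0.5):
    """Back-project per-view 2D masks onto object points by majority vote.

    point_ids_per_view: list of (H, W) int arrays mapping each pixel to the
    index of the 3D point it shows (-1 for background).
    masks_per_view: list of (H, W) bool arrays, the predicted 2D part masks.
    Returns a (n_points,) bool array: points covered by the mask in at least
    `thresh` of the views that actually observe them.
    """
    hits = np.zeros(n_points)
    seen = np.zeros(n_points)
    for ids, mask in zip(point_ids_per_view, masks_per_view):
        vis = ids >= 0
        np.add.at(seen, ids[vis], 1)                    # views observing each point
        np.add.at(hits, ids[vis], mask[vis].astype(float))  # views masking it
    seen = np.maximum(seen, 1)  # avoid divide-by-zero for unobserved points
    return (hits / seen) >= thresh
```

A point unobserved in every view is never selected; increasing `thresh` makes the aggregation more conservative against single-view segmentation errors.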
arXiv Detail & Related papers (2025-08-19T17:58:51Z) - PaCo-FR: Patch-Pixel Aligned End-to-End Codebook Learning for Facial Representation Pre-training [32.52750192639004]
PaCo-FR is an unsupervised framework that combines masked image modeling with patch-pixel alignment. PaCo-FR achieves state-of-the-art performance across several facial analysis tasks with just 2 million unlabeled images for pre-training.
arXiv Detail & Related papers (2025-08-13T10:37:41Z) - DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion [50.90541069907167]
We propose DeOcc-1-to-3, an end-to-end framework for occlusion-aware multi-view generation. Our self-supervised training pipeline leverages occluded-unoccluded image pairs and pseudo-ground-truth views to teach the model structure-aware completion and view consistency.
arXiv Detail & Related papers (2025-06-26T17:58:26Z) - SketchYourSeg: Mask-Free Subjective Image Segmentation via Freehand Sketches [116.1810651297801]
SketchYourSeg establishes freehand sketches as a powerful query modality for subjective image segmentation. Our evaluations demonstrate superior performance over existing approaches across diverse benchmarks.
arXiv Detail & Related papers (2025-01-27T13:07:51Z) - Bridge the Points: Graph-based Few-shot Segment Anything Semantically [79.1519244940518]
Recent advancements in pre-training techniques have enhanced the capabilities of vision foundation models.
Recent studies extend SAM to few-shot semantic segmentation (FSS).
We propose a simple yet effective approach based on graph analysis.
arXiv Detail & Related papers (2024-10-09T15:02:28Z) - Enforcing View-Consistency in Class-Agnostic 3D Segmentation Fields [46.711276257688326]
Radiance Fields have become a powerful tool for modeling 3D scenes from multiple images. Some methods work well using 2D semantic masks, but they generalize poorly to class-agnostic segmentations. More recent methods circumvent this issue by using contrastive learning to optimize a high-dimensional 3D feature field instead.
arXiv Detail & Related papers (2024-08-19T12:07:24Z) - Latent-OFER: Detect, Mask, and Reconstruct with Latent Vectors for Occluded Facial Expression Recognition [0.0]
The proposed method can detect occluded parts of the face as if they were unoccluded, and recognize them, improving FER accuracy.
It involves three steps: First, a vision transformer (ViT)-based occlusion patch detector masks the occluded positions by training only on latent vectors from the unoccluded patches.
Second, a hybrid reconstruction network generates the masked positions as a complete image using the ViT and a convolutional neural network (CNN).
Last, an expression-relevant latent vector extractor retrieves and uses expression-related information from all latent vectors by applying a CNN-based class activation map.
arXiv Detail & Related papers (2023-07-21T07:56:32Z) - SD-GAN: Semantic Decomposition for Face Image Synthesis with Discrete Attribute [0.0]
We propose an innovative framework to tackle challenging facial discrete attribute synthesis via semantic decomposing, dubbed SD-GAN.
The fusion network integrates 3D embedding for better identity preservation and discrete attribute synthesis.
We construct a large and valuable dataset, MEGN, to remedy the lack of discrete attribute annotations in existing datasets.
arXiv Detail & Related papers (2022-07-12T04:23:38Z) - Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation [75.00151934315967]
MaskDistill is a novel framework for unsupervised semantic segmentation.
Our framework does not latch onto low-level image cues and is not limited to object-centric datasets.
arXiv Detail & Related papers (2022-06-13T17:59:43Z) - Segmentation-Reconstruction-Guided Facial Image De-occlusion [48.952656891182826]
Occlusions are very common in face images in the wild, leading to degraded performance on face-related tasks.
This paper proposes a novel face de-occlusion model based on face segmentation and 3D face reconstruction.
arXiv Detail & Related papers (2021-12-15T10:40:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.