Latents2Segments: Disentangling the Latent Space of Generative Models
for Semantic Segmentation of Face Images
- URL: http://arxiv.org/abs/2207.01871v2
- Date: Wed, 6 Jul 2022 06:54:09 GMT
- Title: Latents2Segments: Disentangling the Latent Space of Generative Models
for Semantic Segmentation of Face Images
- Authors: Snehal Singh Tomar and A.N. Rajagopalan
- Abstract summary: We do away with the priors and complex pre-processing operations required by SOTA multi-class face segmentation models.
We present results for our model's performance on the CelebAMask-HQ and HELEN datasets.
- Score: 29.496302682744133
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the advent of an increasing number of Augmented and Virtual Reality
applications that aim to perform meaningful and controlled style edits on
images of human faces, the impetus for parsing face images into accurate and
fine-grained semantic segmentation maps is greater than ever before. The few
State-of-the-Art (SOTA) methods that solve this problem do so by incorporating
priors with respect to facial structure or other face attributes, such as
expression and pose, into their deep classifier architectures. In this work, we
do away with the priors and complex pre-processing operations required by SOTA
multi-class face segmentation models by reframing segmentation as a downstream
task performed after infusing disentanglement with respect to facial semantic
regions of interest (ROIs) into the latent space of a Generative Autoencoder
model. We present results for our model's performance on the CelebAMask-HQ and
HELEN datasets. The encoded latent space of our model achieves significantly
higher disentanglement with respect to semantic ROIs than that of other SOTA
works. Moreover, it achieves a 13% faster inference rate and comparable
accuracy with respect to the publicly available SOTA for the downstream task of
semantic segmentation of face images.
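As a concrete illustration of the approach described in the abstract, here is a minimal PyTorch-style sketch, not the authors' code: the latent code is partitioned into one chunk per semantic ROI, and each chunk alone is decoded into that ROI's mask, so training pressures the encoder to disentangle the chunks. All module names, layer sizes, and the number of ROIs are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): an autoencoder whose latent code is
# split into one chunk per semantic ROI, so each chunk can be decoded into the
# mask for that region alone. All names and sizes are illustrative.
import torch
import torch.nn as nn

class DisentangledSegAutoencoder(nn.Module):
    def __init__(self, num_rois=5, chunk_dim=32):
        super().__init__()
        self.num_rois, self.chunk_dim = num_rois, chunk_dim
        self.encoder = nn.Sequential(          # 64x64 RGB -> flat latent
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, num_rois * chunk_dim),
        )
        # One lightweight decoder head per ROI chunk -> one 64x64 mask each.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(chunk_dim, 16 * 16),
                          nn.Unflatten(1, (1, 16, 16)),
                          nn.Upsample(scale_factor=4, mode="bilinear"))
            for _ in range(num_rois)
        ])

    def forward(self, x):
        z = self.encoder(x).view(-1, self.num_rois, self.chunk_dim)
        # Each mask depends only on its own latent chunk; training the heads
        # this way pressures the encoder to disentangle the chunks per ROI.
        masks = [head(z[:, i]) for i, head in enumerate(self.heads)]
        return torch.cat(masks, dim=1)         # (B, num_rois, 64, 64) logits

model = DisentangledSegAutoencoder()
logits = model(torch.randn(2, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 5, 64, 64])
```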
Related papers
- SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation [69.42764583465508]
We explore the potential of generative image diffusion to address the scarcity of annotated data in earth observation tasks.
To the best of our knowledge, we are the first to generate both images and corresponding masks for satellite segmentation.
arXiv Detail & Related papers (2024-03-25T10:30:22Z)
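The joint image-and-mask generation above can be pictured as diffusing over a stacked tensor. Below is a loose sketch of that general recipe, not SatSynth's exact formulation; the noise schedule and shapes are assumptions.

```python
# Loose sketch of joint image-mask diffusion (general recipe, not SatSynth's
# exact formulation): one-hot label maps are stacked with the image channels
# so a single diffusion model denoises both modalities together.
import torch
import torch.nn.functional as F

def make_joint_sample(image, mask, num_classes):
    """image: (B,3,H,W) in [-1,1]; mask: (B,H,W) int labels."""
    onehot = F.one_hot(mask, num_classes).permute(0, 3, 1, 2).float()
    onehot = onehot * 2.0 - 1.0               # rescale labels to [-1, 1] too
    return torch.cat([image, onehot], dim=1)  # (B, 3+num_classes, H, W)

def forward_noise(x0, t, alphas_cumprod):
    """Standard DDPM forward step q(x_t | x_0) applied to the joint tensor."""
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    return a.sqrt() * x0 + (1 - a).sqrt() * eps, eps

B, H, W, num_classes = 2, 64, 64, 4
x0 = make_joint_sample(torch.rand(B, 3, H, W) * 2 - 1,
                       torch.randint(0, num_classes, (B, H, W)), num_classes)
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)  # stand-in schedule
xt, eps = forward_noise(x0, torch.tensor([10, 500]), alphas_cumprod)
print(xt.shape)  # torch.Size([2, 7, 64, 64])
```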
- Faceptor: A Generalist Model for Face Perception [52.8066001012464]
Faceptor adopts a well-designed single-encoder dual-decoder architecture.
Introducing Layer-Attention into Faceptor enables the model to adaptively select features from the optimal layers to perform the desired tasks.
Our training framework can also be applied to auxiliary supervised learning, significantly improving performance in data-sparse tasks such as age estimation and expression recognition.
arXiv Detail & Related papers (2024-03-14T15:42:31Z)
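One way to read the "select features from optimal layers" idea above is as a learned softmax over encoder layers that mixes their features. The following is a hedged sketch of that reading, not Faceptor's actual implementation; shapes are illustrative.

```python
# Hedged sketch of layer selection as a learned softmax over encoder layers.
# This is one reading of Layer-Attention, not Faceptor's actual code.
import torch
import torch.nn as nn

class LayerAttention(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_layers))  # one weight/layer

    def forward(self, layer_feats):
        # layer_feats: (num_layers, B, C, H, W) stacked encoder outputs
        w = torch.softmax(self.logits, dim=0).view(-1, 1, 1, 1, 1)
        return (w * layer_feats).sum(dim=0)   # (B, C, H, W) mixed features

feats = torch.randn(12, 2, 256, 14, 14)       # e.g. 12 transformer blocks
mixed = LayerAttention(num_layers=12)(feats)
print(mixed.shape)  # torch.Size([2, 256, 14, 14])
```

A task-specific instance of this module per decoder would let each task weight the encoder layers differently.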
- EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models [52.3015009878545]
We develop an image segmentor capable of generating fine-grained segmentation maps without any additional training.
Our framework identifies semantic correspondences between image pixels and spatial locations of low-dimensional feature maps.
In extensive experiments, the produced segmentation maps are demonstrated to be well delineated and capture detailed parts of the images.
arXiv Detail & Related papers (2024-01-22T07:34:06Z)
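The entry above hinges on low-dimensional feature maps carrying semantic structure. A loose sketch of the coarse stage of such a recipe is shown below: k-means over low-resolution feature vectors yields semantic groups. EmerDiff's actual pixel-level mapping back to full resolution is more involved than this.

```python
# Loose sketch of the coarse stage only: k-means over low-resolution feature
# vectors yields semantic groups. The paper's full-resolution pixel mapping
# is more involved than this nearest-centroid clustering.
import numpy as np

def kmeans(feats, k, iters=20, seed=0):
    """feats: (N, D) vectors -> (N,) cluster labels, (k, D) centroids."""
    rng = np.random.default_rng(seed)
    centroids = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        d = ((feats[:, None] - centroids[None]) ** 2).sum(-1)  # (N, k)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = feats[labels == j].mean(0)
    return labels, centroids

h, w, d = 16, 16, 64                      # low-res feature map from the model
low_res = np.random.randn(h * w, d).astype(np.float32)
labels, centroids = kmeans(low_res, k=5)
coarse_seg = labels.reshape(h, w)         # 16x16 map of semantic groups
print(coarse_seg.shape, np.unique(coarse_seg))
```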
- EFHQ: Multi-purpose ExtremePose-Face-HQ dataset [1.8194090162317431]
This work introduces a novel dataset named Extreme Pose Face High-Quality dataset (EFHQ), which includes up to 450k high-quality images of faces at extreme poses.
To produce such a massive dataset, we utilize a novel and meticulous dataset processing pipeline to curate two publicly available datasets.
Our dataset can complement existing datasets on various facial-related tasks, such as facial synthesis with 2D/3D-aware GAN, diffusion-based text-to-image face generation, and face reenactment.
arXiv Detail & Related papers (2023-12-28T18:40:31Z)
- Unlocking Pre-trained Image Backbones for Semantic Image Synthesis [29.688029979801577]
We propose a new class of GAN discriminators for semantic image synthesis that generates highly realistic images.
Our model, which we dub DP-SIMS, achieves state-of-the-art results in terms of image quality and consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes.
arXiv Detail & Related papers (2023-12-20T09:39:19Z)
- Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
- Realistic Speech-to-Face Generation with Speech-Conditioned Latent Diffusion Model with Face Prior [13.198105709331617]
We propose a novel speech-to-face generation framework that leverages a Speech-Conditioned Latent Diffusion Model (SCLDM).
This is the first work to harness the exceptional modeling capabilities of diffusion models for speech-to-face generation.
We show that our method can produce more realistic face images while preserving the identity of the speaker better than state-of-the-art methods.
arXiv Detail & Related papers (2023-10-05T07:44:49Z)
- CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation [29.885991324519463]
We propose a novel cross-modality masked self-distillation framework named CM-MaskSD.
Our method inherits the transferred knowledge of image-text semantic alignment from the CLIP model to realize fine-grained patch-word feature alignment.
Our framework can considerably boost model performance in a nearly parameter-free manner.
arXiv Detail & Related papers (2023-05-19T07:17:27Z)
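A hedged sketch of what patch-word alignment with masked self-distillation could look like follows; the shapes, temperature, and KL objective here are my assumptions, not CM-MaskSD's exact design.

```python
# Hedged sketch: a student that sees a masked image is pushed to match the
# teacher's patch-word similarity map. Assumptions, not CM-MaskSD's design.
import torch
import torch.nn.functional as F

def patch_word_similarity(patch_feats, word_feats):
    """(B, P, D) patches x (B, W, D) words -> (B, P, W) cosine similarities."""
    p = F.normalize(patch_feats, dim=-1)
    w = F.normalize(word_feats, dim=-1)
    return torch.einsum("bpd,bwd->bpw", p, w)

def masked_distill_loss(student_sim, teacher_sim, tau=0.1):
    # KL between per-patch word distributions; the teacher is detached.
    t = F.softmax(teacher_sim.detach() / tau, dim=-1)
    s = F.log_softmax(student_sim / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean")

B, P, W, D = 2, 196, 12, 512               # e.g. 14x14 patches, 12 words
teacher_sim = patch_word_similarity(torch.randn(B, P, D), torch.randn(B, W, D))
student_sim = patch_word_similarity(torch.randn(B, P, D), torch.randn(B, W, D))
print(masked_distill_loss(student_sim, teacher_sim).item())
```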
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
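The zero-shot step can be illustrated by classifying dense visual features against text embeddings of class prompts. Note that the paper itself classifies class-agnostic mask proposals with CLIP, so the dense variant below is only a simplified sketch with assumed shapes.

```python
# Generic sketch of zero-shot labeling with image-text embeddings (the paper
# itself classifies class-agnostic mask proposals with CLIP; this simpler
# dense variant just illustrates the zero-shot classification step).
import torch
import torch.nn.functional as F

def zero_shot_segment(pixel_feats, text_embeds, tau=0.07):
    """pixel_feats: (H, W, D) dense visual features aligned to the text space;
    text_embeds: (K, D) one embedding per class prompt -> (H, W) labels."""
    p = F.normalize(pixel_feats, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    logits = torch.einsum("hwd,kd->hwk", p, t) / tau
    return logits.argmax(-1)

H, W, D, K = 32, 32, 512, 4               # e.g. CLIP-sized 512-d embeddings
labels = zero_shot_segment(torch.randn(H, W, D), torch.randn(K, D))
print(labels.shape, labels.unique())      # torch.Size([32, 32]) ...
```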
- Empirical Study of Multi-Task Hourglass Model for Semantic Segmentation Task [0.7614628596146599]
We propose to use a multi-task approach by complementing the semantic segmentation task with edge detection, semantic contour, and distance transform tasks.
We demonstrate the effectiveness of learning in a multi-task setting for hourglass models on the Cityscapes, CamVid, and Freiburg Forest datasets.
arXiv Detail & Related papers (2021-05-28T01:08:10Z)
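The multi-task objective described above can be sketched as a weighted sum of per-task losses. The sketch below keeps segmentation, edge, and distance-transform heads (omitting the semantic contour task) with illustrative weights.

```python
# Hedged sketch of a multi-task objective: segmentation complemented by edge
# and distance-transform heads, combined with fixed illustrative weights.
import torch
import torch.nn.functional as F

def multitask_loss(seg_logits, seg_gt, edge_logits, edge_gt,
                   dist_pred, dist_gt, w_seg=1.0, w_edge=0.5, w_dist=0.5):
    loss_seg = F.cross_entropy(seg_logits, seg_gt)          # per-pixel classes
    loss_edge = F.binary_cross_entropy_with_logits(edge_logits, edge_gt)
    loss_dist = F.l1_loss(dist_pred, dist_gt)               # distance transform
    return w_seg * loss_seg + w_edge * loss_edge + w_dist * loss_dist

B, K, H, W = 2, 19, 64, 64                                  # e.g. Cityscapes K=19
loss = multitask_loss(torch.randn(B, K, H, W), torch.randint(0, K, (B, H, W)),
                      torch.randn(B, 1, H, W), torch.rand(B, 1, H, W).round(),
                      torch.randn(B, 1, H, W), torch.rand(B, 1, H, W))
print(loss.item())
```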
- InterFaceGAN: Interpreting the Disentangled Face Representation Learned by GANs [73.27299786083424]
We propose a framework called InterFaceGAN to interpret the disentangled face representation learned by state-of-the-art GAN models.
We first find that GANs learn various semantics in some linear subspaces of the latent space.
We then conduct a detailed study on the correlation between different semantics and manage to better disentangle them via subspace projection.
arXiv Detail & Related papers (2020-05-18T18:01:22Z)
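The subspace projection mentioned above has a one-line form: given unit normals n1 and n2 of two semantic hyperplanes in latent space, replacing n1 with its component orthogonal to n2 removes the entanglement between the two semantics. A minimal numpy sketch follows; the direction names are illustrative.

```python
# Minimal sketch of the subspace-projection trick: remove n2's component from
# n1 so edits along the new direction leave the second attribute unchanged.
import numpy as np

def project_away(n1, n2):
    """n1, n2: (D,) unit vectors -> n1 made orthogonal to n2, re-unitized."""
    n1p = n1 - (n1 @ n2) * n2
    return n1p / np.linalg.norm(n1p)

rng = np.random.default_rng(0)
n1 = rng.standard_normal(512); n1 /= np.linalg.norm(n1)   # e.g. "smile" normal
n2 = rng.standard_normal(512); n2 /= np.linalg.norm(n2)   # e.g. "pose" normal
n1_clean = project_away(n1, n2)
print(abs(n1_clean @ n2) < 1e-8)   # True: decorrelated from the second semantic
z_edit = rng.standard_normal(512) + 3.0 * n1_clean        # latent-space edit
```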