Contrastive Attention Network with Dense Field Estimation for Face Completion
- URL: http://arxiv.org/abs/2112.10310v1
- Date: Mon, 20 Dec 2021 02:54:38 GMT
- Title: Contrastive Attention Network with Dense Field Estimation for Face Completion
- Authors: Xin Ma, Xiaoqiang Zhou, Huaibo Huang, Gengyun Jia, Zhenhua Chai, Xiaolin Wei
- Abstract summary: We propose a self-supervised Siamese inference network to improve the generalization and robustness of encoders.
To deal with geometric variations of face images, a dense correspondence field is integrated into the network.
This multi-scale architecture helps the decoder propagate the discriminative representations learned by the encoders into the restored images.
- Score: 11.631559190975034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most modern face completion approaches adopt an autoencoder or its variants
to restore missing regions in face images. Encoders are often utilized to learn
powerful representations that play an important role in meeting the challenges
of sophisticated learning tasks. Specifically, masks of various kinds are often
present in face images in the wild, forming complex patterns, especially
during the COVID-19 pandemic. It is difficult for encoders to capture such
powerful representations under this complex situation. To address this
challenge, we propose a self-supervised Siamese inference network to improve
the generalization and robustness of encoders. It can encode contextual
semantics from full-resolution images and obtain more discriminative
representations. To deal with geometric variations of face images, a dense
correspondence field is integrated into the network. We further propose a
multi-scale decoder with a novel dual attention fusion module (DAF), which can
combine the restored and known regions in an adaptive manner. This multi-scale
architecture helps the decoder propagate the discriminative representations
learned by the encoders into the restored images. Extensive experiments
clearly demonstrate that the proposed approach not only achieves more visually
appealing results than state-of-the-art methods but also improves the
performance of masked face recognition dramatically.
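The abstract describes the self-supervised Siamese inference network only at a high level. As a rough illustration of the underlying idea, the sketch below trains two weight-sharing encoders on a masked view and the full view of the same face with an InfoNCE-style contrastive loss, so that matching pairs attract while other images in the batch repel. Everything here (the toy `SiameseEncoder`, the `info_nce` helper, the temperature `tau`) is an illustrative assumption, not the authors' code.

```python
# Minimal sketch (assumed, not from the paper): a Siamese encoder trained
# contrastively on (masked view, full view) pairs of the same face.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Toy stand-in for the paper's encoder; the real architecture is unknown."""
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, dim),
        )

    def forward(self, x):
        return F.normalize(self.backbone(x), dim=1)  # unit-length embeddings

def info_nce(z_masked, z_full, tau=0.07):
    """Matching (masked, full) pairs are positives; all other images
    in the batch act as negatives."""
    logits = z_masked @ z_full.t() / tau      # (B, B) similarity matrix
    targets = torch.arange(z_masked.size(0))  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

encoder = SiameseEncoder()
faces = torch.rand(8, 3, 64, 64)                  # dummy face crops
mask = (torch.rand(8, 1, 64, 64) > 0.25).float()  # random occlusion mask
loss = info_nce(encoder(faces * mask), encoder(faces))
loss.backward()
```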
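The dense correspondence field is likewise only sketched in the abstract. One common mechanism for using such a field is to warp features through it with `grid_sample`; the random offset below stands in for a network-predicted correspondence purely for illustration.

```python
# Minimal sketch (assumed): warping features through a dense field.
import torch
import torch.nn.functional as F

B, C, H, W = 2, 3, 64, 64
feat = torch.rand(B, C, H, W)

# Identity sampling grid in [-1, 1]; grid_sample expects (x, y) order.
ys, xs = torch.meshgrid(
    torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
identity = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)

# A real model would predict this field from the input face; a small
# random perturbation is used here only to make the example runnable.
field = identity + 0.05 * torch.randn(B, H, W, 2)

warped = F.grid_sample(feat, field, align_corners=True)  # (B, C, H, W)
```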
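Finally, the dual attention fusion (DAF) module is characterized only as combining restored and known regions adaptively. The sketch below substitutes a simple learned per-pixel gate for the paper's attention-based design; `AdaptiveFusion` and its layout are assumptions, not the actual module.

```python
# Minimal sketch (assumed): adaptive blending of restored and known regions.
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Learned per-pixel gate, a simplification of the attention-based DAF."""
    def __init__(self, channels=3):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels * 2 + 1, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, restored, known, mask):
        # mask: 1 where pixels are known, 0 inside the hole
        alpha = self.gate(torch.cat([restored, known * mask, mask], dim=1))
        # Trust known pixels where the gate says so; fall back to the
        # restored content elsewhere, including the hole interior.
        return alpha * known * mask + (1 - alpha) * restored

fuse = AdaptiveFusion()
restored = torch.rand(2, 3, 64, 64)              # decoder output
known = torch.rand(2, 3, 64, 64)                 # input with valid pixels
mask = (torch.rand(2, 1, 64, 64) > 0.3).float()
out = fuse(restored, known, mask)                # (2, 3, 64, 64)
```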
Related papers
- MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection [64.29452783056253]
The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia.
Although existing approaches mainly capture face forgery patterns using the image modality, other modalities such as fine-grained noise and text are not fully explored.
We propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities.
arXiv Detail & Related papers (2024-09-15T13:08:59Z)
- Pixel-Aligned Multi-View Generation with Depth Guided Decoder [86.1813201212539]
We propose a novel method for pixel-level image-to-multi-view generation.
Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model.
Our model enables better pixel alignment across multi-view images.
arXiv Detail & Related papers (2024-08-26T04:56:41Z)
- MouSi: Poly-Visual-Expert Vision-Language Models [132.58949014605477]
This paper proposes an ensemble-of-experts technique to synergize the capabilities of individual visual encoders.
This technique introduces a fusion network to unify the processing of outputs from different visual experts.
In our implementation, this technique significantly reduces the positional occupancy of models like SAM, from a substantial 4096 down to a more manageable 64, or even 1.
arXiv Detail & Related papers (2024-01-30T18:09:11Z)
- Leveraging Image Complexity in Macro-Level Neural Network Design for Medical Image Segmentation [3.974175960216864]
We show that image complexity can be used as a guideline in choosing what is best for a given dataset.
For high-complexity datasets, a shallow network running on the original images may yield better segmentation results than a deep network running on downsampled images.
arXiv Detail & Related papers (2021-12-21T09:49:47Z)
- LAVT: Language-Aware Vision Transformer for Referring Image Segmentation [80.54244087314025]
We show that better cross-modal alignment can be achieved through the early fusion of linguistic and visual features in a vision Transformer encoder network.
Our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
arXiv Detail & Related papers (2021-12-04T04:53:35Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- OLED: One-Class Learned Encoder-Decoder Network with Adversarial Context Masking for Novelty Detection [1.933681537640272]
Novelty detection is the task of recognizing samples that do not belong to the distribution of the target class.
Deep autoencoders have been widely used as a base of many unsupervised novelty detection methods.
We have designed a framework consisting of two competing networks, a Mask Module and a Reconstructor.
arXiv Detail & Related papers (2021-03-27T17:59:40Z)
- Attention-Based Multimodal Image Matching [16.335191345543063]
We propose an attention-based approach for multimodal image patch matching using a Transformer encoder.
Our encoder is shown to efficiently aggregate multiscale image embeddings while emphasizing task-specific appearance-invariant image cues.
This is the first successful application of the Transformer encoder architecture to the multimodal image patch matching task.
arXiv Detail & Related papers (2021-03-20T21:14:24Z)
- Free-Form Image Inpainting via Contrastive Attention Network [64.05544199212831]
In image inpainting tasks, masks of arbitrary shape can appear anywhere in an image, forming complex patterns.
It is difficult for encoders to capture such powerful representations under this complex situation.
We propose a self-supervised Siamese inference network to improve the robustness and generalization of encoders.
arXiv Detail & Related papers (2020-10-29T14:46:05Z)