FPGA: Flexible Portrait Generation Approach
- URL: http://arxiv.org/abs/2408.09248v3
- Date: Sun, 23 Feb 2025 12:25:46 GMT
- Title: FPGA: Flexible Portrait Generation Approach
- Authors: Zhaoli Deng, Fanyi Wang, Junkang Zhang, Fan Chen, Meng Zhang, Wendong Zhang, Wen Liu, Zhenpeng Mi,
- Abstract summary: We propose a comprehensive system called FPGA and construct a million-level multi-modal dataset IDZoom for training. FPGA consists of a Multi-Mode Fusion training strategy (MMF) and a DDIM Inversion based ID Restoration inference framework (DIIR). DIIR is plug-and-play and can be applied to any diffusion-based portrait generation method to enhance its performance.
- Score: 11.002947043723617
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Portrait Fidelity Generation is a prominent research area in generative models. Current methods face challenges in generating full-body images with low-resolution faces, especially in multi-ID photo scenarios. To tackle these issues, we propose a comprehensive system called FPGA and construct a million-level multi-modal dataset, IDZoom, for training. FPGA consists of a Multi-Mode Fusion training strategy (MMF) and a DDIM Inversion based ID Restoration inference framework (DIIR). The MMF aims to activate the specified ID in the specified facial region. The DIIR aims to address face artifacts while preserving the background. Furthermore, DIIR is plug-and-play and can be applied to any diffusion-based portrait generation method to enhance its performance. DIIR is also capable of performing face-swapping tasks and is applicable to stylized faces as well. To validate the effectiveness of FPGA, we conducted extensive comparative and ablation experiments. The experimental results demonstrate that FPGA has significant advantages in both subjective and objective metrics, and achieves controllable generation in multi-ID scenarios. In addition, we accelerate inference to within 2.5 seconds on a single L20 graphics card, mainly based on our well-designed reparameterization method, RepControlNet.
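The DDIM Inversion that DIIR builds on is a standard, deterministic mapping from a clean latent back to noise. The sketch below shows only that generic update; the noise predictor and alpha-bar schedule are toy stand-ins, not FPGA's actual components:

```python
import numpy as np

def ddim_invert(x0, eps_model, alphas_cum):
    """Deterministically map a clean latent x0 back toward noise via DDIM inversion.

    eps_model(x, t) is a noise predictor and alphas_cum the cumulative
    alpha-bar schedule; both are hypothetical stand-ins for illustration.
    """
    x = x0
    for t in range(len(alphas_cum) - 1):
        a_t, a_next = alphas_cum[t], alphas_cum[t + 1]
        eps = eps_model(x, t)
        # Predict the clean sample implied by the current latent ...
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # ... then step *forward* in noise level (the reverse of DDIM sampling).
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
    return x

# Toy run: with a zero noise predictor, inversion reduces to a pure rescaling.
alphas = np.linspace(0.999, 0.5, 10)
xT = ddim_invert(np.ones(4), lambda x, t: np.zeros_like(x), alphas)
```

Because the update is deterministic, running the sampler forward from the inverted latent reconstructs the input, which is what lets DIIR-style methods edit a face region while keeping the background.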
Related papers
- DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning [8.184155602678754]
DreamID is a diffusion-based face swapping model that achieves high levels of ID similarity, attribute preservation, image fidelity, and fast inference speed.
We propose an improved diffusion-based model architecture comprising SwapNet, FaceNet, and ID Adapter.
DreamID outperforms state-of-the-art methods in terms of identity similarity, pose and expression preservation, and image fidelity.
arXiv Detail & Related papers (2025-04-20T06:53:00Z) - FusionSegReID: Advancing Person Re-Identification with Multimodal Retrieval and Precise Segmentation [42.980289787679084]
Person re-identification (ReID) plays a critical role in applications like security surveillance and criminal investigations by matching individuals across large image galleries captured by non-overlapping cameras.
Traditional ReID methods rely on unimodal inputs, typically images, but face limitations due to challenges like occlusions, lighting changes, and pose variations.
This paper presents FusionSegReID, a multimodal model that combines both image and text inputs for enhanced ReID performance.
arXiv Detail & Related papers (2025-03-27T15:14:03Z) - MMGen: Unified Multi-modal Image Generation and Understanding in One Go [60.97155790727879]
We introduce MMGen, a unified framework that integrates multiple generative tasks into a single diffusion model.
Our approach develops a novel diffusion transformer that flexibly supports multi-modal output, along with a simple modality-decoupling strategy.
arXiv Detail & Related papers (2025-03-26T15:37:17Z) - Multi-focal Conditioned Latent Diffusion for Person Image Synthesis [59.113899155476005]
The Latent Diffusion Model (LDM) has demonstrated strong capabilities in high-resolution image generation.
We propose a Multi-focal Conditioned Latent Diffusion (MCLD) method to address these limitations.
Our approach utilizes a multi-focal condition aggregation module, which effectively integrates facial identity and texture-specific information.
arXiv Detail & Related papers (2025-03-19T20:50:10Z) - Large Language Models for Multimodal Deformable Image Registration [50.91473745610945]
We propose a novel coarse-to-fine MDIR framework,LLM-Morph, for aligning the deep features from different modal medical images.
Specifically, we first utilize a CNN encoder to extract deep visual features from cross-modal image pairs, then we use the first adapter to adjust these tokens, and use LoRA in pre-trained LLMs to fine-tune their weights.
Third, for the alignment of tokens, we utilize four other adapters to transform the LLM-encoded tokens into multi-scale visual features, generating multi-scale deformation fields and facilitating the coarse-to-fine MDIR task.
arXiv Detail & Related papers (2024-08-20T09:58:30Z) - Synthesizing Efficient Data with Diffusion Models for Person Re-Identification Pre-Training [51.87027943520492]
We present a novel paradigm Diffusion-ReID to efficiently augment and generate diverse images based on known identities.
Benefiting from our proposed paradigm, we first create a new large-scale person Re-ID dataset Diff-Person, which consists of over 777K images from 5,183 identities.
arXiv Detail & Related papers (2024-06-10T06:26:03Z) - InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation [0.0]
"InstantFamily" is an approach that employs a novel cross-attention mechanism and a multimodal embedding stack to achieve zero-shot multi-ID image generation.
Our method effectively preserves ID as it utilizes global and local features from a pre-trained face recognition model integrated with text conditions.
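The masked-attention idea the InstantFamily abstract describes, restricting each identity's tokens to its own facial region, can be sketched generically. Everything here (shapes, the mask layout) is a toy illustration, not the paper's architecture:

```python
import numpy as np

def masked_cross_attention(q, k, v, region_mask):
    """Cross-attention where query i may only attend to keys j with mask[i, j] = 1.

    q: (n_q, d) ID queries; k, v: (n_k, d) image tokens; region_mask: (n_q, n_k).
    Masking attention this way prevents one identity's features from leaking
    into another identity's face region.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(region_mask > 0, scores, -1e9)  # block cross-ID attention
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy: one ID query restricted to the first face-region token.
out = masked_cross_attention(np.ones((1, 2)),
                             np.array([[1., 0.], [0., 1.]]),
                             np.array([[5., 5.], [7., 7.]]),
                             np.array([[1, 0]]))
```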
arXiv Detail & Related papers (2024-04-30T10:16:21Z) - ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving [66.09976326184066]
ConsistentID is an innovative method crafted for diverse identity-preserving portrait generation under fine-grained multimodal facial prompts.
We present a fine-grained portrait dataset, FGID, with over 500,000 facial images, offering greater diversity and comprehensiveness than existing public facial datasets.
arXiv Detail & Related papers (2024-04-25T17:23:43Z) - Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification [64.36210786350568]
We propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID.
Our framework can generate more discriminative features for multi-modal object ReID.
arXiv Detail & Related papers (2024-03-15T12:44:35Z) - Hybrid-Supervised Dual-Search: Leveraging Automatic Learning for Loss-free Multi-Exposure Image Fusion [60.221404321514086]
Multi-exposure image fusion (MEF) has emerged as a prominent solution to address the limitations of digital imaging in representing varied exposure levels.
This paper presents a Hybrid-Supervised Dual-Search approach for MEF, dubbed HSDS-MEF, which introduces a bi-level optimization search scheme for automatic design of both network structures and loss functions.
arXiv Detail & Related papers (2023-09-03T08:07:26Z) - Learning Progressive Modality-shared Transformers for Effective Visible-Infrared Person Re-identification [27.75907274034702]
We propose a novel deep learning framework named Progressive Modality-shared Transformer (PMT) for effective VI-ReID.
To reduce the negative effect of modality gaps, we first take the gray-scale images as an auxiliary modality and propose a progressive learning strategy.
To cope with the problem of large intra-class differences and small inter-class differences, we propose a Discriminative Center Loss.
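The classic center-loss form that a "Discriminative Center Loss" builds on pulls each feature toward its class center, shrinking intra-class variance. This sketch shows only that shared-center core; the paper's modality-aware terms are not reproduced here:

```python
import numpy as np

def center_loss(features, labels, centers):
    """0.5 * mean squared distance from each feature to its own class center.

    features: (n, d); labels: (n,) integer class ids; centers: (n_classes, d).
    A toy illustration of the generic center loss, not the paper's exact loss.
    """
    diffs = features - centers[labels]  # (n, d) offsets to each sample's center
    return 0.5 * np.mean(np.sum(diffs ** 2, axis=1))

# Toy: two samples, each offset from its class center by 1 in one dimension.
centers = np.array([[0., 0.], [1., 1.]])
labels = np.array([0, 1])
loss = center_loss(centers[labels] + np.array([[1., 0.], [1., 0.]]), labels, centers)
```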
arXiv Detail & Related papers (2022-12-01T02:20:16Z) - Physically-Based Face Rendering for NIR-VIS Face Recognition [165.54414962403555]
Near infrared (NIR) to Visible (VIS) face matching is challenging due to the significant domain gaps.
We propose a novel method for paired NIR-VIS facial image generation.
To facilitate the identity feature learning, we propose an IDentity-based Maximum Mean Discrepancy (ID-MMD) loss.
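An MMD-style loss such as ID-MMD measures the distance between two feature distributions through kernel means. The snippet below is the generic biased squared-MMD estimator with an RBF kernel, not the paper's exact identity-conditioned loss:

```python
import numpy as np

def mmd_rbf(x, y, gamma=1.0):
    """Biased squared Maximum Mean Discrepancy between samples x and y.

    x: (n, d), y: (m, d). Smaller values mean the two feature sets
    (e.g. NIR vs. VIS identity features) are distributed more alike.
    """
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
        return np.exp(-gamma * d2)                           # RBF kernel matrix
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

# Toy: identical sets give zero; well-separated sets give a positive value.
x = np.array([[0., 0.], [1., 1.]])
y = np.array([[5., 5.], [6., 6.]])
```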
arXiv Detail & Related papers (2022-11-11T18:48:16Z) - MAXIM: Multi-Axis MLP for Image Processing [19.192826213493838]
We present a multi-axis based architecture, called MAXIM, that can serve as an efficient general-purpose vision backbone for image processing tasks.
MAXIM uses a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially-gated MLPs.
Results show that the proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks.
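The spatial gating that such gated-MLP blocks rely on splits the channels and gates one half with a learned spatial projection of the other. This is only the single-axis core of the idea, as a toy sketch; MAXIM's multi-axis variant applies it along different axes:

```python
import numpy as np

def spatial_gating(x, w, b):
    """gMLP-style spatial gating unit.

    x: (n_tokens, d) with d even; w: (n_tokens, n_tokens) spatial projection;
    b: (n_tokens,) bias. All shapes are illustrative assumptions.
    """
    u, v = np.split(x, 2, axis=-1)   # split channels into content and gate halves
    v = w @ v + b[:, None]           # mix the gate half across spatial tokens
    return u * v                     # elementwise gate

# Toy: identity projection and zero bias reduce to an elementwise product.
x = np.array([[1., 2., 3., 4.]])
out = spatial_gating(x, np.eye(1), np.zeros(1))
```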
arXiv Detail & Related papers (2022-01-09T09:59:32Z) - Efficient and Accurate Multi-scale Topological Network for Single Image Dehazing [31.543771270803056]
In this paper, we pay attention to the feature extraction and utilization of the input image itself.
We propose a Multi-scale Topological Network (MSTN) to fully explore the features at different scales.
Meanwhile, we design a Multi-scale Feature Fusion Module (MFFM) and an Adaptive Feature Selection Module (AFSM) to achieve the selection and fusion of features at different scales.
arXiv Detail & Related papers (2021-02-24T08:53:14Z) - DCDLearn: Multi-order Deep Cross-distance Learning for Vehicle Re-Identification [22.547915009758256]
This paper formulates a multi-order deep cross-distance learning model for vehicle re-identification.
One-view CycleGAN model is developed to alleviate exhaustive and enumerative cross-camera matching problem.
Experiments on three vehicle Re-ID datasets demonstrate that the proposed method achieves significant improvement over the state-of-the-arts.
arXiv Detail & Related papers (2020-03-25T10:46:54Z) - ASFD: Automatic and Scalable Face Detector [129.82350993748258]
We propose a novel Automatic and Scalable Face Detector (ASFD).
ASFD is based on a combination of neural architecture search techniques as well as a new loss design.
Our ASFD-D6 outperforms the prior strong competitors, and our lightweight ASFD-D0 runs at more than 120 FPS with Mobilenet for VGA-resolution images.
arXiv Detail & Related papers (2020-03-25T06:00:47Z) - Cross-Resolution Adversarial Dual Network for Person Re-Identification and Beyond [59.149653740463435]
Person re-identification (re-ID) aims at matching images of the same person across camera views.
Due to varying distances between cameras and persons of interest, resolution mismatch can be expected.
We propose a novel generative adversarial network to address cross-resolution person re-ID.
arXiv Detail & Related papers (2020-02-19T07:21:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.