Taming Identity Consistency and Prompt Diversity in Diffusion Models via Latent Concatenation and Masked Conditional Flow Matching
- URL: http://arxiv.org/abs/2511.08061v1
- Date: Wed, 12 Nov 2025 01:37:22 GMT
- Title: Taming Identity Consistency and Prompt Diversity in Diffusion Models via Latent Concatenation and Masked Conditional Flow Matching
- Authors: Aditi Singhania, Arushi Jain, Krutik Malani, Riddhi Dhawan, Souymodip Chakraborty, Vineet Batra, Ankit Phogat
- Abstract summary: Subject-driven image generation aims to synthesize novel depictions of a specific subject across diverse contexts. We propose a LoRA fine-tuned diffusion model employing a latent concatenation strategy. For filtering and quality assessment, we present CHARIS, a fine-grained evaluation framework.
- Score: 1.9270911143386336
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Subject-driven image generation aims to synthesize novel depictions of a specific subject across diverse contexts while preserving its core identity features. Achieving strong identity consistency and high prompt diversity at the same time involves a fundamental trade-off. We propose a LoRA fine-tuned diffusion model employing a latent concatenation strategy, which jointly processes reference and target images, combined with a masked Conditional Flow Matching (CFM) objective. This approach enables robust identity preservation without architectural modifications. To facilitate large-scale training, we introduce a two-stage Distilled Data Curation Framework: the first stage leverages data restoration and VLM-based filtering to create a compact, high-quality seed dataset from diverse sources; the second stage uses these curated examples for parameter-efficient fine-tuning, scaling generation across varied subjects and contexts. Finally, for filtering and quality assessment, we present CHARIS, a fine-grained evaluation framework that performs attribute-level comparisons along five key axes: identity consistency, prompt adherence, region-wise color fidelity, visual quality, and transformation diversity.
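To make the training objective concrete, below is a minimal PyTorch sketch of a masked Conditional Flow Matching loss over concatenated latents, following the abstract's description. The straight-line noise-to-data path, the width-wise concatenation, and all names (`velocity_model`, `ref_latent`, `tgt_latent`, `text_emb`) are illustrative assumptions; the paper's actual interpolation path, masking scheme, and model interface may differ.

```python
import torch

def masked_cfm_loss(velocity_model, ref_latent, tgt_latent, text_emb):
    """Masked CFM loss over a concatenated reference/target latent pair.

    ref_latent, tgt_latent: (B, C, H, W) VAE latents; text_emb: prompt embedding.
    All interfaces are hypothetical stand-ins, not the paper's actual API.
    """
    B, _, _, W = tgt_latent.shape
    noise = torch.randn_like(tgt_latent)
    t = torch.rand(B, device=tgt_latent.device).view(B, 1, 1, 1)

    # Straight-line probability path from noise to the clean target latent;
    # the CFM regression target is the constant velocity along that path.
    x_t = (1.0 - t) * noise + t * tgt_latent
    v_target = tgt_latent - noise

    # Latent concatenation: the clean reference latent is stitched to the
    # noisy target latent along the width axis, so the backbone sees the
    # identity reference without any architectural modification.
    joint = torch.cat([ref_latent, x_t], dim=-1)           # (B, C, H, 2W)
    v_pred = velocity_model(joint, t.flatten(), text_emb)  # (B, C, H, 2W)

    # Mask the loss so only the target half is supervised: the reference
    # half acts purely as conditioning, not as a generation target.
    mask = torch.zeros_like(v_pred)
    mask[..., W:] = 1.0
    v_full = torch.cat([torch.zeros_like(ref_latent), v_target], dim=-1)
    return ((v_pred - v_full) ** 2 * mask).sum() / mask.sum()

# Toy usage with a stand-in network; the real backbone would be the
# LoRA-fine-tuned diffusion model described in the abstract.
class ToyVelocityNet(torch.nn.Module):
    def __init__(self, channels=4):
        super().__init__()
        self.conv = torch.nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, t, text_emb):  # t and text_emb unused in the toy
        return self.conv(x)

if __name__ == "__main__":
    model = ToyVelocityNet()
    ref, tgt = torch.randn(2, 4, 16, 16), torch.randn(2, 4, 16, 16)
    print(masked_cfm_loss(model, ref, tgt, text_emb=None))
```

Masking only the target half is what lets a single joint forward pass supervise generation while leaving the reference region free to serve as an identity anchor.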
Related papers
- DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer [21.788582116033684]
Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video. Existing methods struggle to maintain identity similarity and preserve attributes while keeping temporal consistency. We propose a comprehensive framework that transfers the strengths of Image Face Swapping to the video domain.
arXiv Detail & Related papers (2026-01-04T08:07:11Z)
- WithAnyone: Towards Controllable and ID Consistent Image Generation [83.55786496542062]
Identity-consistent generation has become an important focus in text-to-image research. We develop a large-scale paired dataset tailored for multi-person scenarios. We propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity.
arXiv Detail & Related papers (2025-10-16T17:59:54Z)
- Hybrid Generative Fusion for Efficient and Privacy-Preserving Face Recognition Dataset Generation [87.48785461212556]
We present our approach to the DataCV ICCV Challenge, which centers on building a high-quality face dataset to train a face recognition model. The constructed dataset must not contain identities overlapping with any existing public face datasets. Our method achieves 1st place in the competition, and experimental results show that our dataset improves model performance across 10K, 20K, and 100K identity scales.
arXiv Detail & Related papers (2025-08-14T14:14:18Z)
- Subject-Consistent and Pose-Diverse Text-to-Image Generation [36.67159307721023]
We propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi. It enables consistent subject generation with diverse poses and layouts. CoDi achieves both better visual perception and stronger performance across all metrics.
arXiv Detail & Related papers (2025-07-11T08:15:56Z)
- Dual-Granularity Cross-Modal Identity Association for Weakly-Supervised Text-to-Person Image Matching [7.1469465755934785]
Weakly supervised text-to-person image matching is a crucial approach to reducing models' reliance on large-scale manually labeled samples. We propose a dual-granularity identity association mechanism to predict complex one-to-many identity relationships. Experimental results demonstrate that the proposed method substantially boosts cross-modal matching accuracy.
arXiv Detail & Related papers (2025-07-09T10:59:13Z)
- Noise Consistency Regularization for Improved Subject-Driven Image Synthesis [55.75426086791612]
Fine-tuning Stable Diffusion enables subject-driven image synthesis by adapting the model to generate images containing specific subjects. Existing fine-tuning methods suffer from two key issues: underfitting, where the model fails to reliably capture subject identity, and overfitting, where it memorizes the subject image and reduces background diversity. We propose two auxiliary consistency losses for diffusion fine-tuning. First, a prior consistency regularization loss ensures that the predicted diffusion noise for prior (non-subject) images remains consistent with that of the pretrained model, improving fidelity (a sketch of this loss appears after this list).
arXiv Detail & Related papers (2025-06-06T19:17:37Z)
- Identity-Preserving Text-to-Image Generation via Dual-Level Feature Decoupling and Expert-Guided Fusion [35.67333978414322]
We propose a novel framework that improves the separation of identity-related and identity-unrelated features. Our framework consists of two key components: an Implicit-Explicit foreground-background Decoupling Module and a Feature Fusion Module.
arXiv Detail & Related papers (2025-05-28T13:40:46Z)
- SGD-Mix: Enhancing Domain-Specific Image Classification with Label-Preserving Data Augmentation [0.6554326244334868]
We propose a novel framework that explicitly integrates diversity, faithfulness, and label clarity into the augmentation process. Our approach employs saliency-guided mixing and a fine-tuned diffusion model to preserve foreground semantics, enrich background diversity, and ensure label consistency.
arXiv Detail & Related papers (2025-05-17T03:51:18Z)
- ID$^3$: Identity-Preserving-yet-Diversified Diffusion Models for Synthetic Face Recognition [60.15830516741776]
Synthetic face recognition (SFR) aims to generate datasets that mimic the distribution of real face data.
We introduce a diffusion-fueled SFR model termed ID$^3$.
ID$^3$ employs an ID-preserving loss to generate diverse yet identity-consistent facial appearances.
arXiv Detail & Related papers (2024-09-26T06:46:40Z)
- Comprehensive Generative Replay for Task-Incremental Segmentation with Concurrent Appearance and Semantic Forgetting [49.87694319431288]
Generalist segmentation models are increasingly favored for diverse tasks involving various objects from different image sources.
We propose a Comprehensive Generative Replay (CGR) framework that restores appearance and semantic knowledge by synthesizing image-mask pairs.
Experiments on incremental tasks (cardiac, fundus, and prostate segmentation) show its clear advantage in alleviating concurrent appearance and semantic forgetting.
arXiv Detail & Related papers (2024-06-28T10:05:58Z)
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is through augmenting the training set with synthetic images generated by T2I models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
- Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations [61.132408427908175]
Zero-shot GAN adaptation aims to reuse well-trained generators to synthesize images of an unseen target domain.
With only a single representative text feature instead of real images, the synthesized images gradually lose diversity.
We propose a novel method to find semantic variations of the target text in the CLIP space.
arXiv Detail & Related papers (2023-08-21T08:12:28Z)
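As flagged in the Noise Consistency Regularization entry above, here is a hedged sketch of a prior consistency regularization loss: the fine-tuned network's noise prediction on prior (non-subject) images is pulled toward the frozen pretrained model's prediction. The DDPM-style noising step and the `unet_ft` / `unet_frozen` / `alpha_bar_t` names are assumptions for illustration, not that paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def prior_consistency_loss(unet_ft, unet_frozen, x0_prior, alpha_bar_t, t, cond):
    """MSE between fine-tuned and frozen noise predictions on prior images.

    x0_prior: (B, C, H, W) latents of prior (non-subject) images;
    alpha_bar_t: (B, 1, 1, 1) cumulative noise-schedule terms at timestep t.
    """
    noise = torch.randn_like(x0_prior)
    # Standard DDPM forward noising (illustrative; the actual scheduler may differ).
    x_t = alpha_bar_t.sqrt() * x0_prior + (1.0 - alpha_bar_t).sqrt() * noise

    eps_ft = unet_ft(x_t, t, cond)
    with torch.no_grad():  # the pretrained model stays frozen
        eps_pre = unet_frozen(x_t, t, cond)
    return F.mse_loss(eps_ft, eps_pre)
```

Anchoring the fine-tuned model to its pretrained noise predictions on non-subject images is what counteracts the overfitting failure mode that entry describes.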
This list is automatically generated from the titles and abstracts of the papers on this site.