ReMix: Towards a Unified View of Consistent Character Generation and Editing
- URL: http://arxiv.org/abs/2510.10156v1
- Date: Sat, 11 Oct 2025 10:31:56 GMT
- Title: ReMix: Towards a Unified View of Consistent Character Generation and Editing
- Authors: Benjia Zhou, Bin Fu, Pei Cheng, Yanru Wang, Jiayuan Fan, Tao Chen
- Abstract summary: ReMix is a unified framework for character-consistent generation and editing. It consists of two core components: the ReMix Module and IP-ControlNet. ReMix supports a wide range of tasks, including personalized generation, image editing, style transfer, and multi-condition synthesis.
- Score: 22.04681457337335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large-scale text-to-image diffusion models (e.g., FLUX.1) have greatly improved visual fidelity in consistent character generation and editing. However, existing methods rarely unify these tasks within a single framework. Generation-based approaches struggle with fine-grained identity consistency across instances, while editing-based methods often lose spatial controllability and instruction alignment. To bridge this gap, we propose ReMix, a unified framework for character-consistent generation and editing. It consists of two core components: the ReMix Module and IP-ControlNet. The ReMix Module leverages the multimodal reasoning ability of MLLMs to edit semantic features of input images and adapt instruction embeddings to the native DiT backbone without fine-tuning. While this ensures coherent semantic layouts, pixel-level consistency and pose controllability remain challenging. To address this, IP-ControlNet extends ControlNet to decouple semantic and layout cues from reference images and introduces an ε-equivariant latent space that jointly denoises the reference and target images within a shared noise space. Inspired by convergent evolution and quantum decoherence, where environmental noise drives state convergence, this design promotes feature alignment in the hidden space, enabling consistent object generation while preserving identity. ReMix supports a wide range of tasks, including personalized generation, image editing, style transfer, and multi-condition synthesis. Extensive experiments validate its effectiveness and efficiency as a unified framework for character-consistent image generation and editing.
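The shared-noise design can be pictured with a short sketch. Below is a minimal, hypothetical illustration of jointly denoising a reference and a target latent under one shared noise sample, as the abstract describes; `denoiser` and `scheduler` are diffusers-style stand-ins, not the authors' implementation.

```python
import torch

def joint_denoise_step(denoiser, scheduler, z_ref, z_tgt, t, cond):
    """One denoising step in a shared noise space (illustrative sketch).

    `denoiser` and `scheduler` are assumed, diffusers-style stand-ins:
    scheduler.add_noise(x, eps, t) perturbs a latent, and
    scheduler.step(eps_pred, t, x) applies one reverse-diffusion update.
    """
    # A single noise sample perturbs BOTH latents, so the reference and
    # target trajectories share one noise space (the epsilon-equivariant idea).
    eps = torch.randn_like(z_tgt)
    noisy_ref = scheduler.add_noise(z_ref, eps, t)
    noisy_tgt = scheduler.add_noise(z_tgt, eps, t)
    # Joint forward pass: batching the two latents lets attention layers
    # exchange features between reference and target, promoting alignment.
    latents = torch.cat([noisy_ref, noisy_tgt], dim=0)
    eps_pred = denoiser(latents, t, torch.cat([cond, cond], dim=0))
    _, eps_tgt = eps_pred.chunk(2, dim=0)
    # Only the target latent is stepped; the reference serves as an anchor
    # carrying identity and layout cues.
    return scheduler.step(eps_tgt, t, noisy_tgt)
```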
Related papers
- Optimizing ID Consistency in Multimodal Large Models: Facial Restoration via Alignment, Entanglement, and Disentanglement [54.199726425201895]
Large multimodal editing models have demonstrated powerful editing capabilities across diverse tasks. Current facial ID preservation methods struggle to achieve consistent restoration of both facial identity and the IP of edited elements. We propose EditedID, an Alignment-Disentanglement-Entanglement framework for robust identity-specific facial restoration.
arXiv Detail & Related papers (2026-02-21T08:24:42Z)
- MoGen: A Unified Collaborative Framework for Controllable Multi-Object Image Generation [76.94658056824422]
MoGen is a user-friendly multi-object image generation method. First, we design a Regional Semantic Anchor (RSA) module that precisely anchors phrase units in language descriptions to their corresponding image regions. Second, we introduce an Adaptive Multi-modal Guidance (AMG) module, which adaptively parses and integrates various combinations of multi-source control signals.
arXiv Detail & Related papers (2026-01-09T05:57:48Z)
- MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues [106.02577891104079]
We propose MagicQuill V2, a novel system that introduces a layered composition paradigm to generative image editing. Our method deconstructs creative intent into a stack of controllable visual cues.
arXiv Detail & Related papers (2025-12-02T18:59:58Z)
- IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout [36.70548378032599]
We study quantity- and layout-consistent image editing, abbreviated as QL-Edit, in multi-object scenes. We present IMAGHarmony, a framework featuring a plug-and-play harmony-aware (HA) module that fuses perception semantics while modeling object counts and locations. We also present a preference-guided noise selection (PNS) strategy that selects semantically aligned initial noise through vision-language matching.
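As a rough illustration of what such a noise-selection loop could look like, here is a hypothetical sketch: candidate initial noises are previewed cheaply and scored against the prompt with a vision-language model. The helpers `quick_preview` and `vl_score` are assumptions for illustration, not IMAGHarmony's actual API.

```python
import torch

def select_initial_noise(candidates, quick_preview, vl_score, prompt):
    """Pick the candidate noise whose cheap preview best matches the prompt.

    `candidates` is an iterable of latent noise tensors; `quick_preview`
    (e.g., a few-step decode) and `vl_score` (e.g., a CLIP-style similarity)
    are hypothetical stand-ins for the paper's vision-language matching.
    """
    best_noise, best_score = None, float("-inf")
    for z in candidates:
        image = quick_preview(z)          # cheap generation from this seed
        score = vl_score(image, prompt)   # semantic alignment with the text
        if score > best_score:
            best_noise, best_score = z, score
    return best_noise
```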
arXiv Detail & Related papers (2025-06-02T17:59:09Z)
- VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control [8.685610154314459]
Diffusion models show extraordinary talents in text-to-image generation, but they may still fail to generate highly aesthetic images. We propose the Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter. Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method.
arXiv Detail & Related papers (2024-12-30T08:47:25Z)
- BrushEdit: All-In-One Image Inpainting and Editing [76.93556996538398]
BrushEdit is a novel inpainting-based, instruction-guided image editing paradigm. We devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model. Our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics.
arXiv Detail & Related papers (2024-12-13T17:58:06Z)
- DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting [56.77074226109392]
We propose DreamMix, a diffusion-based framework adept at inserting target objects into user-specified regions. We show that DreamMix achieves a superior balance between identity preservation and attribute editability across diverse applications.
arXiv Detail & Related papers (2024-11-26T08:44:47Z)
- ZePo: Zero-Shot Portrait Stylization with Faster Sampling [61.14140480095604]
This paper presents an inversion-free portrait stylization framework based on diffusion models that accomplishes content and style feature fusion in merely four sampling steps.
We propose a feature merging strategy to amalgamate redundant features in Consistency Features, thereby reducing the computational load of attention control.
arXiv Detail & Related papers (2024-08-10T08:53:41Z)
- DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
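As a toy illustration of this early-step blending (not DiffUHaul's actual code), one can linearly interpolate source and target attention features with a weight that decays over the early denoising steps; the schedule and the `early_frac` parameter below are assumptions.

```python
import torch

def blend_attn_features(feat_src: torch.Tensor, feat_tgt: torch.Tensor,
                        step: int, total_steps: int,
                        early_frac: float = 0.3) -> torch.Tensor:
    """Interpolate source/target attention features during early steps only.

    `early_frac` (hypothetical) controls how long the source appearance is
    mixed in; after that cutoff, the target features are used unchanged.
    """
    cutoff = early_frac * total_steps
    if step >= cutoff:
        return feat_tgt
    w = 1.0 - step / cutoff          # decays from 1 to 0 over early steps
    return w * feat_src + (1.0 - w) * feat_tgt
```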
arXiv Detail & Related papers (2024-06-03T17:59:53Z)
- LoMOE: Localized Multi-Object Editing via Multi-Diffusion [8.90467024388923]
We introduce a novel framework for zero-shot localized multi-object editing through a multi-diffusion process.
Our approach leverages foreground masks and corresponding simple text prompts that exert localized influences on the target regions.
A combination of cross-attention and background losses within the latent space ensures that the characteristics of the object being edited are preserved.
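The multi-diffusion idea can be sketched generically: each foreground mask and its prompt steer a region-specific noise prediction, and the masked predictions are blended with a background prediction. The sketch below is a generic multi-diffusion-style step under assumed names, not LoMOE's implementation.

```python
import torch

def multi_diffusion_step(denoiser, z, t, region_embeds, masks, base_embed):
    """Blend per-region noise predictions under foreground masks (sketch).

    Each (mask, embedding) pair exerts a localized influence on its region;
    the background follows the base prompt. `denoiser` and the simple
    mask-blending rule are illustrative assumptions.
    """
    pred = denoiser(z, t, base_embed)             # background prediction
    for mask, embed in zip(masks, region_embeds):
        region_pred = denoiser(z, t, embed)       # region-specific prediction
        pred = mask * region_pred + (1 - mask) * pred
    return pred
```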
arXiv Detail & Related papers (2024-03-01T10:46:47Z)
- MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing [54.712205852602736]
We develop MasaCtrl, a tuning-free method to achieve consistent image generation and complex non-rigid image editing simultaneously.
Specifically, MasaCtrl converts existing self-attention in diffusion models into mutual self-attention, so that it can query correlated local contents and textures from source images for consistency.
Extensive experiments show that the proposed MasaCtrl can produce impressive results in both consistent image generation and complex non-rigid real image editing.
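The mutual self-attention idea above can be sketched compactly: the target branch keeps its own queries but attends over the source image's keys and values. The code below is a generic rendering of that idea, assuming standard multi-head attention shapes, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mutual_self_attention(q_tgt: torch.Tensor, k_src: torch.Tensor,
                          v_src: torch.Tensor, num_heads: int) -> torch.Tensor:
    """Target queries attend over SOURCE keys/values (sketch of the idea).

    q_tgt comes from the image being generated or edited; k_src and v_src
    come from the source image's self-attention layer, so correlated local
    content and texture are queried from the source for consistency.
    """
    b, n, d = q_tgt.shape
    h, hd = num_heads, d // num_heads

    def split_heads(x: torch.Tensor) -> torch.Tensor:
        return x.view(b, -1, h, hd).transpose(1, 2)   # (b, h, seq, hd)

    out = F.scaled_dot_product_attention(
        split_heads(q_tgt), split_heads(k_src), split_heads(v_src))
    return out.transpose(1, 2).reshape(b, n, d)
```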
arXiv Detail & Related papers (2023-04-17T17:42:19Z)