VMDiff: Visual Mixing Diffusion for Limitless Cross-Object Synthesis
- URL: http://arxiv.org/abs/2509.23605v1
- Date: Sun, 28 Sep 2025 03:17:58 GMT
- Title: VMDiff: Visual Mixing Diffusion for Limitless Cross-Object Synthesis
- Authors: Zeren Xiong, Yue Yu, Zedong Zhang, Shuo Chen, Jian Yang, Jun Li
- Abstract summary: We propose a diffusion-based framework that synthesizes a single, coherent object by integrating two input images at both noise and latent levels. Our method outperforms strong baselines in visual quality, semantic consistency, and human-rated creativity.
- Score: 23.50866105623598
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Creating novel images by fusing visual cues from multiple sources is a fundamental yet underexplored problem in image-to-image generation, with broad applications in artistic creation, virtual reality and visual media. Existing methods often face two key challenges: coexistent generation, where multiple objects are simply juxtaposed without true integration, and bias generation, where one object dominates the output due to semantic imbalance. To address these issues, we propose Visual Mixing Diffusion (VMDiff), a simple yet effective diffusion-based framework that synthesizes a single, coherent object by integrating two input images at both noise and latent levels. Our approach comprises: (1) a hybrid sampling process that combines guided denoising, inversion, and spherical interpolation with adjustable parameters to achieve structure-aware fusion, mitigating coexistent generation; and (2) an efficient adaptive adjustment module, which introduces a novel similarity-based score to automatically and adaptively search for optimal parameters, countering semantic bias. Experiments on a curated benchmark of 780 concept pairs demonstrate that our method outperforms strong baselines in visual quality, semantic consistency, and human-rated creativity.
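The abstract describes fusing two inputs at both the noise and the latent level via spherical interpolation with adjustable parameters, followed by a similarity-guided search over those parameters. Below is a minimal sketch of what such a mixing step could look like; the function names, the single scalar weight `alpha`, and the brute-force candidate search are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def slerp(z_a: torch.Tensor, z_b: torch.Tensor, alpha: float) -> torch.Tensor:
    """Spherical interpolation between two noise/latent tensors.

    alpha=0 returns z_a, alpha=1 returns z_b. The names and the single
    scalar mixing weight are assumptions for illustration.
    """
    a, b = z_a.flatten().float(), z_b.flatten().float()
    a_n, b_n = a / a.norm(), b / b.norm()
    omega = torch.acos((a_n * b_n).sum().clamp(-1.0, 1.0))
    so = torch.sin(omega)
    if so.abs() < 1e-6:                      # nearly parallel: linear fallback
        mixed = (1.0 - alpha) * a + alpha * b
    else:
        mixed = (torch.sin((1.0 - alpha) * omega) / so) * a \
              + (torch.sin(alpha * omega) / so) * b
    return mixed.reshape(z_a.shape).to(z_a.dtype)

def search_mix_weight(noise_a, noise_b, candidates, score_fn):
    """Pick the mixing weight whose fused noise scores highest under a
    similarity-based criterion (score_fn is a stand-in for the paper's
    adaptive adjustment module)."""
    return max(candidates, key=lambda a: score_fn(slerp(noise_a, noise_b, a)))
```

In the method as described, the same interpolation would also be applied at the latent level and interleaved with guided denoising and inversion steps, which this sketch omits.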
Related papers
- Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers [55.15722080205737]
Edit2Perceive is a unified diffusion framework that adapts editing models for depth, normal, and matting. Our single-step deterministic inference yields faster runtime while training on relatively small datasets.
arXiv Detail & Related papers (2025-11-24T01:13:51Z) - Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching [31.42132290162457]
We introduce a new framework called IMD (Image feature Matching with a pre-trained Diffusion model) with two parts. Unlike the dominant solutions that employ contrastive-learning-based foundation models emphasizing global semantics, we integrate generative diffusion models. Our proposed IMD establishes a new state of the art on commonly evaluated benchmarks, and a 12% improvement in IMIM indicates that our method efficiently mitigates the misalignment.
arXiv Detail & Related papers (2025-07-14T14:28:15Z) - Consistent Human Image and Video Generation with Spatially Conditioned Diffusion [82.4097906779699]
Consistent human-centric image and video synthesis aims to generate images with new poses while preserving appearance consistency with a given reference image. We frame the task as a spatially-conditioned inpainting problem, where the target image is inpainted to maintain appearance consistency with the reference. This approach enables the reference features to guide the generation of pose-compliant targets within a unified denoising network.
arXiv Detail & Related papers (2024-12-19T05:02:30Z) - DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion [35.60459492849359]
We study the problem of generating intermediate images from image pairs with large motion.
Due to the large motion, the intermediate semantic information may be absent in input images.
We propose DreamMover, a novel image interpolation framework with three main components.
arXiv Detail & Related papers (2024-09-15T04:09:12Z) - Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media [34.664388374279596]
We propose a Similarity-Guided Multimodal Fusion Transformer (SG-MFT) for predicting the semantic locations of users from their multimodal posts.
First, we incorporate high-quality text and image representations by utilizing a pre-trained large vision-language model.
We then devise a Similarity-Guided Interaction Module (SIM) to alleviate modality heterogeneity and noise interference.
arXiv Detail & Related papers (2024-05-09T13:32:26Z) - OneActor: Consistent Character Generation via Cluster-Conditioned Guidance [29.426558840522734]
We propose a novel one-shot tuning paradigm, termed OneActor.
It efficiently performs consistent subject generation solely driven by prompts.
Our method is capable of multi-subject generation and compatible with popular diffusion extensions.
arXiv Detail & Related papers (2024-04-16T03:45:45Z) - Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models [58.46926334842161]
This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps.
We propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores.
Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability.
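This entry names two objectives: a Separate loss that reduces object mask overlaps and an Enhance loss that maximizes attention activation scores. A minimal sketch of how such objectives might be written over per-object cross-attention maps follows; the product-overlap and peak-activation formulations are assumptions for illustration, not the paper's exact losses.

```python
import torch

def separate_loss(attn_a: torch.Tensor, attn_b: torch.Tensor) -> torch.Tensor:
    """Penalize spatial overlap between two objects' cross-attention maps.

    attn_a / attn_b: (H, W) maps normalized to [0, 1]. The product-overlap
    form is an assumed stand-in for the paper's Separate loss.
    """
    return (attn_a * attn_b).mean()

def enhance_loss(attn_maps: torch.Tensor) -> torch.Tensor:
    """Encourage each object's attention map to reach a high peak activation.

    attn_maps: (num_objects, H, W). Driving per-object max activation toward
    1 is an assumed reading of the Enhance loss described in the abstract.
    """
    peaks = attn_maps.flatten(1).max(dim=1).values
    return (1.0 - peaks).mean()
```

A training step would minimize a weighted sum of the two terms while finetuning only the critical parameters, consistent with the scalability point made above.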
arXiv Detail & Related papers (2023-12-10T22:07:42Z) - Layered Rendering Diffusion Model for Controllable Zero-Shot Image Synthesis [15.76266032768078]
This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries. We first introduce vision guidance as a foundational spatial cue within the perturbed distribution. We then propose a universal framework, Layered Rendering Diffusion (LRDiff), which constructs an image-rendering process with multiple layers.
arXiv Detail & Related papers (2023-11-30T10:36:19Z) - Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z) - Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z) - Single Stage Virtual Try-on via Deformable Attention Flows [51.70606454288168]
Virtual try-on aims to generate a photo-realistic fitting result given an in-shop garment and a reference person image.
We develop a novel Deformable Attention Flow (DAFlow) which applies the deformable attention scheme to multi-flow estimation.
Our proposed method achieves state-of-the-art performance both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-07-19T10:01:31Z) - Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks. Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches. We propose a novel framework based on DDPM for semantic image synthesis.
arXiv Detail & Related papers (2022-06-30T18:31:51Z)