Towards Generalized Multi-Image Editing for Unified Multimodal Models
- URL: http://arxiv.org/abs/2601.05572v1
- Date: Fri, 09 Jan 2026 06:42:49 GMT
- Title: Towards Generalized Multi-Image Editing for Unified Multimodal Models
- Authors: Pengcheng Xu, Peng Tang, Donghao Luo, Xiaobin Hu, Weichu Cui, Qingdong He, Zhennan Chen, Jiangning Zhang, Charles Ling, Boyu Wang
- Abstract summary: Unified Multimodal Models (UMMs) integrate multimodal understanding and generation. UMMs struggle to maintain visual consistency and to disambiguate visual cues when referencing details across multiple input images. We propose a scalable multi-image editing framework for UMMs that explicitly distinguishes image identities and generalizes to variable input counts.
- Score: 56.620038824933566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unified Multimodal Models (UMMs) integrate multimodal understanding and generation, yet they struggle to maintain visual consistency and to disambiguate visual cues when referencing details across multiple input images. In this work, we propose a scalable multi-image editing framework for UMMs that explicitly distinguishes image identities and generalizes to variable input counts. Algorithmically, we introduce two innovations: 1) Learnable latent separators explicitly differentiate each reference image in the latent space, enabling accurate and disentangled conditioning. 2) A sinusoidal index encoding assigns the visual tokens of each image a continuous sinusoidal index embedding, which provides explicit image identity while allowing generalization and extrapolation to a variable number of inputs. To facilitate training and evaluation, we establish a high-fidelity benchmark using an inverse dataset-construction methodology that guarantees artifact-free, achievable outputs. Experiments show clear improvements in semantic consistency, visual fidelity, and cross-image integration over prior baselines on diverse multi-image editing tasks, validating the framework's advantages in consistency and generalization.
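Both mechanisms are simple to express in code. Below is a minimal sketch of how the sinusoidal index encoding and the learnable latent separators might be wired together; the module name, tensor layout, and the choice to add the index embedding to every token of an image are assumptions, since the paper's implementation is not reproduced in this listing.

```python
import torch
import torch.nn as nn


def sinusoidal_index_embedding(index: torch.Tensor, dim: int) -> torch.Tensor:
    """Continuous sinusoidal embedding of an image index, using the
    standard Transformer positional-encoding formula applied per image
    rather than per token position. Assumes dim is even."""
    half = dim // 2
    freqs = torch.exp(-torch.arange(half, dtype=torch.float32)
                      * (torch.log(torch.tensor(10000.0)) / half))
    angles = index.float().unsqueeze(-1) * freqs              # (..., half)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)    # (..., dim)


class MultiImageConditioner(nn.Module):
    """Hypothetical conditioning module: tags every visual token of
    image i with the same sinusoidal index embedding and appends a
    learnable separator token after each image's token block."""

    def __init__(self, dim: int):
        super().__init__()
        self.separator = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.dim = dim

    def forward(self, images_tokens: list[torch.Tensor]) -> torch.Tensor:
        # images_tokens: one (batch, n_tokens, dim) tensor per input image
        pieces = []
        for i, tokens in enumerate(images_tokens):
            idx_emb = sinusoidal_index_embedding(torch.tensor(i), self.dim)
            pieces.append(tokens + idx_emb)  # shared identity for all of image i's tokens
            pieces.append(self.separator.expand(tokens.shape[0], -1, -1))
        return torch.cat(pieces, dim=1)      # single conditioned token sequence
```

Because the index embedding is a continuous function of the image index, indices unseen during training still map to well-formed embeddings, which is what enables the extrapolation to variable input counts claimed in the abstract.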
Related papers
- MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models [89.89575486159795]
We introduce MICON-Bench, a benchmark for multi-image context generation. We propose an MLLM-driven Evaluation-by-Checkpoint framework for automatic verification of semantic and visual consistency. We also present Dynamic Attention Rebalancing (DAR), a training-free, plug-and-play mechanism that dynamically adjusts attention during inference to enhance coherence and reduce hallucinations.
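The abstract does not specify how DAR adjusts attention; one plausible training-free reading is to bias the pre-softmax attention logits of each reference image's key tokens so that under-attended references are boosted. A speculative sketch under that assumption (all names and the log-weight bias are hypothetical):

```python
import torch


def rebalance_attention(logits: torch.Tensor,
                        token_image_ids: torch.Tensor,
                        weights: dict[int, float]) -> torch.Tensor:
    """Training-free rebalancing of attention at inference time by adding
    a log-weight bias to the pre-softmax logits of each reference image's
    key tokens (a speculative reading of DAR, not the paper's code).

    logits:          (batch, heads, queries, keys) pre-softmax scores
    token_image_ids: (keys,) image index of each key token
    weights:         per-image multiplier; w > 1 boosts that image's share
    """
    biased = logits.clone()
    for image_id, w in weights.items():
        mask = token_image_ids == image_id
        biased[..., mask] += torch.log(torch.tensor(float(w)))
    return torch.softmax(biased, dim=-1)
```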
arXiv Detail & Related papers (2026-02-23T04:32:52Z)
- MAUGen: A Unified Diffusion Approach for Multi-Identity Facial Expression and AU Label Generation [18.996319133901473]
We propose MAUGen, a diffusion-based multi-modal framework that jointly generates a large collection of photorealistic facial expressions. Under this framework, we introduce Multi-Identity Facial Action (MIFA), a large-scale multimodal synthetic dataset featuring comprehensive AU annotations and identity variations.
arXiv Detail & Related papers (2026-01-31T07:56:22Z)
- Aggregating Diverse Cue Experts for AI-Generated Image Detection [34.80872382508184]
We introduce the Multi-Cue Aggregation Network (MCAN), a novel framework that integrates different yet complementary cues in a unified network. MCAN employs a mixture-of-encoders adapter to dynamically process these cues, enabling more adaptive and robust feature representation. On the GenImage dataset, MCAN outperforms the best state-of-the-art method by up to 7.4% in average accuracy (ACC) across eight different image generators.
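A mixture-of-encoders adapter can be read as a gating network that weights the outputs of several cue-specific encoders. A minimal sketch under that reading (the gate design and mean-pooled gating input are assumptions, not MCAN's published architecture):

```python
import torch
import torch.nn as nn


class MixtureOfEncodersAdapter(nn.Module):
    """Illustrative mixture-of-encoders: a softmax gate weights the
    features produced by several cue-specific encoders. Only the
    'mixture-of-encoders adapter' idea comes from the abstract; the
    gate design here is an assumption."""

    def __init__(self, encoders: nn.ModuleList, feat_dim: int):
        super().__init__()
        self.encoders = encoders                       # each maps x -> (batch, feat_dim)
        self.gate = nn.Linear(feat_dim, len(encoders))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.stack([enc(x) for enc in self.encoders], dim=1)  # (B, E, D)
        gate_logits = self.gate(feats.mean(dim=1))                     # (B, E)
        weights = torch.softmax(gate_logits, dim=-1)
        return (weights.unsqueeze(-1) * feats).sum(dim=1)              # (B, D)
```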
arXiv Detail & Related papers (2026-01-13T18:23:42Z)
- TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models [96.41974190202642]
Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder.
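The cascade itself is straightforward to picture: the VAE encoder yields continuous latents, and a second representation encoder lifts those latents into the token space shared by understanding and generation. A skeletal sketch, with both submodules left as placeholders since TUNA's internals are not described in this listing:

```python
import torch
import torch.nn as nn


class CascadedVisualEncoder(nn.Module):
    """Skeleton of the cascade described for TUNA: a VAE encoder yields
    continuous latents, and a representation encoder maps them into the
    unified token space. Both submodules are placeholders here."""

    def __init__(self, vae_encoder: nn.Module, repr_encoder: nn.Module):
        super().__init__()
        self.vae_encoder = vae_encoder
        self.repr_encoder = repr_encoder

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        latents = self.vae_encoder(image)   # continuous VAE latents
        return self.repr_encoder(latents)   # shared representation for understanding and generation
```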
arXiv Detail & Related papers (2025-12-01T18:59:51Z)
- LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer [32.9330637921386]
LAMIC is a Layout-Aware Multi-Image Composition framework. It extends single-reference diffusion models to multi-reference scenarios in a training-free manner. It consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R, and AVG scores across all settings.
arXiv Detail & Related papers (2025-08-01T09:51:54Z)
- Omni-ID: Holistic Identity Representation Designed for Generative Tasks [75.29174595706533]
Omni-ID encodes holistic information about an individual's appearance across diverse expressions. It consolidates information from a varied number of unstructured input images into a structured representation. It demonstrates substantial improvements over conventional representations across various generative tasks.
arXiv Detail & Related papers (2024-12-12T19:21:20Z)
- EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM [38.8308841469793]
This paper introduces EasyRef, a novel plug-and-play adaptation method that enables diffusion models to be conditioned on multiple reference images and a text prompt. We leverage the multi-image comprehension and instruction-following capabilities of a multimodal large language model (MLLM) to exploit consistent visual elements within multiple images. Experimental results demonstrate that EasyRef surpasses both tuning-free methods such as IP-Adapter and tuning-based methods such as LoRA, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.
arXiv Detail & Related papers (2024-12-12T18:59:48Z)
- Unity in Diversity: Multi-expert Knowledge Confrontation and Collaboration for Generalizable Vehicle Re-identification [60.20318058777603]
Generalizable vehicle re-identification (ReID) seeks to develop models that can adapt to unknown target domains without the need for fine-tuning or retraining. Previous works have mainly focused on extracting domain-invariant features by aligning data distributions between source domains. We propose a two-stage Multi-expert Knowledge Confrontation and Collaboration (MiKeCoCo) method to solve this unique problem.
arXiv Detail & Related papers (2024-07-10T04:06:39Z)
- MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching [65.87255122130188]
We propose a Multi-view Attention Method (MVAM) for image-text matching. We also incorporate an objective to explicitly encourage attention heads to focus on distinct aspects of the input data. Our method allows models to encode images and text from different perspectives and focus on more critical details, leading to better matching performance.
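One common way to encourage attention heads to cover distinct aspects is to penalize the pairwise similarity of their attention maps. A hedged sketch of such a diversity penalty (one plausible form; MVAM's exact objective may differ):

```python
import torch


def head_diversity_penalty(attn: torch.Tensor) -> torch.Tensor:
    """Penalty pushing attention heads toward distinct content: the
    off-diagonal mass of the cosine-similarity Gram matrix between
    per-head attention maps (one plausible form; MVAM's published
    objective may differ).

    attn: (batch, heads, queries, keys) attention weights
    """
    b, h, q, k = attn.shape
    flat = attn.reshape(b, h, q * k)
    flat = flat / (flat.norm(dim=-1, keepdim=True) + 1e-8)  # unit-normalize each head
    gram = flat @ flat.transpose(1, 2)                       # (B, H, H) similarities
    eye = torch.eye(h, device=attn.device).expand(b, h, h)
    return ((gram - eye) ** 2).mean()                        # zero iff heads are orthogonal
```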
arXiv Detail & Related papers (2024-02-27T06:11:54Z)
- Multi-Image Summarization: Textual Summary from a Set of Cohesive Images [17.688344968462275]
This paper proposes the new task of multi-image summarization.
It aims to generate a concise and descriptive textual summary given a coherent set of input images.
A dense average image feature aggregation network allows the model to focus on a coherent subset of attributes.
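Read plainly, "dense average image feature aggregation" suggests masked mean pooling over per-image features so the summary conditions on the whole set regardless of its size. A minimal sketch under that reading (the masking convention is an assumption):

```python
import torch


def aggregate_image_features(feats: torch.Tensor,
                             mask: torch.Tensor) -> torch.Tensor:
    """Masked mean pooling over a variable-size set of per-image features
    (a plain reading of 'dense average image feature aggregation'; the
    paper may add learned projections around this step).

    feats: (batch, n_images, dim) one feature vector per input image
    mask:  (batch, n_images) 1.0 for real images, 0.0 for padding
    """
    mask = mask.unsqueeze(-1).float()                         # (B, N, 1)
    return (feats * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
```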
arXiv Detail & Related papers (2020-06-15T18:45:35Z)