MUSAR: Exploring Multi-Subject Customization from Single-Subject Dataset via Attention Routing
- URL: http://arxiv.org/abs/2505.02823v1
- Date: Mon, 05 May 2025 17:50:24 GMT
- Title: MUSAR: Exploring Multi-Subject Customization from Single-Subject Dataset via Attention Routing
- Authors: Zinan Guo, Pengze Zhang, Yanze Wu, Chong Mou, Songtao Zhao, Qian He,
- Abstract summary: MUSAR is a framework that achieves robust multi-subject customization while requiring only single-subject training data. It constructs diptych training pairs from single-subject images to facilitate multi-subject learning, while actively correcting the distribution bias introduced by diptych construction. Experiments demonstrate that MUSAR outperforms existing methods, even those trained on multi-subject datasets.
- Score: 14.88610127301938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current multi-subject customization approaches encounter two critical challenges: the difficulty of acquiring diverse multi-subject training data, and attribute entanglement across different subjects. To bridge these gaps, we propose MUSAR, a simple yet effective framework that achieves robust multi-subject customization while requiring only single-subject training data. First, to break the data limitation, we introduce debiased diptych learning. It constructs diptych training pairs from single-subject images to facilitate multi-subject learning, while actively correcting the distribution bias introduced by diptych construction via static attention routing and a dual-branch LoRA. Second, to eliminate cross-subject entanglement, we introduce a dynamic attention routing mechanism, which adaptively establishes bijective mappings between regions of the generated image and the conditional subjects. This design not only decouples multi-subject representations but also maintains scalable generalization performance as the number of reference subjects increases. Comprehensive experiments demonstrate that MUSAR outperforms existing methods, even those trained on multi-subject datasets, in image quality, subject consistency, and interaction naturalness, despite requiring only a single-subject dataset.
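One way to picture the dynamic attention routing described in the abstract is as a hard gate on cross-attention: each image query token is assigned to at most one reference subject, and keys belonging to all other subjects are masked out before attention is applied. The snippet below is a minimal sketch under that reading; the function name, tensor shapes, and the per-token argmax routing rule are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of subject-routed cross-attention (assumed behavior,
# not the official MUSAR code). Each image query token is routed to the
# reference subject it attends to most strongly; other subjects are masked.
import torch
import torch.nn.functional as F

def routed_cross_attention(q, k, v, subject_ids):
    """
    q:           (B, Nq, D)  image query tokens
    k, v:        (B, Nk, D)  tokens from the concatenated reference subjects
    subject_ids: (Nk,)       integer id of the subject each key token belongs to
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-1, -2) / d**0.5                # (B, Nq, Nk)

    # Aggregate attention mass per subject and pick, for every query token,
    # the single subject it attends to most strongly (hard routing choice).
    num_subjects = int(subject_ids.max().item()) + 1
    one_hot = F.one_hot(subject_ids, num_subjects).float()   # (Nk, S)
    per_subject = scores.softmax(-1) @ one_hot                # (B, Nq, S)
    chosen = per_subject.argmax(-1)                           # (B, Nq)

    # Mask out keys belonging to every other subject, then attend normally.
    allowed = chosen.unsqueeze(-1) == subject_ids.view(1, 1, -1)  # (B, Nq, Nk)
    scores = scores.masked_fill(~allowed, float("-inf"))
    return scores.softmax(-1) @ v

if __name__ == "__main__":
    # Toy usage: two reference subjects with four key tokens each.
    B, Nq, Nk, D = 1, 16, 8, 32
    q, k, v = torch.randn(B, Nq, D), torch.randn(B, Nk, D), torch.randn(B, Nk, D)
    subject_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
    out = routed_cross_attention(q, k, v, subject_ids)
    print(out.shape)  # torch.Size([1, 16, 32])
```

Because each query token only sees keys from its routed subject, attributes of one subject cannot leak into regions assigned to another, which is the decoupling effect the abstract attributes to dynamic attention routing.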
Related papers
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across the MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models [123.17643568298116]
We present TAViS, a novel framework that couples the knowledge of multimodal foundation models for cross-modal alignment. Effectively combining these models poses two key challenges: the difficulty of transferring knowledge between SAM2 and ImageBind due to their different feature spaces, and the insufficiency of using only a segmentation loss for supervision. Our approach achieves superior performance on single-source, multi-source, and semantic datasets, and excels in zero-shot settings.
arXiv Detail & Related papers (2025-06-13T03:19:47Z) - FedRSClip: Federated Learning for Remote Sensing Scene Classification Using Vision-Language Models [23.830133838392964]
We propose FedRSCLIP, the first federated learning framework for remote sensing image classification based on a VLM, specifically CLIP. FedRSCLIP addresses the challenges of data heterogeneity and large-scale model transmission in federated environments by introducing Prompt Learning. To validate the effectiveness of our proposed model, we construct a Fed-RSIC dataset based on three existing remote sensing image classification datasets.
arXiv Detail & Related papers (2025-01-05T07:10:27Z) - Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework [58.362064122489166]
This paper introduces the Cross-modal Few-Shot Learning task, which aims to recognize instances across multiple modalities while relying on scarce labeled data. We propose a Generative Transfer Learning (GTL) framework that simulates how humans abstract and generalize concepts. We show that GTL achieves state-of-the-art performance on seven multi-modal datasets spanning RGB-Sketch, RGB-Infrared, and RGB-Depth.
arXiv Detail & Related papers (2024-10-14T16:09:38Z) - CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization [27.114395240088562]
We argue that an ideal subject representation can be achieved from a cross-differential perspective, i.e., by decoupling subject-intrinsic attributes from irrelevant attributes via contrastive learning. Specifically, we propose CustomContrast, a novel framework that includes a Multilevel Contrastive Learning paradigm and Multimodal Feature Injection (MFI). Extensive experiments show the effectiveness of CustomContrast in subject similarity and text controllability.
arXiv Detail & Related papers (2024-09-09T13:39:47Z) - Unity in Diversity: Multi-expert Knowledge Confrontation and Collaboration for Generalizable Vehicle Re-identification [60.20318058777603]
Generalizable vehicle re-identification (ReID) seeks to develop models that can adapt to unknown target domains without the need for fine-tuning or retraining. Previous works have mainly focused on extracting domain-invariant features by aligning data distributions between source domains. We propose a two-stage Multi-expert Knowledge Confrontation and Collaboration (MiKeCoCo) method to solve this unique problem.
arXiv Detail & Related papers (2024-07-10T04:06:39Z) - MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance [5.452759083801634]
This research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multiple subjects. The proposed multi-subject cross-attention orchestrates inter-subject compositions while preserving control from the text prompt.
arXiv Detail & Related papers (2024-06-11T12:32:53Z) - Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models [66.05234562835136]
We present MuDI, a novel framework that enables multi-subject personalization.
Our main idea is to utilize subjects segmented by a foundation segmentation model.
Experimental results show that our MuDI can produce high-quality personalized images without identity mixing.
arXiv Detail & Related papers (2024-04-05T17:45:22Z) - Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation [60.943159830780154]
We introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process.
We demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
arXiv Detail & Related papers (2024-03-25T17:52:07Z) - Cones 2: Customizable Image Synthesis with Multiple Subjects [50.54010141032032]
We study how to efficiently represent a particular subject as well as how to appropriately compose different subjects.
By rectifying the activations in the cross-attention map, the layout specifies and separates the locations of the different subjects in the image.
arXiv Detail & Related papers (2023-05-30T18:00:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.