MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation
- URL: http://arxiv.org/abs/2512.22310v1
- Date: Fri, 26 Dec 2025 09:29:30 GMT
- Title: MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation
- Authors: Run Ling, Ke Cao, Jian Lu, Ao Ma, Haowei Liu, Runze He, Changwei Wang, Rongtao Xu, Yihua Shao, Zhanjie Zhang, Peng Wu, Guibing Guo, Wei Feng, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Xingwei Wang
- Abstract summary: MoFu is a unified framework that tackles scale inconsistency and permutation sensitivity. MoFu significantly outperforms existing methods in preserving natural scale, subject fidelity, and overall visual quality.
- Score: 48.45457225939052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-subject video generation aims to synthesize videos from textual prompts and multiple reference images, ensuring that each subject preserves natural scale and visual fidelity. However, current methods face two challenges: scale inconsistency, where variations in subject size lead to unnatural generation, and permutation sensitivity, where the order of reference inputs causes subject distortion. In this paper, we propose MoFu, a unified framework that tackles both challenges. For scale inconsistency, we introduce Scale-Aware Modulation (SMO), an LLM-guided module that extracts implicit scale cues from the prompt and modulates features to ensure consistent subject sizes. To address permutation sensitivity, we present a simple yet effective Fourier Fusion strategy that processes the frequency information of reference features via the Fast Fourier Transform to produce a unified representation. Besides, we design a Scale-Permutation Stability Loss to jointly encourage scale-consistent and permutation-invariant generation. To further evaluate these challenges, we establish a dedicated benchmark with controlled variations in subject scale and reference permutation. Extensive experiments demonstrate that MoFu significantly outperforms existing methods in preserving natural scale, subject fidelity, and overall visual quality.
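The abstract gives no implementation details, so the PyTorch sketch below only illustrates the two core ideas. `ScaleModulation` is a hypothetical FiLM-style reading of Scale-Aware Modulation (an LLM-derived scale embedding produces per-channel scale and shift), and `fourier_fusion` shows one plausible way an FFT-based fusion of reference features can be made permutation-invariant; all names, shapes, and the averaging rule are assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class ScaleModulation(nn.Module):
    """Hypothetical FiLM-style reading of Scale-Aware Modulation (SMO):
    an LLM-derived scale embedding is mapped to per-channel scale/shift
    parameters that modulate the subject features."""
    def __init__(self, embed_dim, channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(embed_dim, 2 * channels)

    def forward(self, feats, scale_emb):
        # feats: [B, C, H, W]; scale_emb: [B, embed_dim] from the prompt
        gamma, beta = self.to_gamma_beta(scale_emb).chunk(2, dim=-1)
        return feats * (1 + gamma[:, :, None, None]) + beta[:, :, None, None]

def fourier_fusion(ref_feats):
    """One plausible permutation-invariant Fourier Fusion: compute each
    subject's 2D spectrum and average the spectra over the subject axis.
    Averaging is symmetric in its inputs, so any reordering of the
    reference images yields the same fused representation."""
    stacked = torch.stack(ref_feats, dim=0)   # [N, C, H, W]
    spectra = torch.fft.fft2(stacked)         # complex per-subject spectra
    fused = torch.fft.ifft2(spectra.mean(dim=0)).real
    return fused                              # [C, H, W]
```

Because the mean over the subject axis is symmetric in its arguments, permuting the reference list cannot change the fused output, which is precisely the permutation invariance the paper's Scale-Permutation Stability Loss is said to encourage.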
Related papers
- Transformer Modeling for Both Scalability and Performance in Multivariate Time Series [0.0]
We propose a transformer with Delegate Token Attention (DELTAformer) to constrain inter-variable modeling. Our results show that DELTAformer scales linearly with variable count while actually outperforming standard transformers.
arXiv Detail & Related papers (2025-09-23T18:28:24Z)
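The summary suggests a bottleneck in which variables interact only through a small set of delegate tokens. The sketch below is a hedged guess at such a mechanism, assuming standard multi-head cross-attention; the class name, routing, and delegate count are illustrative, not the published DELTAformer design.

```python
import torch
import torch.nn as nn

class DelegateAttention(nn.Module):
    """Hypothetical delegate-token attention: V variables exchange
    information only through K << V learned delegate tokens, so the
    cost is O(V*K) rather than O(V^2)."""
    def __init__(self, dim, n_delegates=8, n_heads=4):
        super().__init__()
        self.delegates = nn.Parameter(torch.randn(n_delegates, dim))
        self.gather = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.scatter = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x):                              # x: [B, V, dim]
        d = self.delegates.expand(x.size(0), -1, -1)   # [B, K, dim]
        d, _ = self.gather(d, x, x)    # delegates read from variables
        out, _ = self.scatter(x, d, d) # variables read from delegates
        return out
```

With K fixed, the cost grows as O(V·K), i.e., linearly in the number of variables V, consistent with the claimed linear scaling.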
- Learnable Multi-Scale Wavelet Transformer: A Novel Alternative to Self-Attention [0.0]
Learnable Multi-Scale Wavelet Transformer (LMWT) is a novel architecture that replaces the standard dot-product self-attention. We present the detailed mathematical formulation of the learnable Haar wavelet module and its integration into the transformer framework. Our results indicate that the LMWT achieves competitive performance while offering substantial computational advantages.
arXiv Detail & Related papers (2025-04-08T22:16:54Z)
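As a toy sketch of a learnable Haar wavelet module standing in for self-attention, the mixer below applies a one-level orthonormal Haar transform along the token axis, mixes the low and high bands with learnable linear maps, and inverts the transform. The real LMWT module is multi-scale and certainly differs in detail; this only illustrates the mechanism.

```python
import torch
import torch.nn as nn

class LearnableHaarMix(nn.Module):
    """Toy one-level learnable Haar wavelet token mixer (an assumption,
    not the LMWT paper's exact module)."""
    def __init__(self, dim):
        super().__init__()
        self.low = nn.Linear(dim, dim)
        self.high = nn.Linear(dim, dim)

    def forward(self, x):                 # x: [B, 2T, dim], even length
        a, b = x[:, 0::2], x[:, 1::2]
        lo = (a + b) / 2 ** 0.5           # Haar analysis: low band
        hi = (a - b) / 2 ** 0.5           # Haar analysis: high band
        lo, hi = self.low(lo), self.high(hi)  # learnable band mixing
        y = torch.empty_like(x)
        y[:, 0::2] = (lo + hi) / 2 ** 0.5     # Haar synthesis
        y[:, 1::2] = (lo - hi) / 2 ** 0.5
        return y
```

When both linear maps are the identity, the synthesis step exactly reconstructs the input, so the module degrades gracefully to a no-op.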
- VarGes: Improving Variation in Co-Speech 3D Gesture Generation via StyleCLIPS [4.996271098355553]
VarGes is a novel variation-driven framework designed to enhance co-speech gesture generation. Our approach is validated on benchmark datasets, where it outperforms existing methods in terms of gesture diversity and naturalness.
arXiv Detail & Related papers (2025-02-15T08:46:01Z)
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation [71.24909962718128]
We present Show-o, a single unified transformer for multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities.
arXiv Detail & Related papers (2024-08-22T16:32:32Z)
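One way to picture a single transformer serving both regimes is through its attention mask: causal over autoregressive text tokens, bidirectional over image tokens being denoised. The helper below is purely illustrative of that idea and is not Show-o's published masking scheme.

```python
import torch

def mixed_attention_mask(n_text, n_image):
    """Mask for a sequence laid out as [text tokens | image tokens]:
    text attends causally (autoregressive); image tokens attend to all
    text and to each other bidirectionally (diffusion-style denoising).
    True = attention allowed. Illustrative only."""
    n = n_text + n_image
    mask = torch.zeros(n, n, dtype=torch.bool)
    # causal block for text-to-text
    mask[:n_text, :n_text] = torch.tril(torch.ones(n_text, n_text)).bool()
    # image tokens see all text and all image tokens
    mask[n_text:, :] = True
    return mask
```

Note that a standard layer such as nn.MultiheadAttention uses the opposite convention for boolean masks (True = disallowed), so one would pass the logical negation `~mask`.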
- No Re-Train, More Gain: Upgrading Backbones with Diffusion model for Pixel-Wise and Weakly-Supervised Few-Shot Segmentation [22.263029309151467]
Few-Shot Segmentation (FSS) aims to segment novel classes using only a few annotated images. Current FSS methods face several issues, including the inflexibility of backbone upgrades without re-training and the inability to uniformly handle various types of annotations. We propose DiffUp, a novel framework that conceptualizes the FSS task as a conditional generative problem using a diffusion process.
arXiv Detail & Related papers (2024-07-23T05:09:07Z)
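As a hedged illustration of "FSS as a conditional generative problem with a diffusion process", the snippet below writes one generic DDPM-style training step that denoises a soft segmentation mask conditioned on support features; the schedule, the `denoiser` callable, and all shapes are assumptions, not DiffUp's formulation.

```python
import torch
import torch.nn.functional as F

def diffusion_mask_loss(denoiser, mask, cond, T=1000):
    """One generic conditional-diffusion training step (illustrative).
    mask: [B, 1, H, W] soft mask in [0, 1]; cond: [B, C, H, W] features;
    denoiser: assumed callable (noisy_mask, cond, t) -> predicted noise."""
    b = mask.size(0)
    t = torch.randint(0, T, (b,), device=mask.device)
    abar = torch.cos(t.float() / T * torch.pi / 2) ** 2  # cosine schedule
    abar = abar.view(b, 1, 1, 1)
    noise = torch.randn_like(mask)
    noisy = abar.sqrt() * mask + (1 - abar).sqrt() * noise
    pred = denoiser(noisy, cond, t)   # predicts the added noise
    return F.mse_loss(pred, noise)
```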
- Learning Modulated Transformation in GANs [69.95217723100413]
We equip the generator in generative adversarial networks (GANs) with a plug-and-play module, termed the modulated transformation module (MTM).
MTM predicts spatial offsets under the control of latent codes, based on which the convolution operation can be applied at variable locations.
Notably, for human generation on the challenging TaiChi dataset, we improve the FID of StyleGAN3 from 21.36 to 13.60, demonstrating the efficacy of learning modulated geometry transformation.
arXiv Detail & Related papers (2023-08-29T17:51:22Z)
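The mechanism described, offsets predicted from a latent code that shift where convolution samples, maps naturally onto deformable convolution. The sketch below uses torchvision's DeformConv2d under that assumption; broadcasting one offset per kernel tap across all spatial locations is a simplification of whatever MTM actually predicts.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ModulatedTransformation(nn.Module):
    """Hedged sketch: a latent code predicts spatial offsets, and the
    convolution samples at those offset locations via deformable
    convolution. Layer sizes are illustrative assumptions."""
    def __init__(self, channels, latent_dim, k=3):
        super().__init__()
        self.offset_head = nn.Linear(latent_dim, 2 * k * k)
        self.deform = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, feats, latent):
        # feats: [B, C, H, W]; latent: [B, latent_dim]
        b, _, h, w = feats.shape
        # one (dy, dx) offset per kernel tap, broadcast over all locations
        offsets = (self.offset_head(latent)
                   .view(b, -1, 1, 1)
                   .expand(b, -1, h, w)
                   .contiguous())
        return self.deform(feats, offsets)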
- Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with a lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z)
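A minimal sketch of the gathering-scattering idea, assuming a soft per-token foreground probability is available: attend over foreground-weighted and background-weighted token pools separately, then merge. The block below illustrates separate fg/bg semantic modeling only; it is not the paper's exact SGST.

```python
import torch
import torch.nn as nn

class SeparateFGBGAttention(nn.Module):
    """Hedged sketch: model foreground and background semantics with two
    separate attention passes, then merge the results."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.fg_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.bg_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x, fg_prob):
        # x: [B, T, dim]; fg_prob: [B, T, 1] soft foreground probability
        fg, bg = x * fg_prob, x * (1 - fg_prob)
        fg_out, _ = self.fg_attn(x, fg, fg)  # gather fg context, scatter to all
        bg_out, _ = self.bg_attn(x, bg, bg)  # gather bg context, scatter to all
        return x + fg_out + bg_out
```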
- DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention-based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to handling the variation in the optimal number of tokens each position should attend to.
Experiments on three applications (pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer) demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
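A hedged sketch of a dynamic-attention unit: after the softmax, each query keeps only the keys whose weight is within a fraction of that row's maximum, so the number of attended tokens varies per position, and the surviving weights are renormalized. The pruning rule is an assumption, not DynaST's published unit.

```python
import torch

def dynamic_sparse_attention(q, k, v, rel_threshold=0.1):
    """Per-query dynamic pruning of attention (illustrative).
    q, k, v: [..., T, d] tensors."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    w = torch.softmax(scores, dim=-1)
    # keep keys whose weight is within rel_threshold of the row maximum;
    # the maximum itself always survives, so each row keeps >= 1 key
    keep = w >= rel_threshold * w.amax(dim=-1, keepdim=True)
    w = w * keep
    w = w / w.sum(dim=-1, keepdim=True)  # renormalize surviving weights
    return w @ v
```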
- Crowd Counting via Hierarchical Scale Recalibration Network [61.09833400167511]
We propose a novel Hierarchical Scale Recalibration Network (HSRNet) to tackle the task of crowd counting.
HSRNet models rich contextual dependencies and recalibrates scale-associated information across multiple levels.
Our approach can selectively ignore various sources of noise and automatically focus on the appropriate crowd scales.
arXiv Detail & Related papers (2020-03-07T10:06:47Z)
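Scale recalibration can be pictured as gating multi-scale branches by learned importance weights. The squeeze-and-excitation-style sketch below is a generic reading of that idea, assuming the branches have already been resampled to one resolution; it is not HSRNet's exact block.

```python
import torch
import torch.nn as nn

class ScaleRecalibration(nn.Module):
    """Hedged sketch: pool each scale branch to a descriptor, predict a
    per-branch gate, and fuse the gated branches."""
    def __init__(self, channels, n_scales):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(n_scales * channels, n_scales),
            nn.Sigmoid(),
        )

    def forward(self, branches):
        # branches: list of [B, C, H, W] maps at the same resolution
        desc = torch.cat([b.mean(dim=(2, 3)) for b in branches], dim=1)
        gates = self.gate(desc)   # [B, n_scales], one weight per branch
        return sum(g[:, None, None, None] * b
                   for g, b in zip(gates.unbind(dim=1), branches))
```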
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.