Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
- URL: http://arxiv.org/abs/2506.07986v3
- Date: Wed, 23 Jul 2025 03:45:11 GMT
- Title: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
- Authors: Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, Kwan-Yee K. Wong
- Abstract summary: Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. We propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models.
- Score: 79.94246924019984
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle with achieving precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, which hinder the alignment. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our code is publicly available at https://github.com/Vchitect/TACA
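As a rough illustration of what the abstract's "temperature scaling and timestep-dependent adjustment" could look like inside a joint text-image attention layer, here is a minimal PyTorch-style sketch. The function name, the linear temperature schedule, the default gamma, and the choice to rescale only the image-query-to-text-key block are illustrative assumptions rather than the paper's exact formulation; the authors' implementation is available at https://github.com/Vchitect/TACA.

```python
import torch

def taca_attention(q, k, v, num_text_tokens, t, gamma=1.2):
    """Hypothetical sketch of temperature-adjusted cross-modal attention.

    q, k, v: (batch, heads, seq, dim) joint text+image token sequences,
             with the first `num_text_tokens` positions holding text tokens.
    t:       normalized diffusion timestep in [0, 1] (1 = most noisy).
    gamma:   assumed base temperature applied to cross-modal logits.
    """
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) * d ** -0.5          # (B, H, S, S)

    # Assumed linear schedule: stronger text emphasis at high noise levels,
    # falling back to standard attention as t -> 0.
    temp = 1.0 + (gamma - 1.0) * t

    # Boost only the cross-modal block where image queries attend to text keys,
    # counteracting the suppression caused by the text/image token imbalance.
    boost = torch.ones_like(logits)
    boost[..., num_text_tokens:, :num_text_tokens] = temp

    attn = (logits * boost).softmax(dim=-1)
    return attn @ v
```

Because this only rescales attention logits, it introduces no new weights by itself; per the abstract, it is paired with LoRA fine-tuning so the model can adapt to the rebalanced attention.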
Related papers
- UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation [15.585320469279813]
Masked Generative Transformers have gained attention as an alternative to Autoregressive Models to overcome the inherent limitations of causal attention and autoregressive decoding. We propose Unmasking with Contrastive Attention Guidance (UNCAGE), a novel training-free method that improves compositional fidelity by leveraging attention maps. UNCAGE consistently improves performance in both quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead.
arXiv Detail & Related papers (2025-08-07T13:51:17Z)
- BSAFusion: A Bidirectional Stepwise Feature Alignment Network for Unaligned Medical Image Fusion [11.306367018981678]
This paper proposes an unaligned medical image fusion method called Bidirectional Stepwise Feature Alignment and Fusion. In terms of feature alignment, BSFA-F employs a bidirectional stepwise alignment deformation field prediction strategy. The experimental results across multiple datasets demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2024-12-11T02:56:23Z)
- Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory [33.78620829249978]
Text-to-image (T2I) diffusion models have revolutionized generative modeling by producing high-fidelity, diverse, and visually realistic images.
Recent attention-based methods have improved object inclusion and linguistic binding, but still face challenges such as attribute misbinding.
We propose a Bayesian approach that designs custom priors over attention distributions to enforce desirable properties.
Our approach treats the attention mechanism as an interpretable component, enabling fine-grained control and improved attribute-object alignment.
arXiv Detail & Related papers (2024-11-25T10:57:48Z)
- Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [54.96563068182733]
We propose Modality Adaptation with text-to-image Diffusion Models (MADM) for the semantic segmentation task.
MADM utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities.
We show that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities.
arXiv Detail & Related papers (2024-10-29T03:49:40Z)
- Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z)
- CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching [48.04807531821024]
We propose CoMat, an end-to-end diffusion model fine-tuning strategy with an image-to-text concept matching mechanism. We leverage an image captioning model to measure image-to-text alignment and guide the diffusion model to revisit ignored tokens. Without any image or human preference data, we use only 20K text prompts to fine-tune SDXL to obtain CoMat-SDXL.
arXiv Detail & Related papers (2024-04-04T17:59:46Z)
- Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z)
- Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models [58.46926334842161]
This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps.
We propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores.
Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability.
arXiv Detail & Related papers (2023-12-10T22:07:42Z)
- MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z)
- Controlling Text-to-Image Diffusion by Orthogonal Finetuning [74.21549380288631]
We introduce Orthogonal Finetuning (OFT), a principled finetuning method for adapting text-to-image diffusion models to downstream tasks.
Unlike existing methods, OFT can provably preserve hyperspherical energy which characterizes the pairwise neuron relationship on the unit hypersphere.
We empirically show that our OFT framework outperforms existing methods in generation quality and convergence speed.
arXiv Detail & Related papers (2023-06-12T17:59:23Z)
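As a rough sketch of the orthogonal-finetuning idea summarized above, the snippet below wraps a frozen linear layer and learns only an orthogonal rotation of its output neurons, built from a skew-symmetric parameter via the Cayley transform so that pairwise neuron angles are preserved. The class name, the full (non-block-diagonal) parameterization, and the handling of the bias are illustrative assumptions, not the OFT paper's exact construction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrthogonalFinetunedLinear(nn.Module):
    """Hypothetical sketch: finetune a frozen linear layer via an orthogonal transform."""

    def __init__(self, frozen_linear: nn.Linear):
        super().__init__()
        self.frozen = frozen_linear
        for p in self.frozen.parameters():
            p.requires_grad_(False)
        out_dim = frozen_linear.out_features
        # Learnable skew-symmetric generator, initialized to zero so the
        # rotation starts as the identity and training begins at the base model.
        self.skew = nn.Parameter(torch.zeros(out_dim, out_dim))

    def forward(self, x):
        S = self.skew - self.skew.T                        # enforce skew-symmetry
        I = torch.eye(S.size(0), device=S.device, dtype=S.dtype)
        R = torch.linalg.solve(I + S, I - S)               # Cayley transform: R = (I + S)^{-1}(I - S), orthogonal
        W = R @ self.frozen.weight                         # rotate the frozen output neurons
        return F.linear(x, W, self.frozen.bias)
```

Only the skew-symmetric generator is trained while the base weights stay frozen, which mirrors the preservation of pairwise neuron relationships that the summary above attributes to OFT; practical variants typically constrain the rotation (e.g. to a block-diagonal form) to keep the extra parameter count small, which this sketch omits for simplicity.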