Rethinking Global Text Conditioning in Diffusion Transformers
- URL: http://arxiv.org/abs/2602.09268v1
- Date: Mon, 09 Feb 2026 23:06:58 GMT
- Title: Rethinking Global Text Conditioning in Diffusion Transformers
- Authors: Nikita Starodubcev, Daniil Pakhomov, Zongze Wu, Ilya Drobyshevskiy, Yuchen Liu, Zhonghao Wang, Yuqian Zhou, Zhe Lin, Dmitry Baranchuk
- Abstract summary: Diffusion transformers typically incorporate textual information via attention layers and a modulation mechanism. Recent approaches discard modulation-based text conditioning and rely exclusively on attention. This paper addresses whether modulation-based text conditioning is necessary and whether it can provide any performance advantage.
- Score: 28.353061239439587
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion transformers typically incorporate textual information via attention layers and a modulation mechanism using a pooled text embedding. Nevertheless, recent approaches discard modulation-based text conditioning and rely exclusively on attention. In this paper, we address whether modulation-based text conditioning is necessary and whether it can provide any performance advantage. Our analysis shows that, in its conventional usage, the pooled embedding contributes little to overall performance, suggesting that attention alone is generally sufficient for faithfully propagating prompt information. However, we reveal that the pooled embedding can provide significant gains when used from a different perspective: serving as guidance and enabling controllable shifts toward more desirable properties. This approach is training-free, simple to implement, incurs negligible runtime overhead, and can be applied to various diffusion models, bringing improvements across diverse tasks, including text-to-image/video generation and image editing.
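For readers unfamiliar with the modulation mechanism the abstract refers to, the following is a minimal sketch of adaLN-style modulation, in which a pooled text embedding is projected to a per-channel shift and scale applied after layer normalization. The abstract does not give the paper's guidance formula, so `shift_pooled` is a hypothetical illustration only: it assumes that "guidance" means moving the pooled embedding along the direction between two reference embeddings. All names, dimensions, and random weights here are assumptions, not taken from the paper.

```python
import math
import random


def layer_norm(x, eps=1e-6):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, vec):
    """Apply a weight matrix (given as a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def modulate(token, pooled, w_shift, w_scale):
    """adaLN-style modulation: the pooled embedding sets a per-channel shift and scale."""
    shift = linear(w_shift, pooled)
    scale = linear(w_scale, pooled)
    normed = layer_norm(token)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


def shift_pooled(pooled, pooled_pos, pooled_neg, strength):
    """Hypothetical guidance (an assumption): move pooled along (pooled_pos - pooled_neg)."""
    return [p + strength * (a - b) for p, a, b in zip(pooled, pooled_pos, pooled_neg)]


def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]


def rand_mat(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
dim_text, dim_model = 4, 6  # toy sizes chosen for illustration

w_shift = rand_mat(dim_model, dim_text)
w_scale = rand_mat(dim_model, dim_text)
token = rand_vec(dim_model)        # one image-token feature vector
pooled = rand_vec(dim_text)        # pooled embedding of the prompt
pooled_pos = rand_vec(dim_text)    # stand-in for a "desirable" reference embedding
pooled_neg = rand_vec(dim_text)    # stand-in for an "undesirable" reference embedding

baseline = modulate(token, pooled, w_shift, w_scale)
guided = modulate(
    token, shift_pooled(pooled, pooled_pos, pooled_neg, 2.0), w_shift, w_scale
)
```

In this sketch the shift touches only the vector fed to the modulation path, so it needs no retraining and adds a single vector operation per step. With `strength` set to 0 it reduces to the unmodified baseline.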
Related papers
- Shifting the Breaking Point of Flow Matching for Multi-Instance Editing [47.32746672482526]
We introduce Instance-Disentangled Attention, a mechanism that partitions joint attention operations and enforces binding between instance-specific textual instructions and spatial regions. Our approach promotes edit disentanglement and locality while preserving global output coherence, enabling single-pass, instance-level editing.
arXiv Detail & Related papers (2026-02-09T14:52:45Z) - Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization [75.88719716002014]
Domain Generalization (DG) seeks to develop a versatile model capable of performing effectively on unseen target domains. Recent advances in pre-trained Visual Foundation Models (VFMs) have demonstrated considerable potential in enhancing the generalization capabilities of deep learning models. We propose addressing this challenge by leveraging the controllable and flexible language prompt of the VFM.
arXiv Detail & Related papers (2025-07-03T03:52:37Z) - Training-Free Text-Guided Image Editing with Visual Autoregressive Model [46.201510044410995]
We propose a novel text-guided image editing framework based on Visual AutoRegressive modeling. Our method eliminates the need for explicit inversion while ensuring precise and controlled modifications. Our framework operates in a training-free manner and achieves high-fidelity editing with faster inference speeds.
arXiv Detail & Related papers (2025-03-31T09:46:56Z) - Contrastive Learning Guided Latent Diffusion Model for Image-to-Image Translation [7.218556478126324]
The diffusion model has demonstrated superior performance in generating diverse and high-quality images for text-guided image translation. We propose pix2pix-zeroCon, a zero-shot diffusion-based method that eliminates the need for additional training by leveraging patch-wise contrastive loss. Our approach requires no additional training and operates directly on a pre-trained text-to-image diffusion model.
arXiv Detail & Related papers (2025-03-26T12:15:25Z) - SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing [5.123822132804602]
We introduce a skeleton-aware latent diffusion (SALAD) model that captures the intricate inter-relationships between joints, frames, and words. By leveraging cross-attention maps produced during the generation process, we enable attention-based zero-shot text-driven motion editing. Our approach significantly outperforms previous methods in terms of text-motion alignment without compromising generation quality.
arXiv Detail & Related papers (2025-03-18T02:20:11Z) - ArtCrafter: Text-Image Aligning Style Transfer via Embedding Reframing [25.610375901522886]
ArtCrafter is a novel framework for text-to-image style transfer. We introduce an attention-based style extraction module. We also present a novel text-image aligning augmentation component.
arXiv Detail & Related papers (2025-01-03T19:17:27Z) - Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing [60.730661748555214]
We introduce Task-Oriented Diffusion Inversion (TODInv), a novel framework that inverts and edits real images tailored to specific editing tasks.
ToDInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability.
arXiv Detail & Related papers (2024-08-23T22:16:34Z) - Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control [73.6361029556484]
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs.
We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts.
We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
arXiv Detail & Related papers (2024-05-09T15:39:54Z) - Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z) - MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z) - Text Revision by On-the-Fly Representation Optimization [76.11035270753757]
Current state-of-the-art methods formulate these tasks as sequence-to-sequence learning problems.
We present an iterative in-place editing approach for text revision, which requires no parallel data.
It achieves competitive and even better performance than state-of-the-art supervised methods on text simplification.
arXiv Detail & Related papers (2022-04-15T07:38:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.