Compressed and Smooth Latent Space for Text Diffusion Modeling
- URL: http://arxiv.org/abs/2506.21170v2
- Date: Fri, 17 Oct 2025 18:11:02 GMT
- Title: Compressed and Smooth Latent Space for Text Diffusion Modeling
- Authors: Viacheslav Meshchaninov, Egor Chimbulatov, Alexander Shabalin, Aleksandr Abramov, Dmitry Vetrov
- Abstract summary: We introduce Cosmos, a novel approach to text generation that operates entirely in a compressed, smooth latent space tailored specifically for diffusion. We demonstrate that text representations can be compressed by $8\times$ while maintaining generation quality comparable to token-level diffusion models. We evaluate Cosmos on four diverse generative tasks (story generation, question generation, summarization, and detoxification) and compare it with various generative paradigms.
- Score: 71.87805084454187
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoregressive language models dominate modern text generation, yet their sequential nature introduces fundamental limitations: decoding is slow, and maintaining global coherence remains challenging. Diffusion models offer a promising alternative by enabling parallel generation and flexible control; however, their application to text generation is hindered by the high dimensionality of token-level representations. We introduce Cosmos, a novel approach to text generation that operates entirely in a compressed, smooth latent space tailored specifically for diffusion. This space is learned using an autoencoder trained simultaneously for token-level reconstruction and alignment with frozen activations from a pretrained language encoder, providing robust semantic grounding and enabling effective perturbation-based augmentations. Empirically, we demonstrate that text representations can be compressed by $8\times$ while maintaining generation quality comparable to token-level diffusion models. Furthermore, increasing the latent sequence length allows Cosmos to surpass both diffusion-based and autoregressive baselines. We evaluate Cosmos on four diverse generative tasks, including story generation, question generation, summarization, and detoxification, and compare it with various generative paradigms. Cosmos achieves comparable or superior generation quality while offering more than $2\times$ faster inference. Code is released at https://github.com/MeshchaninovViacheslav/cosmos
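The dual-objective autoencoder described in the abstract (token-level reconstruction plus alignment with frozen activations from a pretrained encoder) can be sketched as a combined loss. This is a hedged illustration only: the function name `cosmos_ae_loss`, the MSE alignment term, and the weighting `alpha` are assumptions for clarity, not the paper's exact formulation.

```python
import numpy as np

def cosmos_ae_loss(logits, target_ids, latents, frozen_acts, alpha=1.0):
    """Sketch of a combined autoencoder objective (assumed form):
    token-level cross-entropy reconstruction plus MSE alignment of the
    autoencoder latents with frozen pretrained-encoder activations.
    `alpha` is a hypothetical weighting between the two terms."""
    # Numerically stable log-softmax over the vocabulary dimension.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Cross-entropy of the target tokens, averaged over positions.
    ce = -np.mean(log_probs[np.arange(len(target_ids)), target_ids])
    # Alignment: pull latents toward the frozen encoder's activations.
    align = np.mean((latents - frozen_acts) ** 2)
    return ce + alpha * align
```

The alignment term is what gives the latent space its semantic grounding: latents that reconstruct well but drift from the pretrained encoder's representation are still penalized.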
Related papers
- SemanticGen: Video Generation in Semantic Space [60.49729308406981]
State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. We introduce SemanticGen, a novel solution to generate videos in the semantic space. Our method is also effective and computationally efficient when extended to long video generation.
arXiv Detail & Related papers (2025-12-23T18:59:56Z) - MoLingo: Motion-Language Alignment for Text-to-Motion Generation [50.33970522600594]
MoLingo is a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. We propose a semantic-aligned motion encoder trained with frame-level text labels so that latents with similar text meaning stay close. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention gives better motion realism and text-motion alignment.
arXiv Detail & Related papers (2025-12-15T19:22:40Z) - Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model [87.23753533733046]
We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder.
arXiv Detail & Related papers (2025-05-29T16:15:48Z) - STRICT: Stress Test of Rendering Images Containing Text [11.236527918747925]
$\textbf{STRICT}$ is a benchmark designed to stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. We evaluate several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities.
arXiv Detail & Related papers (2025-05-25T05:37:08Z) - Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation [45.560812800359685]
Smoothing Diffusion on Token Embeddings (Smoothie) is a novel diffusion method that combines the strengths of embedding-space and categorical approaches by progressively smoothing token embeddings based on semantic similarity. Experimental results on several sequence-to-sequence generation tasks demonstrate that Smoothie outperforms existing diffusion-based models in generation quality. Our proposed diffusion space yields better performance than both the standard embedding space and the categorical simplex.
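As a rough illustration of similarity-based smoothing of token embeddings, each embedding can be replaced by a similarity-weighted average of its neighbors. This is a hedged sketch with an assumed Gaussian kernel and temperature `tau`; Smoothie's actual smoothing schedule may differ.

```python
import numpy as np

def smooth_embeddings(emb, tau=1.0):
    """Sketch: smooth a (num_tokens, dim) embedding matrix so that
    semantically close embeddings are pulled toward each other.
    The Gaussian kernel and `tau` are illustrative assumptions."""
    # Pairwise squared distances between token embeddings.
    d2 = ((emb[:, None, :] - emb[None, :, :]) ** 2).sum(axis=-1)
    # Gaussian similarity kernel, row-normalized into weights.
    w = np.exp(-d2 / tau)
    w /= w.sum(axis=1, keepdims=True)
    # Each embedding becomes a weighted average of similar embeddings.
    return w @ emb
```

Nearby embeddings are averaged toward each other while distant ones are left nearly unchanged, which is the intuition behind smoothing "based on semantic similarity."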
arXiv Detail & Related papers (2025-05-24T20:02:14Z) - Auto-Regressive Diffusion for Generating 3D Human-Object Interactions [5.587507490937267]
A key challenge in human-object interaction (HOI) generation is maintaining interaction consistency in long sequences. We propose an autoregressive diffusion model (ARDHOI) that predicts the next continuous token. Our model has been evaluated on the OMOMO and BEHAVE datasets.
arXiv Detail & Related papers (2025-03-21T02:25:59Z) - Precise Parameter Localization for Textual Generation in Diffusion Models [7.057901456502796]
Novel diffusion models can synthesize photo-realistic images with integrated high-quality text. We show through attention activation patching that less than 1% of a diffusion model's parameters, all contained in attention layers, influence the generation of textual content within images. We introduce several applications that benefit from localizing the layers responsible for textual content generation.
arXiv Detail & Related papers (2025-02-14T06:11:23Z) - TextCraftor: Your Text Encoder Can be Image Quality Controller [65.27457900325462]
Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have revolutionized the field of content generation.
We propose TextCraftor, a fine-tuning approach to enhance the performance of text-to-image diffusion models.
arXiv Detail & Related papers (2024-03-27T19:52:55Z) - Text2Data: Low-Resource Data Generation with Textual Control [100.5970757736845]
Text2Data is a novel approach that utilizes unlabeled data to understand the underlying data distribution. It undergoes finetuning via a novel constraint-optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z) - AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation [138.98095392584693]
We introduce Auto-Regressive Diffusion (AR-Diffusion) to account for the inherent sequential characteristic of natural language.
AR-Diffusion ensures that the generation of tokens on the right depends on the generated ones on the left, a mechanism achieved through employing a dynamic number of denoising steps.
In a series of experiments on various text generation tasks, AR-Diffusion clearly demonstrated its superiority over existing diffusion language models.
arXiv Detail & Related papers (2023-05-16T15:10:22Z) - TESS: Text-to-Text Self-Conditioned Simplex Diffusion [56.881170312435444]
Text-to-text Self-conditioned Simplex Diffusion employs a new form of self-conditioning, and applies the diffusion process on the logit simplex space rather than the learned embedding space.
We demonstrate that TESS outperforms state-of-the-art non-autoregressive models, requires fewer diffusion steps with minimal drop in performance, and is competitive with pretrained autoregressive sequence-to-sequence models.
arXiv Detail & Related papers (2023-05-15T06:33:45Z) - SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers [50.90457644954857]
In this work, we apply diffusion models to approach sequence-to-sequence text generation.
We propose SeqDiffuSeq, a text diffusion model for sequence-to-sequence generation.
Experimental results demonstrate strong performance on sequence-to-sequence generation in terms of both text quality and inference time.
arXiv Detail & Related papers (2022-12-20T15:16:24Z) - Self-conditioned Embedding Diffusion for Text Generation [28.342735885752493]
Self-conditioned Embedding Diffusion is a continuous diffusion mechanism that operates on token embeddings.
We show that our text diffusion models generate samples comparable with those produced by standard autoregressive language models.
arXiv Detail & Related papers (2022-11-08T13:30:27Z) - PLANET: Dynamic Content Planning in Autoregressive Transformers for Long-form Text Generation [47.97523895218194]
We propose a novel generation framework leveraging the autoregressive self-attention mechanism to conduct content planning and surface realization dynamically.
Our framework enriches the Transformer decoder with latent representations to maintain sentence-level semantic plans grounded by bag-of-words.
arXiv Detail & Related papers (2022-03-17T05:52:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.