Segment-Level Diffusion: A Framework for Controllable Long-Form Generation with Diffusion Language Models
- URL: http://arxiv.org/abs/2412.11333v1
- Date: Sun, 15 Dec 2024 22:47:44 GMT
- Title: Segment-Level Diffusion: A Framework for Controllable Long-Form Generation with Diffusion Language Models
- Authors: Xiaochen Zhu, Georgi Karadzhov, Chenxi Whitehouse, Andreas Vlachos
- Abstract summary: Token-level diffusion overlooks word-order dependencies and enforces short output windows, while passage-level diffusion struggles to learn robust representations for long-form text.
We propose Segment-Level Diffusion (SLD), a framework that enhances diffusion-based text generation through text segmentation, robust representation training with adversarial and contrastive learning, and improved latent-space guidance.
- Score: 12.446047799880587
- Abstract: Diffusion models have shown promise in text generation but often struggle with generating long, coherent, and contextually accurate text. Token-level diffusion overlooks word-order dependencies and enforces short output windows, while passage-level diffusion struggles to learn robust representations for long-form text. To address these challenges, we propose Segment-Level Diffusion (SLD), a framework that enhances diffusion-based text generation through text segmentation, robust representation training with adversarial and contrastive learning, and improved latent-space guidance. By segmenting long-form outputs into separate latent representations and decoding them with an autoregressive decoder, SLD simplifies diffusion predictions and improves scalability. Experiments on XSum, ROCStories, DialogSum, and DeliData demonstrate that SLD achieves competitive or superior performance in fluency, coherence, and contextual compatibility across automatic and human evaluation metrics compared with other diffusion and autoregressive baselines. Ablation studies further validate the effectiveness of our segmentation and representation learning strategies.
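A minimal sketch of the segment-level pipeline described above, written as PyTorch-style code under our own assumptions (module names such as SegmentEncoder, LatentDenoiser, and SegmentDecoder are illustrative; the noise schedule and the adversarial/contrastive representation training are omitted):

```python
# Illustrative sketch of segment-level latent diffusion (not the authors'
# code). Long outputs are split into segments, each segment is mapped to one
# latent vector, diffusion operates over the short sequence of segment
# latents, and an autoregressive decoder verbalizes each latent.
import torch
import torch.nn as nn

VOCAB, D, MAX_SEGS = 1000, 256, 8


class SegmentEncoder(nn.Module):
    """Compress one text segment (token ids) into a single latent vector."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D)
        self.rnn = nn.GRU(D, D, batch_first=True)

    def forward(self, seg_tokens):                 # (B, L) -> (B, D)
        _, h = self.rnn(self.emb(seg_tokens))
        return h[-1]


class LatentDenoiser(nn.Module):
    """Predict clean segment latents from noised ones, given the context."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.net = nn.TransformerEncoder(layer, num_layers=2)
        self.time_emb = nn.Embedding(1000, D)

    def forward(self, noisy_latents, t, context):  # (B, S, D), (B,), (B, D)
        h = noisy_latents + self.time_emb(t)[:, None, :] + context[:, None, :]
        return self.net(h)                         # predicted clean latents


class SegmentDecoder(nn.Module):
    """Autoregressively decode one segment conditioned on its latent."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D)
        self.rnn = nn.GRU(D, D, batch_first=True)
        self.out = nn.Linear(D, VOCAB)

    def forward(self, latent, prev_tokens):        # teacher forcing
        out, _ = self.rnn(self.emb(prev_tokens), latent.unsqueeze(0))
        return self.out(out)                       # (B, L, VOCAB) logits


# One toy training step: noise the segment latents, predict them back, and
# train the decoder to reconstruct tokens from the clean latents.
enc, denoiser, dec = SegmentEncoder(), LatentDenoiser(), SegmentDecoder()
segments = torch.randint(0, VOCAB, (2, MAX_SEGS, 16))   # batch x segs x tokens
context = torch.randn(2, D)                              # encoded input/prompt

latents = torch.stack([enc(segments[:, s]) for s in range(MAX_SEGS)], dim=1)
t = torch.randint(0, 1000, (2,))
noisy = latents + torch.randn_like(latents)              # noise schedule omitted
pred = denoiser(noisy, t, context)

diffusion_loss = nn.functional.mse_loss(pred, latents.detach())
logits = dec(latents[:, 0], segments[:, 0, :-1])          # decode segment 0
decode_loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), segments[:, 0, 1:].reshape(-1))
(diffusion_loss + decode_loss).backward()
```

The point the sketch illustrates is that diffusion runs over a short sequence of segment latents rather than over every token, while an autoregressive decoder remains responsible for the surface realization of each segment.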
Related papers
- Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models [0.7366405857677226]
The Vision-Language Aligned Diffusion (VLAD) model is a generative framework that addresses text-image alignment challenges through a dual-stream strategy.
VLAD decomposes textual prompts into global and local representations, ensuring precise alignment with visual features.
It incorporates a multi-stage diffusion process with hierarchical guidance to generate high-fidelity images.
arXiv Detail & Related papers (2025-01-01T18:27:13Z)
- Mind the Gap: A Generalized Approach for Cross-Modal Embedding Alignment [0.0]
Retrieval-Augmented Generation (RAG) systems often struggle to retrieve relevant context across different text modalities because of semantic gaps.
We introduce a generalized projection-based method, inspired by adapter modules in transfer learning, that efficiently bridges these gaps.
Our approach emphasizes speed, accuracy, and data efficiency, requiring minimal resources for training and inference.
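A generic sketch of what such a projection-based alignment adapter could look like (our own minimal example, not the paper's implementation; the MLP shape, cosine objective, and frozen-encoder assumption are ours):

```python
# Illustrative projection adapter for aligning embeddings from one text
# modality (e.g., queries) to another (e.g., structured documents).
# Generic sketch only, not the cited paper's method.
import torch
import torch.nn as nn


class ProjectionAdapter(nn.Module):
    def __init__(self, dim: int = 768, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)


adapter = ProjectionAdapter()
opt = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

# Paired embeddings from a frozen encoder: source and target modality.
src = torch.randn(32, 768)
tgt = torch.randn(32, 768)

# Train the adapter to map source embeddings close to their paired targets.
for _ in range(3):
    loss = 1 - nn.functional.cosine_similarity(adapter(src), tgt).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

At retrieval time, adapter(src) would replace the raw source-modality embedding before nearest-neighbour search against target-modality vectors.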
arXiv Detail & Related papers (2024-10-30T20:28:10Z)
- LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? [10.72249123249003]
We revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding.
We introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions.
LaDiC achieves state-of-the-art performance among diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr.
arXiv Detail & Related papers (2024-04-16T17:47:16Z)
- Improved Paraphrase Generation via Controllable Latent Diffusion [60.479643304122504]
We propose the Latent Diffusion Paraphraser (LDP), a novel paraphrase generation framework that models a controllable diffusion process.
Experiments show that LDP better reconciles paraphrase generation quality and diversity than baselines.
arXiv Detail & Related papers (2024-04-13T09:24:32Z)
- Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z)
- Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z)
- TESS: Text-to-Text Self-Conditioned Simplex Diffusion [56.881170312435444]
Text-to-text Self-conditioned Simplex Diffusion employs a new form of self-conditioning, and applies the diffusion process on the logit simplex space rather than the learned embedding space.
We demonstrate that TESS outperforms state-of-the-art non-autoregressive models, requires fewer diffusion steps with minimal drop in performance, and is competitive with pretrained autoregressive sequence-to-sequence models.
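A simplified illustration of diffusing in logit space over the vocabulary simplex (a hedged sketch, not the TESS implementation; the scaled one-hot encoding, single noise step, and linear stand-in denoiser are our assumptions):

```python
# Generic illustration of diffusion over token logits ("simplex space"):
# each token is represented as a scaled one-hot logit vector, Gaussian noise
# is added to the logits, and a model learns to recover the clean tokens.
# Simplified sketch only, not the TESS implementation.
import torch
import torch.nn as nn

VOCAB, K = 100, 5.0                  # vocabulary size, logit scale


def to_logit_simplex(tokens):
    """Map token ids to +/-K logit vectors (near-one-hot after softmax)."""
    one_hot = nn.functional.one_hot(tokens, VOCAB).float()
    return K * (2 * one_hot - 1)     # (B, L, VOCAB)


tokens = torch.randint(0, VOCAB, (2, 8))
x0 = to_logit_simplex(tokens)
alpha = 0.3                          # one step of a noise schedule (omitted)
xt = alpha ** 0.5 * x0 + (1 - alpha) ** 0.5 * torch.randn_like(x0)

# A denoiser maps softmax(xt), a noisy point on the simplex, back to token
# logits; a linear layer stands in for the transformer here.
denoiser = nn.Linear(VOCAB, VOCAB)
logits = denoiser(torch.softmax(xt, dim=-1))
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), tokens.reshape(-1))
loss.backward()
```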
arXiv Detail & Related papers (2023-05-15T06:33:45Z)
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD enables faster adaptation to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
- DiscoDVT: Generating Long Text with Discourse-Aware Discrete Variational Transformer [40.10695204278747]
We propose DiscoDVT, a discourse-aware discrete variational Transformer, to tackle the incoherence issue in long text generation.
We conduct extensive experiments on two open story generation datasets and demonstrate that the latent codes learn a meaningful correspondence to the discourse structures, guiding the model to generate long texts with better long-range coherence.
arXiv Detail & Related papers (2021-10-12T13:41:06Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
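Cross-modality region contrastive learning of this kind is commonly implemented with an InfoNCE-style objective over paired region and text features; the following is a generic sketch under that assumption, not the cited paper's code:

```python
# Generic InfoNCE-style contrastive loss between image-region features and
# text features. Sketch of cross-modality region contrastive learning only,
# not the cited paper's implementation.
import torch
import torch.nn.functional as F


def region_text_contrastive(region_feats, text_feats, temperature=0.07):
    """region_feats, text_feats: (N, D) paired features; row i matches row i."""
    region = F.normalize(region_feats, dim=-1)
    text = F.normalize(text_feats, dim=-1)
    logits = region @ text.t() / temperature        # (N, N) similarity matrix
    targets = torch.arange(region.size(0))
    # Symmetric loss: match regions to texts and texts to regions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


loss = region_text_contrastive(torch.randn(16, 256, requires_grad=True),
                               torch.randn(16, 256, requires_grad=True))
loss.backward()
```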
arXiv Detail & Related papers (2021-09-24T07:20:13Z)