Related papers: Multimodal Latent Language Modeling with Next-Token Diffusion

Multimodal Latent Language Modeling with Next-Token Diffusion

URL: http://arxiv.org/abs/2412.08635v1
Date: Wed, 11 Dec 2024 18:57:32 GMT
Title: Multimodal Latent Language Modeling with Next-Token Diffusion
Authors: Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, Furu Wei,
Abstract summary: Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video)<n>We propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers.
Score: 111.93906046452125
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers. Specifically, we employ a variational autoencoder (VAE) to represent continuous data as latent vectors and introduce next-token diffusion for autoregressive generation of these vectors. Additionally, we develop $\sigma$-VAE to address the challenges of variance collapse, which is crucial for autoregressive modeling. Extensive experiments demonstrate the effectiveness of LatentLM across various modalities. In image generation, LatentLM surpasses Diffusion Transformers in both performance and scalability. When integrated into multimodal large language models, LatentLM provides a general-purpose interface that unifies multimodal generation and understanding. Experimental results show that LatentLM achieves favorable performance compared to Transfusion and vector quantized models in the setting of scaling up training tokens. In text-to-speech synthesis, LatentLM outperforms the state-of-the-art VALL-E 2 model in speaker similarity and robustness, while requiring 10x fewer decoding steps. The results establish LatentLM as a highly effective and scalable approach to advance large multimodal models.

Related papers

Kelix Technical Report [86.64551727600104]
We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.<n>Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling.
arXiv Detail & Related papers (2026-02-10T14:48:26Z)
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models.<n>MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
Show-o2: Improved Native Unified Multimodal Models [57.34173415412808]
Show-o2 is a native unified multimodal models that leverage autoregressive modeling and flow matching.<n>Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion.
arXiv Detail & Related papers (2025-06-18T15:39:15Z)
Discrete Diffusion in Large Language and Multimodal Models: A Survey [61.86669998363359]
We provide a systematic survey of Discrete Diffusion Language Models (dLLMs) and Discrete Diffusion Multimodal Language Models (dMLLMs)<n>Unlike autoregressive (AR) models, dLLMs and dMLLMs adopt a multi-token, parallel decoding paradigm using full attention and a denoising-based generation strategy.<n>We trace the historical development of dLLMs and dMLLMs, formalize the underlying mathematical frameworks, list commonly-used modeling methods, and categorize representative models.
arXiv Detail & Related papers (2025-06-16T17:59:08Z)
DLM-One: Diffusion Language Models for One-Step Sequence Generation [63.43422118066493]
DLM-One is a score-distillation-based framework for one-step sequence generation with continuous diffusion language models.<n>We investigate whether DLM-One can achieve substantial gains in sampling efficiency for language modeling.
arXiv Detail & Related papers (2025-05-30T22:42:23Z)
On Designing Diffusion Autoencoders for Efficient Generation and Representation Learning [14.707830064594056]
Diffusion autoencoders (DAs) use an input-dependent latent variable to capture representations alongside the diffusion process.<n>Better generative modelling is the primary goal of another class of diffusion models -- those that learn their forward (noising) process.
arXiv Detail & Related papers (2025-05-30T18:14:09Z)
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning [71.98260064022452]
We introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models.<n>Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and connector that projects visual features into the language embedding space.
arXiv Detail & Related papers (2025-05-22T17:23:26Z)
DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation [24.85655658070008]
Diffusion Transformer Autoregressive Modeling (DiTAR) is a patch-based autoregressive framework combining a language model with a diffusion transformer. In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness.
arXiv Detail & Related papers (2025-02-06T10:09:49Z)
Scalable Language Models with Posterior Inference of Latent Thought Vectors [52.63299874322121]
Latent-Thought Language Models (LTMs) incorporate explicit latent thought vectors that follow an explicit prior model in latent space. LTMs possess additional scaling dimensions beyond traditional LLMs, yielding a structured design space. LTMs significantly outperform conventional autoregressive models and discrete diffusion models in validation perplexity and zero-shot language modeling.
arXiv Detail & Related papers (2025-02-03T17:50:34Z)
Exploring Representation-Aligned Latent Space for Better Generation [86.45670422239317]
We introduce ReaLS, which integrates semantic priors to improve generation performance. We show that fundamental DiT and SiT trained on ReaLS can achieve a 15% improvement in FID metric. The enhanced semantic latent space enables more perceptual downstream tasks, such as segmentation and depth estimation.
arXiv Detail & Related papers (2025-02-01T07:42:12Z)
Dual Diffusion for Unified Image Generation and Understanding [32.7554623473768]
We propose a large-scale and fully end-to-end diffusion model for multi-modal understanding and generation. We leverage a cross-modal maximum likelihood estimation framework that simultaneously trains the conditional likelihoods of both images and text jointly. Our model attained competitive performance compared to recent unified image understanding and generation models.
arXiv Detail & Related papers (2024-12-31T05:49:00Z)
ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer [95.80384464922147]
Continuous visual generation requires the full-sequence diffusion-based approach.<n>We present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer.<n>We demonstrate that ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective.
arXiv Detail & Related papers (2024-12-10T18:13:20Z)
Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions. We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms. Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z)
Large AI Model Empowered Multimodal Semantic Communications [48.73159237649128]
We propose a Large AI Model-based Multimodal SC (LAMMSC) framework. We first present the Conditional-based Multimodal Alignment (MMA) that enables the transformation between multimodal and unimodal data. Then, a personalized LLM-based Knowledge Base (LKB) is proposed, which allows users to perform personalized semantic extraction or recovery. Finally, we apply the Generative adversarial network-based channel Estimation (CGE) for estimating the wireless channel state information.
arXiv Detail & Related papers (2023-09-03T19:24:34Z)
Diffusion Glancing Transformer for Parallel Sequence to Sequence Learning [52.72369034247396]
We propose the diffusion glancing transformer, which employs a modality diffusion process and residual glancing sampling. DIFFGLAT achieves better generation accuracy while maintaining fast decoding speed compared with both autoregressive and non-autoregressive models.
arXiv Detail & Related papers (2022-12-20T13:36:25Z)
Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
Improve Variational Autoencoder for Text Generationwith Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning. VAEs tend to ignore latent variables with a strong auto-regressive decoder. We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.