The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation
- URL: http://arxiv.org/abs/2503.04606v3
- Date: Tue, 29 Apr 2025 10:34:28 GMT
- Title: The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation
- Authors: Aoxiong Yin, Kai Shen, Yichong Leng, Xu Tan, Xinyu Zhou, Juncheng Li, Siliang Tang
- Abstract summary: LanDiff is a hybrid framework that synergizes the strengths of autoregressive language models and diffusion models. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D representations through efficient semantic compression; (2) a language model that generates semantic tokens with high-level semantic relationships; and (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos.
- Score: 53.837937703425794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models. However, each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a $\sim$14,000$\times$ compression ratio; (2) a language model that generates semantic tokens with high-level semantic relationships; and (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos. Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source model Hunyuan Video (13B) as well as commercial models such as Sora, Kling, and Hailuo. Our model also achieves state-of-the-art performance in long video generation, surpassing other open-source models in this field. Our demo can be viewed at https://landiff.github.io/.
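To make the coarse-to-fine pipeline described in the abstract concrete, below is a minimal, self-contained sketch of the three-stage dataflow: a semantic tokenizer that quantizes 3D visual features into a short 1D token sequence, a language model that autoregressively produces those semantic tokens, and a refinement stage that maps the coarse tokens to frames. All module internals, names, shapes, and the codebook size are illustrative assumptions rather than the paper's 5B architecture; in particular, the real system uses a streaming diffusion model where this sketch substitutes a simple linear decoder.

```python
# Illustrative sketch (not the authors' code) of LanDiff's coarse-to-fine flow.
import torch
import torch.nn as nn


class SemanticTokenizer(nn.Module):
    """Toy stand-in: compresses a (T, H, W, C) feature volume into N discrete token ids."""
    def __init__(self, codebook_size=8192, num_tokens=64, dim=256):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.LazyLinear(dim)
        self.codebook = nn.Embedding(codebook_size, dim)

    def encode(self, feats):                                   # feats: (B, T, H, W, C)
        b = feats.shape[0]
        chunks = feats.flatten(1).view(b, self.num_tokens, -1)  # crude 3D -> 1D split
        z = self.proj(chunks)                                    # (B, N, dim)
        d = torch.cdist(z, self.codebook.weight.unsqueeze(0))    # distance to codebook
        return d.argmin(dim=-1)                                  # (B, N) discrete ids


class SemanticLM(nn.Module):
    """Toy autoregressive model that predicts the next semantic token id."""
    def __init__(self, vocab=8192, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    @torch.no_grad()
    def generate(self, prompt_emb, length=64):                 # prompt_emb: (B, 1, dim)
        tokens, h, x = [], None, prompt_emb
        for _ in range(length):
            out, h = self.rnn(x, h)
            next_id = self.head(out[:, -1]).argmax(dim=-1)      # greedy decoding
            tokens.append(next_id)
            x = self.emb(next_id).unsqueeze(1)
        return torch.stack(tokens, dim=1)                       # (B, length) coarse plan


class Refiner(nn.Module):
    """Stand-in for the streaming diffusion model: maps coarse tokens to frames."""
    def __init__(self, vocab=8192, dim=256, frame_hw=(64, 64)):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.to_frame = nn.Linear(dim, 3 * frame_hw[0] * frame_hw[1])
        self.frame_hw = frame_hw

    def forward(self, token_ids):                              # (B, N) -> (B, N, 3, H, W)
        h, w = self.frame_hw
        x = self.to_frame(self.emb(token_ids))
        return x.view(*token_ids.shape, 3, h, w)


if __name__ == "__main__":
    tokenizer, lm, refiner = SemanticTokenizer(), SemanticLM(), Refiner()
    feats = torch.randn(1, 16, 8, 8, 64)             # dummy 3D visual features
    ids = tokenizer.encode(feats)                    # stage 1 (training-time): compress features
    plan = lm.generate(torch.randn(1, 1, 256))       # stage 2 (inference): LM writes coarse plan
    video = refiner(plan)                            # stage 3 (inference): refine plan into frames
    print(ids.shape, plan.shape, video.shape)
```

The split mirrors the abstract's division of labor: the language model reasons over a heavily compressed semantic token sequence, while a separate generative decoder is responsible for pixel-level fidelity.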
Related papers
- Dual Diffusion for Unified Image Generation and Understanding [32.7554623473768]
We propose a large-scale and fully end-to-end diffusion model for multi-modal understanding and generation. We leverage a cross-modal maximum likelihood estimation framework that jointly trains the conditional likelihoods of both images and text. Our model attains competitive performance compared to recent unified image understanding and generation models.
arXiv Detail & Related papers (2024-12-31T05:49:00Z) - DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models [72.24305287508474]
We introduce DiCoDe, a novel approach to generating videos with a language model in an autoregressive manner. By treating videos as temporal sequences, DiCoDe fully harnesses the capabilities of language models for autoregressive generation. We evaluate DiCoDe both quantitatively and qualitatively, demonstrating that it performs comparably to existing methods in terms of quality.
arXiv Detail & Related papers (2024-12-05T18:57:06Z) - HunyuanVideo: A Systematic Framework For Large Video Generative Models [82.4392082688739]
HunyuanVideo is an innovative open-source video foundation model. It incorporates data curation, advanced architectural design, progressive model scaling and training. As a result, we successfully trained a video generative model with over 13 billion parameters.
arXiv Detail & Related papers (2024-12-03T23:52:37Z) - Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions.
We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms.
Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z) - Photorealistic Video Generation with Diffusion Models [44.95407324724976]
W.A.L.T. is a transformer-based approach for video generation via diffusion modeling.
We use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities.
We also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos of $512 \times 896$ resolution at $8$ frames per second.
arXiv Detail & Related papers (2023-12-11T18:59:57Z) - Breathing New Life into 3D Assets with Generative Repainting [74.80184575267106]
Diffusion-based text-to-image models ignited immense attention from the vision community, artists, and content creators.
Recent works have proposed various pipelines powered by the entanglement of diffusion models and neural fields.
We explore the power of pretrained 2D diffusion models and standard 3D neural radiance fields as independent, standalone tools.
Our pipeline accepts any legacy renderable geometry, such as textured or untextured meshes, and orchestrates the interaction between 2D generative refinement and 3D consistency enforcement tools.
arXiv Detail & Related papers (2023-09-15T16:34:51Z) - LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z) - Grafting Pre-trained Models for Multimodal Headline Generation [12.063053852096514]
Multimodal headline generation utilizes both video frames and transcripts to generate natural language titles for videos.
Previous research on pre-trained language models and video-language models has achieved significant progress in related downstream tasks.
We propose a novel approach that grafts the video encoder from a pre-trained video-language model onto a generative pre-trained language model.
arXiv Detail & Related papers (2022-11-14T08:59:59Z) - Language Models are General-Purpose Interfaces [109.45478241369655]
We propose to use language models as a general-purpose interface to various foundation models.
A collection of pretrained encoders perceives diverse modalities (such as vision and language).
We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders.
arXiv Detail & Related papers (2022-06-13T17:34:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.