RISE-T2V: Rephrasing and Injecting Semantics with LLM for Expansive Text-to-Video Generation
- URL: http://arxiv.org/abs/2511.04317v1
- Date: Thu, 06 Nov 2025 12:42:03 GMT
- Title: RISE-T2V: Rephrasing and Injecting Semantics with LLM for Expansive Text-to-Video Generation
- Authors: Xiangjun Zhang, Litong Gong, Yinglin Zheng, Yansong Liu, Wentao Jiang, Mingyi Xu, Biao Wang, Tiezheng Ge, Ming Zeng
- Abstract summary: We introduce RISE-T2V, which uniquely integrates the processes of prompt rephrasing and semantic feature extraction into a single step. We propose an innovative module called the Rephrasing Adapter, enabling diffusion models to utilize text hidden states from the LLM's next-token prediction as a condition for video generation.
- Score: 19.127189099122244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most text-to-video (T2V) diffusion models depend on pre-trained text encoders for semantic alignment, yet they often fail to maintain video quality when provided with concise prompts rather than well-designed ones. The primary issue lies in their limited understanding of textual semantics. Moreover, these text encoders cannot rephrase prompts online to better align with user intentions, which limits both the scalability and usability of the models. To address these challenges, we introduce RISE-T2V, which uniquely integrates prompt rephrasing and semantic feature extraction into a single, seamless step instead of two separate ones. RISE-T2V is universal and can be applied to various pre-trained LLMs and video diffusion models (VDMs), significantly enhancing their capabilities for T2V tasks. We propose an innovative module called the Rephrasing Adapter, which enables diffusion models to use the text hidden states produced during the LLM's next-token prediction as a condition for video generation. With the Rephrasing Adapter, the video generation model can implicitly rephrase basic prompts into more comprehensive representations that better match the user's intent. Furthermore, we leverage the powerful capabilities of LLMs to enable video generation models to accomplish a broader range of T2V tasks. Extensive experiments demonstrate that RISE-T2V is a versatile framework applicable to different video diffusion model architectures, significantly enhancing the ability of T2V models to generate high-quality videos that align with user intent. Visual results are available at https://rise-t2v.github.io.
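A minimal sketch may help make the Rephrasing Adapter idea concrete. The PyTorch snippet below is a hypothetical illustration, not the paper's released implementation: hidden states gathered while an LLM autoregressively rephrases a short prompt are projected into the text-conditioning space of a video diffusion model. All module names and dimensions here are assumptions.

```python
# Hypothetical sketch of the Rephrasing Adapter described in the abstract.
# Assumption: hidden states from the LLM's next-token prediction (i.e., while
# it rephrases/expands the user's prompt) are collected and projected into
# the conditioning space a video diffusion model expects.
import torch
import torch.nn as nn

class RephrasingAdapter(nn.Module):
    """Maps LLM hidden states to diffusion-model conditioning tokens."""
    def __init__(self, llm_dim: int = 4096, cond_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )
        self.norm = nn.LayerNorm(cond_dim)

    def forward(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (batch, seq_len, llm_dim), gathered during generation
        # of the rephrased prompt; one state per generated token.
        return self.norm(self.proj(llm_hidden))

adapter = RephrasingAdapter()
hidden_states = torch.randn(1, 77, 4096)   # stand-in for real LLM states
cond_tokens = adapter(hidden_states)       # (1, 77, 1024): would feed the
                                           # VDM's cross-attention in place
                                           # of a frozen text encoder
```

In this reading, rephrasing and feature extraction collapse into one step: the same forward pass that expands the prompt also yields the conditioning features.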
Related papers
- Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing [21.525921468472685]
We present a scalable and computationally efficient model that connects pretrained multimodal large language models (MLLMs) with video diffusion models for unified video generation and editing. Our key idea is to exploit the understanding and reasoning capabilities of MLLMs to produce explicit target captions to interpret user instructions. We evaluate the performance of Omni-Video 2 on the FiVE benchmark for fine-grained video editing and the VBench benchmark for text-to-video generation.
arXiv Detail & Related papers (2026-02-09T15:56:05Z) - TRANSPORTER: Transferring Visual Semantics from VLM Manifolds [56.749972238005604]
This paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos. TRANSPORTER learns an optimal transport coupling to VLMs' high-semantic embedding spaces. In turn, logit scores define embedding directions for conditional video generation.
arXiv Detail & Related papers (2025-11-23T09:12:48Z) - Mimir: Improving Video Diffusion Models for Precise Text Understanding [53.72393225042688]
Text serves as the key control signal in video generation due to its narrative nature. The recent success of large language models (LLMs) showcases the power of decoder-only transformers. This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser.
arXiv Detail & Related papers (2024-12-04T07:26:44Z) - When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [118.72266141321647]
Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes. Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z) - Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models [42.891427362223176]
Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities. We propose a novel framework to fully harness the capabilities of LLMs. We further design an LLM-Infused Diffusion Transformer (LI-DiT) based on the framework.
arXiv Detail & Related papers (2024-06-17T17:59:43Z) - Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", with dedicated components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z) - Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models [71.49054220807983]
A prevalent limitation persists in communicating effectively with T2I models, such as Stable Diffusion, using natural language descriptions.
Inspired by the recently released DALLE3, we revisit existing T2I systems that endeavor to align with human intent and introduce a new task: interactive text to image (iT2I).
We present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models.
arXiv Detail & Related papers (2023-10-11T16:53:40Z) - DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation [37.25815760042241]
This paper introduces a new framework, dubbed DirecT2V, for zero-shot text-to-video (T2V) generation.
We equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training.
The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos.
arXiv Detail & Related papers (2023-05-23T17:57:09Z) - GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation [143.81719619351335]
Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation using user-provided captions.
The tight coupling between the current text encoder and image decoder in T2I models makes it challenging to replace or upgrade.
We propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model (see the sketch after this list).
arXiv Detail & Related papers (2023-03-17T15:37:07Z)
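Of the related papers, GlueGen's alignment idea is concrete enough for a brief sketch. The snippet below is a hypothetical PyTorch rendering, with a simple MLP standing in for GlueNet (whose real architecture is not described here): it learns to translate a replacement encoder's features into the frozen text-feature space an existing T2I model already consumes, so the generator itself never needs retraining. The dimensions and the plain MSE objective are illustrative assumptions.

```python
# Hypothetical sketch of GlueGen-style encoder alignment: train a small
# translator so features from a new encoder land in the (frozen) text-
# feature space of an existing T2I model. Architecture/loss are assumptions.
import torch
import torch.nn as nn

class FeatureTranslator(nn.Module):
    """Stand-in for GlueNet: new-encoder features -> T2I text-feature space."""
    def __init__(self, src_dim: int = 512, tgt_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, tgt_dim),
            nn.GELU(),
            nn.Linear(tgt_dim, tgt_dim),
        )

    def forward(self, src_feats: torch.Tensor) -> torch.Tensor:
        # src_feats: (batch, seq_len, src_dim) from the replacement encoder.
        return self.net(src_feats)

translator = FeatureTranslator()
optimizer = torch.optim.AdamW(translator.parameters(), lr=1e-4)

# One training step on paired features for the same caption: pull the
# translated features toward the frozen original text encoder's output.
src = torch.randn(8, 77, 512)   # replacement encoder's features (stand-in)
tgt = torch.randn(8, 77, 768)   # frozen T2I text encoder's output (stand-in)
optimizer.zero_grad()
loss = nn.functional.mse_loss(translator(src), tgt)
loss.backward()
optimizer.step()
```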