PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation
- URL: http://arxiv.org/abs/2412.00596v1
- Date: Sat, 30 Nov 2024 22:02:12 GMT
- Title: PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation
- Authors: Qiyao Xue, Xiangyu Yin, Boyuan Yang, Wei Gao
- Abstract summary: We present PhyT2V, a new data-independent T2V technique that extends current T2V models' video generation capability to out-of-distribution domains.
Our experiments show that PhyT2V improves existing T2V models' adherence to real-world physical rules by 2.3x, and achieves a 35% improvement over T2V prompt enhancers.
- Score: 4.98706730396778
- Abstract: Text-to-video (T2V) generation has recently been enabled by transformer-based diffusion models, but current T2V models fail to adhere to real-world common knowledge and physical rules, due to their limited understanding of physical realism and deficient temporal modeling. Existing solutions are either data-driven or require extra model inputs, and cannot generalize to out-of-distribution domains. In this paper, we present PhyT2V, a new data-independent T2V technique that extends the current T2V model's video generation capability to out-of-distribution domains by enabling chain-of-thought and step-back reasoning in T2V prompting. Our experiments show that PhyT2V improves existing T2V models' adherence to real-world physical rules by 2.3x, and achieves a 35% improvement over T2V prompt enhancers. The source code is available at: https://github.com/pittisl/PhyT2V.
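The iterative self-refinement loop the abstract describes (generate a video, critique its adherence to physical rules, then fold the critique back into the prompt) can be sketched as below. This is a minimal illustration, not the authors' implementation: `generate_video`, `critique`, and `revise_prompt` are hypothetical stand-ins for the T2V model call and the LLM's chain-of-thought/step-back reasoning steps.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RefinementResult:
    prompt: str   # best prompt found
    score: float  # physics-adherence score of that prompt
    rounds: int   # refinement rounds actually run


def refine_prompt_iteratively(
    prompt: str,
    generate_video: Callable[[str], object],
    critique: Callable[[str, object], tuple[float, str]],
    revise_prompt: Callable[[str, str], str],
    max_rounds: int = 3,
    target_score: float = 0.9,
) -> RefinementResult:
    """Generate -> score physical-rule adherence -> rewrite the prompt
    from the critique, stopping early once the score is high enough."""
    best_prompt, best_score = prompt, float("-inf")
    for round_idx in range(1, max_rounds + 1):
        video = generate_video(prompt)
        score, feedback = critique(prompt, video)
        if score > best_score:
            best_prompt, best_score = prompt, score
        if score >= target_score:
            return RefinementResult(prompt, score, round_idx)
        # LLM step: fold the physics critique back into the next prompt.
        prompt = revise_prompt(prompt, feedback)
    return RefinementResult(best_prompt, best_score, max_rounds)
```

In practice `critique` would itself be an LLM (or a video-QA model) judging the generated frames against physical rules; the loop structure is what makes the approach data-independent, since no model weights are updated.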
Related papers
- TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On [78.33688031340698]
TED-VITON is a novel framework that integrates a Garment Semantic (GS) Adapter for enhancing garment-specific features.
These innovations enable state-of-the-art (SOTA) performance in visual quality and text fidelity.
arXiv Detail & Related papers (2024-11-26T01:00:09Z)
- T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design [79.7289790249621]
Our proposed method, T2V-Turbo-v2, introduces a significant advancement by integrating various supervision signals.
We highlight the crucial importance of tailoring datasets to specific learning objectives.
We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver.
arXiv Detail & Related papers (2024-10-08T04:30:06Z)
- VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models [94.25084162939488]
Text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment.
We introduce VideoElevator, a training-free and plug-and-play method, which elevates the performance of T2V using superior capabilities of T2I.
arXiv Detail & Related papers (2024-03-08T16:44:54Z)
- I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models [80.32562822058924]
Text-guided image-to-video (I2V) generation aims to generate a coherent video that preserves the identity of the input image.
I2V-Adapter adeptly propagates the unnoised input image to subsequent noised frames through a cross-frame attention mechanism.
Our experimental results demonstrate that I2V-Adapter is capable of producing high-quality videos.
arXiv Detail & Related papers (2023-12-27T19:11:50Z)
- Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models [66.12367865049572]
Latent Diffusion Models (LDMs) are renowned for their powerful capabilities in image and video synthesis.
We propose FLDM, a framework that achieves high-quality text-to-video (T2V) editing by integrating various T2I and T2V LDMs.
This paper is the first to reveal that T2I and T2V LDMs can complement each other in terms of structure and temporal consistency.
arXiv Detail & Related papers (2023-10-25T06:35:01Z)
- SimDA: Simple Diffusion Adapter for Efficient Video Generation [102.90154301044095]
We propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way.
In addition to T2V generation in the wild, SimDA could also be utilized in one-shot video editing with only 2 minutes tuning.
arXiv Detail & Related papers (2023-08-18T17:58:44Z)
- T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models [29.280739915676737]
We learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals.
Our T2I-Adapter has promising generation quality and a wide range of applications.
arXiv Detail & Related papers (2023-02-16T17:56:08Z)
- Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation [31.882356164068753]
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ massive datasets for training.
We propose Tune-A-Video, which is capable of producing temporally coherent videos across various applications.
arXiv Detail & Related papers (2022-12-22T09:43:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.