EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation
- URL: http://arxiv.org/abs/2504.06861v1
- Date: Wed, 09 Apr 2025 13:11:09 GMT
- Title: EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation
- Authors: Diljeet Jagpal, Xi Chen, Vinay P. Namboodiri
- Abstract summary: Zero-shot, training-free, image-based text-to-video generation is an emerging area that aims to generate videos using existing image-based diffusion models. We provide a model-agnostic approach, using intersections in diffusion trajectories, working only with latent values. An in-context trained LLM is used to generate coherent frame-wise prompts; another is used to identify differences between frames. Our approach results in state-of-the-art performance while being more flexible when working with diverse image-generation models.
- Score: 26.888320234592978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zero-shot, training-free, image-based text-to-video generation is an emerging area that aims to generate videos using existing image-based diffusion models. Current methods in this space require specific architectural changes to image generation models, which limit their adaptability and scalability. In contrast to such methods, we provide a model-agnostic approach. We use intersections in diffusion trajectories, working only with the latent values. We could not obtain localized frame-wise coherence and diversity using only the intersection of trajectories. Thus, we instead use a grid-based approach. An in-context trained LLM is used to generate coherent frame-wise prompts; another is used to identify differences between frames. Based on these, we obtain a CLIP-based attention mask that controls the timing of switching the prompts for each grid cell. Earlier switching results in higher variance, while later switching results in more coherence. Therefore, our approach can ensure appropriate control between coherence and variance for the frames. Our approach results in state-of-the-art performance while being more flexible when working with diverse image-generation models. The empirical analysis using quantitative metrics and user studies confirms our model's superior temporal consistency, visual fidelity and user satisfaction, thus providing a novel way to obtain training-free, image-based text-to-video generation.
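The switching mechanism described in the abstract (a CLIP-based mask that decides, per grid cell, when to swap from the previous frame's prompt to the next one) can be illustrated with a minimal sketch. The code below is an assumption-laden approximation, not the paper's implementation: the CLIP checkpoint, the crop-based grid scoring, and the linear relevance-to-step mapping are all illustrative choices.

```python
# Minimal sketch: derive a per-grid-cell prompt-switch schedule from CLIP
# similarity between each cell of the previous frame and a text description
# of what should change. Checkpoint, grid size, and step mapping are assumed.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def switch_schedule(prev_frame: Image.Image, diff_text: str,
                    grid: int = 4, earliest: int = 5, latest: int = 40) -> torch.Tensor:
    """Return a (grid, grid) tensor of denoising-step indices at which each
    cell switches from the previous frame's prompt to the current one."""
    w, h = prev_frame.size
    cw, ch = w // grid, h // grid
    crops = [prev_frame.crop((c * cw, r * ch, (c + 1) * cw, (r + 1) * ch))
             for r in range(grid) for c in range(grid)]
    inputs = proc(text=[diff_text], images=crops,
                  return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        img = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    sim = (img @ txt.T).squeeze(-1)                            # (grid*grid,)
    rel = (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)   # 0..1 relevance
    # Cells most relevant to the described change switch earliest (more variance);
    # irrelevant cells switch late and stay on the shared trajectory (more coherence).
    steps = earliest + (1.0 - rel) * (latest - earliest)
    return steps.round().long().reshape(grid, grid)
```

A per-frame generation loop would then condition each grid cell on the previous frame's prompt until that cell's switch step and on the new frame's prompt afterwards, trading variance for coherence cell by cell.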
Related papers
- Unified Multimodal Discrete Diffusion [78.48930545306654]
Multimodal generative models that can understand and generate across multiple modalities are dominated by autoregressive (AR) approaches. We explore discrete diffusion models as a unified generative formulation in the joint text and image domain. We present the first Unified Multimodal Discrete Diffusion (UniDisc) model which is capable of jointly understanding and generating text and images.
arXiv Detail & Related papers (2025-03-26T17:59:51Z) - Optical-Flow Guided Prompt Optimization for Coherent Video Generation [51.430833518070145]
We propose a framework called MotionPrompt that guides the video generation process via optical flow. We optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs. This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content.
arXiv Detail & Related papers (2024-11-23T12:26:52Z) - TVG: A Training-free Transition Video Generation Method with Diffusion Models [12.037716102326993]
Transition videos play a crucial role in media production, enhancing the flow and coherence of visual narratives.
Recent advances in diffusion model-based video generation offer new possibilities for creating transitions but face challenges such as poor inter-frame relationship modeling and abrupt content changes.
We propose a novel training-free Transition Video Generation (TVG) approach using video-level diffusion models that addresses these limitations without additional training.
arXiv Detail & Related papers (2024-08-24T00:33:14Z) - RefDrop: Controllable Consistency in Image or Video Generation via Reference Feature Guidance [22.326405355520176]
RefDrop allows users to control the influence of reference context in a direct and precise manner.
Our method also enables more interesting applications, such as the consistent generation of multiple subjects.
arXiv Detail & Related papers (2024-05-27T21:23:20Z) - Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models [96.97910688908956]
We introduce the first zero-shot approach for Video Semantic Segmentation (VSS) based on pre-trained diffusion models.
We propose a framework tailored for VSS based on pre-trained image and video diffusion models.
Experiments show that our proposed approach outperforms existing zero-shot image semantic segmentation approaches.
arXiv Detail & Related papers (2024-05-27T08:39:38Z) - FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z) - Streaming Radiance Fields for 3D Video Synthesis [32.856346090347174]
We present an explicit-grid based method for reconstructing streaming radiance fields for novel view synthesis of real world dynamic scenes.
Experiments on challenging video sequences demonstrate that our approach is capable of achieving a training speed of 15 seconds per frame with competitive rendering quality.
arXiv Detail & Related papers (2022-10-26T16:23:02Z) - Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches.
We propose a novel framework based on DDPM for semantic image synthesis.
arXiv Detail & Related papers (2022-06-30T18:31:51Z) - Meta Internal Learning [88.68276505511922]
Internal learning for single-image generation is a framework where a generator is trained to produce novel images based on a single image.
We propose a meta-learning approach that enables training over a collection of images, in order to model the internal statistics of the sample image more effectively.
Our results show that the models obtained are as suitable as single-image GANs for many common image applications.
arXiv Detail & Related papers (2021-10-06T16:27:38Z) - Multi-Scale Local-Temporal Similarity Fusion for Continuous Sign Language Recognition [4.059599144668737]
Continuous sign language recognition is a publicly significant task that transcribes a sign language video into an ordered gloss sequence.
One promising way is to adopt a one-dimensional convolutional network (1D-CNN) to temporally fuse the sequential frames.
We propose to adaptively fuse local features via temporal similarity for this task.
arXiv Detail & Related papers (2021-07-27T12:06:56Z)