Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation
- URL: http://arxiv.org/abs/2412.07750v2
- Date: Fri, 07 Mar 2025 18:46:34 GMT
- Title: Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation
- Authors: Yuval Atzmon, Rinon Gal, Yoad Tewel, Yoni Kasten, Gal Chechik
- Abstract summary: We investigate how self-attention query features govern motion, structure, and identity in text-to-video models. Our analysis reveals that Q affects not only layout, but that during denoising Q also has a strong effect on subject identity. We demonstrate two applications: (1) a zero-shot motion transfer method that is 20 times more efficient than existing approaches, and (2) a training-free technique for consistent multi-shot video generation.
- Score: 47.61288672890036
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-video diffusion models have shown remarkable progress in generating coherent video clips from textual descriptions. However, the interplay between motion, structure, and identity representations in these models remains under-explored. Here, we investigate how self-attention query features (a.k.a. Q features) simultaneously govern motion, structure, and identity and examine the challenges arising when these representations interact. Our analysis reveals that Q affects not only layout, but that during denoising Q also has a strong effect on subject identity, making it hard to transfer motion without the side-effect of transferring identity. Understanding this dual role enabled us to control query feature injection (Q injection) and demonstrate two applications: (1) a zero-shot motion transfer method that is 20 times more efficient than existing approaches, and (2) a training-free technique for consistent multi-shot video generation, where characters maintain identity across multiple video shots while Q injection enhances motion fidelity.
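To make the Q-injection idea in the abstract concrete, below is a minimal, hypothetical sketch of how query-feature injection could be wired into a diffusers-style self-attention processor: queries are cached during a source (reference) denoising pass and re-injected at selected denoising steps of a target pass. The class name, the `inject_steps`/`q_cache` fields, and the record/inject modes are illustrative assumptions, not the paper's released implementation.

```python
# Hedged sketch of self-attention query (Q) injection, assuming a
# diffusers-style Attention module exposing to_q/to_k/to_v, the
# head_to_batch_dim/batch_to_head_dim helpers, and to_out.
# All names below (QInjectionProcessor, inject_steps, q_cache, mode, step)
# are hypothetical illustrations of the idea described in the abstract.
import torch


class QInjectionProcessor:
    """Cache Q features on a source pass; re-inject them on a target pass."""

    def __init__(self, inject_steps):
        self.inject_steps = set(inject_steps)  # denoising steps where Q is replaced
        self.q_cache = {}                      # step -> cached source query features
        self.mode = "record"                   # "record" on the source video, "inject" on the target
        self.step = 0                          # advanced externally once per denoising step

    def __call__(self, attn, hidden_states, encoder_hidden_states=None, **kwargs):
        is_self_attention = encoder_hidden_states is None
        context = hidden_states if is_self_attention else encoder_hidden_states

        q = attn.to_q(hidden_states)
        k = attn.to_k(context)
        v = attn.to_v(context)

        if is_self_attention:
            if self.mode == "record":
                self.q_cache[self.step] = q.detach()
            elif self.mode == "inject" and self.step in self.inject_steps:
                # Replacing Q transfers the source layout/motion; restricting the
                # set of injected steps is what trades motion fidelity off
                # against unwanted identity transfer.
                q = self.q_cache[self.step]

        q = attn.head_to_batch_dim(q)
        k = attn.head_to_batch_dim(k)
        v = attn.head_to_batch_dim(v)

        probs = attn.get_attention_scores(q, k)  # attention-mask handling omitted for brevity
        out = attn.batch_to_head_dim(torch.bmm(probs, v))
        out = attn.to_out[0](out)  # output projection
        out = attn.to_out[1](out)  # dropout
        return out
```

In this sketch, one would run the source prompt with `mode = "record"` and then the target prompt with `mode = "inject"`, advancing `self.step` once per denoising step; per the abstract, choosing which steps receive injected Q is what allows motion to be transferred without also transferring identity.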
Related papers
- VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [70.61101071902596]
Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines.
We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence.
VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2025-03-19T11:59:14Z) - SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers [30.06494915665044]
We present SkyReels-A1, a framework built upon video diffusion Transformer to facilitate portrait image animation.
SkyReels-A1 capitalizes on the strong generative capabilities of video DiT, enhancing facial motion transfer precision, identity retention, and temporal coherence.
It is highly applicable to domains such as virtual avatars, remote communication, and digital media generation.
arXiv Detail & Related papers (2025-02-15T16:08:40Z) - MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation [55.95148886437854]
Memory-guided EMOtion-aware diffusion (MEMO) is an end-to-end audio-driven portrait animation approach to generate talking videos.
MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
arXiv Detail & Related papers (2024-12-05T18:57:26Z) - VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation [70.61101071902596]
Current generation models excel at generating short clips but still struggle with creating multi-shot, movie-like videos. We propose VideoGen-of-Thought (VGoT), a collaborative and training-free architecture designed specifically for multi-shot video generation. Our experiments demonstrate that VGoT surpasses existing video generation methods in producing high-quality, coherent, multi-shot videos.
arXiv Detail & Related papers (2024-12-03T08:33:50Z) - MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation [7.474418338825595]
MotionCharacter is an efficient and high-fidelity human video generation framework.
We introduce an ID-preserving module to maintain identity fidelity while allowing flexible attribute modifications.
We also introduce ID-consistency and region-aware loss mechanisms, significantly enhancing identity consistency and detail fidelity.
arXiv Detail & Related papers (2024-11-27T12:15:52Z) - One-Shot Learning Meets Depth Diffusion in Multi-Object Videos [0.0]
This paper introduces a novel depth-conditioning approach that enables the generation of coherent and diverse videos from just a single text-video pair.
Our method fine-tunes the pre-trained model to capture continuous motion by employing custom-designed spatial and temporal attention mechanisms.
During inference, we use the DDIM inversion to provide structural guidance for video generation.
arXiv Detail & Related papers (2024-08-29T16:58:10Z) - COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video.
We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing.
COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z) - AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding [24.486705010561067]
The paper introduces AniTalker, a framework designed to generate lifelike talking faces from a single portrait.
AniTalker effectively captures a wide range of facial dynamics, including subtle expressions and head movements.
arXiv Detail & Related papers (2024-05-06T02:32:41Z) - Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
Co-speech gestures, presented as video, can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
arXiv Detail & Related papers (2024-04-02T11:40:34Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its temporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects [61.323597069037056]
Current approaches for personalizing text-to-video generation struggle to handle multiple subjects.
We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects.
arXiv Detail & Related papers (2024-01-18T13:23:51Z) - LatentMan: Generating Consistent Animated Characters using Image Diffusion Models [44.18315132571804]
We propose a zero-shot approach for generating consistent videos of animated characters based on Text-to-Image (T2I) diffusion models.
Our proposed approach outperforms existing zero-shot T2V approaches in generating videos of animated characters in terms of pixel-wise consistency and user preference.
arXiv Detail & Related papers (2023-12-12T10:07:37Z) - MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method, generating a video showing multiple events given multiple individual sentences from the user.
Our method does not require a large-scale video dataset since our method uses a pre-trained text-to-video generative model without a fine-tuning process.
Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z) - VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning [47.61090084143284]
VideoDreamer can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects.
The video generator is further customized for the given multiple subjects by the proposed Disen-Mix Finetuning and Human-in-the-Loop Re-finetuning strategy.
arXiv Detail & Related papers (2023-11-02T04:38:50Z) - Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z) - Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance [36.26032505627126]
Recent advancements in text-to-video synthesis have unveiled the potential to create customized videos from prompts only.
In this paper, we explore customized video generation by utilizing text as context description and motion structure.
Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model.
arXiv Detail & Related papers (2023-06-01T17:43:27Z) - Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)