LivePhoto: Real Image Animation with Text-guided Motion Control
- URL: http://arxiv.org/abs/2312.02928v1
- Date: Tue, 5 Dec 2023 17:59:52 GMT
- Title: LivePhoto: Real Image Animation with Text-guided Motion Control
- Authors: Xi Chen, Zhiheng Liu, Mengting Chen, Yutong Feng, Yu Liu, Yujun Shen,
Hengshuang Zhao
- Abstract summary: This work presents a practical system, named LivePhoto, which allows users to animate an image of their interest with text descriptions.
We first establish a strong baseline that helps a well-learned text-to-image generator (i.e., Stable Diffusion) take an image as a further input.
We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link texts and motions.
- Score: 51.31418077586208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the recent progress in text-to-video generation, existing studies
usually overlook the issue that only spatial contents but not temporal motions
in synthesized videos are under the control of text. Towards such a challenge,
this work presents a practical system, named LivePhoto, which allows users to
animate an image of their interest with text descriptions. We first establish a
strong baseline that helps a well-learned text-to-image generator (i.e., Stable
Diffusion) take an image as a further input. We then equip the improved
generator with a motion module for temporal modeling and propose a carefully
designed training pipeline to better link texts and motions. In particular,
considering the facts that (1) text can only describe motions roughly (e.g.,
regardless of the moving speed) and (2) text may include both content and
motion descriptions, we introduce a motion intensity estimation module as well
as a text re-weighting module to reduce the ambiguity of text-to-motion
mapping. Empirical evidence suggests that our approach is capable of faithfully
decoding motion-related textual instructions into videos, such as actions,
camera movements, or even conjuring new content from thin air (e.g., pouring
water into an empty glass). Interestingly, thanks to the proposed intensity
learning mechanism, our system offers users an additional control signal (i.e.,
the motion intensity) besides text for video customization.
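Concretely, the two disambiguation modules the abstract describes could look something like the minimal PyTorch sketch below. This is an illustration of the idea, not the authors' code; the module names, dimensions, and injection points are all invented.

```python
import torch
import torch.nn as nn

class MotionIntensityEmbedding(nn.Module):
    """Maps a user-chosen intensity level (e.g., 0-9) to an embedding that
    could be injected alongside the diffusion model's timestep conditioning."""
    def __init__(self, num_levels: int = 10, dim: int = 320):
        super().__init__()
        self.table = nn.Embedding(num_levels, dim)

    def forward(self, level: torch.Tensor) -> torch.Tensor:
        return self.table(level)  # (batch, dim)

class TextReweighting(nn.Module):
    """Predicts a weight per text token so that motion words (e.g., "pouring")
    can be emphasized over content words before cross-attention."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.SiLU(), nn.Linear(dim // 4, 1))

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, tokens, dim) from a frozen text encoder
        weights = torch.softmax(self.scorer(text_emb), dim=1)  # (b, n, 1)
        return text_emb * weights * text_emb.shape[1]  # preserve overall scale

# Example: pick intensity level 3 of 10 and reweight a CLIP-sized embedding.
intensity = MotionIntensityEmbedding()(torch.tensor([3]))
reweighted = TextReweighting()(torch.randn(1, 77, 768))
```

The key design point is that both signals reduce the ambiguity of text-to-motion mapping: the intensity embedding supplies what text cannot express (speed/magnitude), while the re-weighting separates motion words from content words.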
Related papers
- Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning [26.44634685830323]
We propose a novel framework called DEcomposed MOtion (DEMO) to enhance motion synthesis in Text-to-Video (T2V) generation.
Our method includes a content encoder for static elements and a motion encoder for temporal dynamics, alongside separate content and motion conditioning mechanisms.
We demonstrate DEMO's superior ability to produce videos with enhanced motion dynamics while maintaining high visual quality.
arXiv Detail & Related papers (2024-10-31T17:59:53Z)
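As a rough illustration of DEMO's decomposition (not its actual architecture; all names below are invented), content features might condition per-frame spatial layers while motion features condition cross-frame temporal layers:

```python
import torch
import torch.nn as nn

class DecomposedConditioning(nn.Module):
    """Sketch: separate projections for static content and temporal motion."""
    def __init__(self, text_dim: int = 768, dim: int = 320):
        super().__init__()
        self.content_proj = nn.Linear(text_dim, dim)  # static elements
        self.motion_proj = nn.Linear(text_dim, dim)   # temporal dynamics

    def forward(self, text_emb, spatial_feats, temporal_feats):
        # text_emb: (b, tokens, text_dim), pooled here for brevity
        pooled = text_emb.mean(dim=1)
        c = self.content_proj(pooled)[:, :, None, None]  # broadcast over H, W
        m = self.motion_proj(pooled)[:, :, None]         # broadcast over frames
        # spatial_feats: (b, dim, H, W); temporal_feats: (b, dim, T)
        return spatial_feats + c, temporal_feats + m
```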
- Unimotion: Unifying 3D Human Motion Synthesis and Understanding [47.18338511861108]
We introduce Unimotion, the first unified multi-task human motion model capable of both flexible motion control and frame-level motion understanding.
Unimotion allows motion to be controlled with global text, local frame-level text, or both at once, providing more flexible control for users.
arXiv Detail & Related papers (2024-09-24T09:20:06Z)
- Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion [9.134743677331517]
We propose leveraging a pre-trained image-to-video model to disentangle appearance from motion.
Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input.
By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve high temporal motion granularity.
Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks.
arXiv Detail & Related papers (2024-08-01T10:55:20Z)
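The inversion idea can be pictured as optimizing learnable per-frame embedding tokens under a frozen image-to-video model. The following is a conceptual skeleton only: the diffusion loss is a placeholder, not a real API, and the token counts are invented.

```python
import torch

num_frames, tokens_per_frame, dim = 16, 4, 768
# Learnable "motion-text" tokens: several embedding tokens per frame.
motion_text = torch.randn(num_frames * tokens_per_frame, dim, requires_grad=True)
opt = torch.optim.Adam([motion_text], lr=1e-3)

def denoising_loss(embedding):
    # Placeholder: in practice this would be the frozen I2V model's diffusion
    # loss on the motion reference video, conditioned on `embedding`.
    return embedding.pow(2).mean()

for step in range(100):
    opt.zero_grad()
    loss = denoising_loss(motion_text)
    loss.backward()
    opt.step()
# At inference, the optimized tokens would condition the frozen model on a
# new target image, transferring the reference motion to it.
```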
- Text-Animator: Controllable Visual Text Video Generation [149.940821790235]
We propose an innovative approach termed Text-Animator for visual text video generation.
Text-Animator contains a text embedding injection module to precisely depict the structures of visual text in generated videos.
We also develop a camera control module and a text refinement module to improve the stability of generated visual text.
arXiv Detail & Related papers (2024-06-25T17:59:41Z)
- Dynamic Typography: Bringing Text to Life via Video Diffusion Prior [73.72522617586593]
We present an automated text animation scheme, termed "Dynamic Typography".
It deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts.
Our technique harnesses vector graphics representations and an end-to-end optimization-based framework.
arXiv Detail & Related papers (2024-04-17T17:59:55Z)
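One plausible reading of the optimization setup (a sketch under stated assumptions, not the paper's implementation) is learning per-frame displacements of a letter's vector control points under a loss distilled from a video prior; the rasterizer and prior below are placeholders, with a simple smoothness term standing in for the real objective:

```python
import torch

num_frames, num_points = 16, 32
base_points = torch.rand(num_points, 2)                    # letter outline
offsets = torch.zeros(num_frames, num_points, 2, requires_grad=True)
opt = torch.optim.Adam([offsets], lr=1e-2)

def video_prior_loss(point_seq):
    # Placeholder: in practice, differentiably rasterize each frame (e.g.,
    # with a tool like diffvg) and apply an SDS-style video-diffusion loss.
    smoothness = (point_seq[1:] - point_seq[:-1]).pow(2).mean()
    return smoothness

for step in range(200):
    opt.zero_grad()
    loss = video_prior_loss(base_points + offsets)  # (frames, points, 2)
    loss.backward()
    opt.step()
```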
- Motion Generation from Fine-grained Textual Descriptions [29.033358642532722]
We build a large-scale language-motion dataset specializing in fine-grained textual descriptions, FineHumanML3D.
We design a new text2motion model, FineMotionDiffuse, making full use of fine-grained textual information.
Our evaluation shows that FineMotionDiffuse trained on FineHumanML3D improves FID by a large margin of 0.38, compared with competitive baselines.
arXiv Detail & Related papers (2024-03-20T11:38:30Z)
- Story-to-Motion: Synthesizing Infinite and Controllable Character Animation from Long Text [14.473103773197838]
A new task, Story-to-Motion, arises when characters are required to perform specific motions based on a long text description.
Previous works in character control and text-to-motion have addressed related aspects, yet a comprehensive solution remains elusive.
We propose a novel system that generates controllable, infinitely long motions and trajectories aligned with the input text.
arXiv Detail & Related papers (2023-11-13T16:22:38Z)
- Being Comes from Not-being: Open-vocabulary Text-to-Motion Generation with Wordless Training [178.09150600453205]
In this paper, we investigate offline open-vocabulary text-to-motion generation in a zero-shot learning manner.
Inspired by the prompt learning in NLP, we pretrain a motion generator that learns to reconstruct the full motion from the masked motion.
Our method reformulates the input text into a masked motion as the prompt for the motion generator to "reconstruct" the motion.
arXiv Detail & Related papers (2022-10-28T06:20:55Z)
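The masked-motion pretraining reads like a standard masked-reconstruction objective: hide part of a motion sequence and train the generator to restore the full sequence. Here is a minimal sketch with a stand-in transformer and invented dimensions, not the paper's code:

```python
import torch
import torch.nn as nn

class MaskedMotionModel(nn.Module):
    def __init__(self, pose_dim: int = 63, dim: int = 256):
        super().__init__()
        self.inp = nn.Linear(pose_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(dim, pose_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, motion, mask):
        # motion: (b, T, pose_dim); mask: (b, T) with True = hidden frame
        x = self.inp(motion)
        x = torch.where(mask[..., None], self.mask_token.expand_as(x), x)
        return self.out(self.enc(x))

model = MaskedMotionModel()
motion = torch.randn(2, 60, 63)          # two sequences of 60 poses
mask = torch.rand(2, 60) < 0.5           # hide roughly half the frames
loss = (model(motion, mask) - motion).pow(2)[mask].mean()
```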
- Animating Pictures with Eulerian Motion Fields [90.30598913855216]
We show a fully automatic method for converting a still image into a realistic animated looping video.
We target scenes with continuous fluid motion, such as flowing water and billowing smoke.
We propose a novel video looping technique that flows features both forward and backward in time and then blends the results.
arXiv Detail & Related papers (2020-11-30T18:59:06Z)
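The forward/backward flow-and-blend step can be sketched as follows. This is a simplification that warps raw pixels with a constant velocity field; the paper Euler-integrates the field and warps learned features instead:

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    # img: (1, C, H, W); flow: (1, H, W, 2) sampling offsets in pixels.
    _, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float()[None] + flow
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1  # normalize for grid_sample
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(img, grid, align_corners=True)

def looping_frames(img, motion_field, n_frames):
    # grid_sample performs backward warping, so sampling offsets are the
    # negated accumulated displacement (constant-velocity simplification
    # of the paper's Euler integration of the motion field).
    frames = []
    for t in range(n_frames):
        alpha = t / max(n_frames - 1, 1)
        fwd = -t * motion_field                   # advected from frame 0
        bwd = (n_frames - 1 - t) * motion_field   # advected back from the end
        frames.append((1 - alpha) * warp(img, fwd) + alpha * warp(img, bwd))
    return frames
```

Blending the forward-warped and backward-warped versions with weights that cross over across the clip is what makes the last frame match the first, so the video loops seamlessly.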
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.