Related papers: Dynamic Concepts Personalization from Single Videos

Dynamic Concepts Personalization from Single Videos

URL: http://arxiv.org/abs/2502.14844v1
Date: Thu, 20 Feb 2025 18:53:39 GMT
Title: Dynamic Concepts Personalization from Single Videos
Authors: Rameen Abdal, Or Patashnik, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman,
Abstract summary: We introduce Set-and-Sequence, a novel framework for personalizing generative video models with dynamic concepts.<n>Our approach imposes a-temporal weight space within an architecture that does not explicitly separate spatial and temporal features.<n>Our framework embeds dynamic concepts into the video model's output domain, enabling unprecedented editability and compositionality.
Score: 92.62863918003575
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Personalizing generative text-to-image models has seen remarkable progress, but extending this personalization to text-to-video models presents unique challenges. Unlike static concepts, personalizing text-to-video models has the potential to capture dynamic concepts, i.e., entities defined not only by their appearance but also by their motion. In this paper, we introduce Set-and-Sequence, a novel framework for personalizing Diffusion Transformers (DiTs)-based generative video models with dynamic concepts. Our approach imposes a spatio-temporal weight space within an architecture that does not explicitly separate spatial and temporal features. This is achieved in two key stages. First, we fine-tune Low-Rank Adaptation (LoRA) layers using an unordered set of frames from the video to learn an identity LoRA basis that represents the appearance, free from temporal interference. In the second stage, with the identity LoRAs frozen, we augment their coefficients with Motion Residuals and fine-tune them on the full video sequence, capturing motion dynamics. Our Set-and-Sequence framework results in a spatio-temporal weight space that effectively embeds dynamic concepts into the video model's output domain, enabling unprecedented editability and compositionality while setting a new benchmark for personalizing dynamic concepts.

Related papers

Zero-Shot Dynamic Concept Personalization with Grid-Based LoRA [84.89284738178932]
We introduce a zero-shot framework for dynamic concept personalization in text-to-video models.<n>Our method leverages structured 2x2 video grids that spatially organize input and output pairs.<n>A dedicated Grid Fill module completes partially observed layouts, producing temporally coherent and identity preserving outputs.
arXiv Detail & Related papers (2025-07-23T22:09:38Z)
SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation [56.90807453045657]
SynMotion is a motion-customized video generation model that jointly leverages semantic guidance and visual adaptation.<n>At the semantic level, we introduce the dual-em semantic comprehension mechanism which disentangles subject and motion representations.<n>At the visual level, we integrate efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence.
arXiv Detail & Related papers (2025-06-30T10:09:32Z)
SkyReels-V2: Infinite-length Film Generative Model [35.00453687783287]
We propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. We establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement.
arXiv Detail & Related papers (2025-04-17T16:37:27Z)
JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation [13.168628936598367]
JointTuner is a framework that enables joint optimization of appearance and motion components.<n>AiT Loss disrupts the flow of appearance-related components, guiding the model to focus exclusively on motion learning.<n>JointTuner is compatible with both UNet-based models and Diffusion Transformer-based models.
arXiv Detail & Related papers (2025-03-31T11:04:07Z)
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding [48.12128042470839]
We propose an integrated Spatial-TempOral dynamic Prompting (STOP) model. It consists of two complementary modules, the intra-frame spatial prompting and inter-frame temporal prompting. STOP consistently achieves superior performance against state-of-the-art methods.
arXiv Detail & Related papers (2025-03-20T09:16:20Z)
VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models [71.9811050853964]
VideoJAM is a novel framework that instills an effective motion prior to video generators. VideoJAM achieves state-of-the-art performance in motion coherence. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation.
arXiv Detail & Related papers (2025-02-04T17:07:10Z)
Separate Motion from Appearance: Customizing Motion via Customizing Text-to-Video Diffusion Models [18.41701130228042]
Motion customization aims to adapt the diffusion model (DM) to generate videos with the motion specified by a set of video clips with the same motion concept.<n>This paper proposes two novel strategies to enhance motion-appearance separation, including temporal attention purification (TAP) and appearance highway (AH)
arXiv Detail & Related papers (2025-01-28T05:40:20Z)
Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models [48.56724784226513]
We propose Customize-A-Video that models the motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal varieties. The proposed modules are trained in a staged pipeline and inferred in a plug-and-play fashion, enabling easy extensions to various downstream tasks.
arXiv Detail & Related papers (2024-02-22T18:38:48Z)
MotionCrafter: One-Shot Motion Customization of Diffusion Models [66.44642854791807]
We introduce MotionCrafter, a one-shot instance-guided motion customization method. MotionCrafter employs a parallel spatial-temporal architecture that injects the reference motion into the temporal component of the base model. During training, a frozen base model provides appearance normalization, effectively separating appearance from motion.
arXiv Detail & Related papers (2023-12-08T16:31:04Z)
Continuous-Time Video Generation via Learning Motion Dynamics with Neural ODE [26.13198266911874]
We propose a novel video generation approach that learns separate distributions for motion and appearance. We employ a two-stage approach where the first stage converts a noise vector to a sequence of keypoints in arbitrary frame rates, and the second stage synthesizes videos based on the given keypoints sequence and the appearance noise vector.
arXiv Detail & Related papers (2021-12-21T03:30:38Z)
Dance In the Wild: Monocular Human Animation with Neural Dynamic Appearance Synthesis [56.550999933048075]
We propose a video based synthesis method that tackles challenges and demonstrates high quality results for in-the-wild videos. We introduce a novel motion signature that is used to modulate the generator weights to capture dynamic appearance changes. We evaluate our method on a set of challenging videos and show that our approach achieves state-of-the art performance both qualitatively and quantitatively.
arXiv Detail & Related papers (2021-11-10T20:18:57Z)
StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN [70.31913835035206]
We present a novel approach to the video synthesis problem that helps to greatly improve visual quality. We make use of a pre-trained StyleGAN network, the latent space of which allows control over the appearance of the objects it was trained for. Our temporal architecture is then trained not on sequences of RGB frames, but on sequences of StyleGAN latent codes.
arXiv Detail & Related papers (2021-07-15T09:58:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.