SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner
- URL: http://arxiv.org/abs/2412.10533v1
- Date: Fri, 13 Dec 2024 20:01:51 GMT
- Title: SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner
- Authors: Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, Nanxuan Zhao, Jing Shi, Tong Sun
- Abstract summary: We present SUGAR, a zero-shot method for subject-driven video customization.
Given an input image, SUGAR is capable of generating videos for the subject and aligning the generation with arbitrary visual attributes.
- Score: 46.75063691424628
- Abstract: We present SUGAR, a zero-shot method for subject-driven video customization. Given an input image, SUGAR is capable of generating videos for the subject contained in the image and aligning the generation with arbitrary visual attributes such as style and motion specified by user-input text. Unlike previous methods, which require test-time fine-tuning or fail to generate text-aligned videos, SUGAR achieves superior results without extra cost at test time. To enable zero-shot capability, we introduce a scalable pipeline to construct a synthetic dataset specifically designed for subject-driven customization, yielding 2.5 million image-video-text triplets. Additionally, we propose several methods to enhance our model, including special attention designs, improved training strategies, and a refined sampling algorithm. Extensive experiments show that, compared to previous methods, SUGAR achieves state-of-the-art results in identity preservation, video dynamics, and video-text alignment for subject-driven video customization, demonstrating the effectiveness of our proposed method.
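The abstract attributes part of SUGAR's zero-shot capability to special attention designs that condition generation on the reference subject image alongside the text prompt. The exact design is not spelled out here, so the snippet below is only a hedged sketch of one common pattern for this kind of conditioning: a decoupled cross-attention that attends to text tokens and reference-image tokens separately and sums the results. All module and parameter names are hypothetical and do not come from the paper.

```python
# Hypothetical sketch of decoupled cross-attention for subject conditioning.
# This is NOT SUGAR's published architecture; it only illustrates the common
# pattern of attending to text tokens and reference-image tokens separately.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoupledCrossAttention(nn.Module):
    def __init__(self, dim: int, ctx_dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.q = nn.Linear(dim, dim, bias=False)
        # Separate key/value projections for the text and image conditions.
        self.kv_text = nn.Linear(ctx_dim, dim * 2, bias=False)
        self.kv_image = nn.Linear(ctx_dim, dim * 2, bias=False)
        self.out = nn.Linear(dim, dim)

    def _attend(self, q, kv):
        k, v = kv.chunk(2, dim=-1)

        def split(t):  # (B, N, D) -> (B, heads, N, D/heads)
            b, n, d = t.shape
            return t.view(b, n, self.num_heads, d // self.num_heads).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        b, h, n, hd = out.shape
        return out.transpose(1, 2).reshape(b, n, h * hd)

    def forward(self, video_tokens, text_tokens, image_tokens, image_scale=1.0):
        q = self.q(video_tokens)
        text_out = self._attend(q, self.kv_text(text_tokens))
        image_out = self._attend(q, self.kv_image(image_tokens))
        # Sum the two attention branches; the image branch carries subject identity.
        return self.out(text_out + image_scale * image_out)


# Toy usage: 16 video latent tokens, 77 text tokens, 4 reference-image tokens.
attn = DecoupledCrossAttention(dim=64, ctx_dim=32)
out = attn(torch.randn(1, 16, 64), torch.randn(1, 77, 32), torch.randn(1, 4, 32))
print(out.shape)  # torch.Size([1, 16, 64])
```

In designs of this kind, a scale on the image branch (image_scale above) typically trades identity preservation against text alignment.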
Related papers
- Multi-subject Open-set Personalization in Video Generation [110.02124633005516]
We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities.
Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt.
Our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2025-01-10T18:59:54Z)
- CustomTTT: Motion and Appearance Customized Video Generation via Test-Time Training [35.43906754134253]
We propose CustomTTT, which allows the appearance and motion of a given video to be jointly customized with ease.
Since each LoRA is trained individually, we propose a novel test-time training technique to update the parameters after they are combined (see the sketch below).
Our method outperforms several state-of-the-art works in both qualitative and quantitative evaluations.
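The CustomTTT summary above mentions individually trained appearance and motion LoRAs whose parameters are updated after combination. As a rough, generic illustration rather than the paper's actual procedure, the sketch below stacks two low-rank adapters on a frozen linear layer and runs a single test-time optimization step on a dummy objective; the class, objective, and hyperparameters are assumptions made for this example.

```python
# Generic sketch of stacking two independently trained LoRAs (e.g. one for
# appearance, one for motion) on a frozen linear layer, then fine-tuning only
# the LoRA parameters at test time. Illustrative only, not CustomTTT's method.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, n_adapters: int = 2):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # keep the pretrained weight frozen
        in_f, out_f = base.in_features, base.out_features
        self.downs = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, in_f) * 0.01) for _ in range(n_adapters)]
        )
        self.ups = nn.ParameterList(
            [nn.Parameter(torch.zeros(out_f, rank)) for _ in range(n_adapters)]
        )
        self.scales = [1.0] * n_adapters  # per-adapter blending weights

    def forward(self, x):
        y = self.base(x)
        # Each adapter contributes a low-rank residual: x @ down^T @ up^T.
        for down, up, s in zip(self.downs, self.ups, self.scales):
            y = y + s * (x @ down.t() @ up.t())
        return y


layer = LoRALinear(nn.Linear(64, 64))
opt = torch.optim.AdamW([p for p in layer.parameters() if p.requires_grad], lr=1e-4)

# One illustrative "test-time" step on a dummy reconstruction objective.
x = torch.randn(8, 64)
loss = (layer(x) - x).pow(2).mean()
loss.backward()
opt.step()
```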
arXiv Detail & Related papers (2024-12-20T08:05:13Z)
- DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control [48.41743234012456]
DisenStudio is a novel framework that can generate text-guided videos for multiple customized subjects.
DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism (sketched below).
We conduct extensive experiments to demonstrate our proposed DisenStudio significantly outperforms existing methods in various metrics.
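The DisenStudio summary above refers to a spatial-disentangled cross-attention mechanism for multi-subject generation. The snippet below is a simplified guess at the general idea, masking cross-attention so that latent tokens inside each subject's region only attend to that subject's text tokens; it is not the paper's implementation, and the region/subject index tensors are assumed inputs for illustration.

```python
# Rough sketch of region-masked cross-attention: latent tokens inside each
# subject's spatial region attend only to that subject's text tokens.
# A simplified illustration, not DisenStudio's actual mechanism.
import torch
import torch.nn.functional as F


def region_masked_cross_attention(q, k, v, region_ids, subject_ids):
    """
    q:           (B, Nq, D) latent/query tokens
    k, v:        (B, Nk, D) text key/value tokens
    region_ids:  (B, Nq)    subject index assigned to each spatial token
    subject_ids: (B, Nk)    subject index each text token belongs to
    """
    # Allow attention only where the query's region matches the key's subject.
    mask = region_ids.unsqueeze(-1) == subject_ids.unsqueeze(1)  # (B, Nq, Nk)
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


B, Nq, Nk, D = 1, 16, 6, 32
q, k, v = torch.randn(B, Nq, D), torch.randn(B, Nk, D), torch.randn(B, Nk, D)
region_ids = torch.randint(0, 2, (B, Nq))         # two subjects, per-token region map
subject_ids = torch.tensor([[0, 0, 0, 1, 1, 1]])  # first 3 text tokens -> subject 0
out = region_masked_cross_attention(q, k, v, region_ids, subject_ids)
print(out.shape)  # torch.Size([1, 16, 32])
```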
arXiv Detail & Related papers (2024-05-21T13:44:55Z)
- CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects [61.323597069037056]
Current approaches for personalizing text-to-video generation struggle to handle multiple subjects.
We propose CustomVideo, a novel framework that can generate identity-preserving videos guided by multiple subjects.
arXiv Detail & Related papers (2024-01-18T13:23:51Z)
- Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps such as edge and depth maps (see the sketch below).
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
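Control-A-Video, as summarized above, conditions generation on reference control maps such as edge and depth maps. The following is a minimal, ControlNet-style sketch of one way such a signal can be injected, encoding the per-frame control map and adding it as a residual to the video latent; the encoder architecture and the assumed 8x latent downsampling are illustrative choices, not the paper's design.

```python
# Minimal ControlNet-style sketch: encode a per-frame control map (edge/depth)
# and add it as a residual to the video latent before denoising.
# Illustrative only; Control-A-Video's conditioning design is more involved.
import torch
import torch.nn as nn


class ControlMapEncoder(nn.Module):
    def __init__(self, latent_channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, 32, 3, stride=8, padding=1), nn.SiLU(),  # assumed 8x latent downsampling
            nn.Conv2d(32, latent_channels, 3, padding=1),
        )

    def forward(self, control_map):
        return self.net(control_map)


enc = ControlMapEncoder()
frames = 8
latents = torch.randn(frames, 4, 32, 32)  # per-frame video latents (assumed shape)
edges = torch.rand(frames, 1, 256, 256)   # per-frame edge maps
latents = latents + enc(edges)            # inject the control signal as a residual
print(latents.shape)  # torch.Size([8, 4, 32, 32])
```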
arXiv Detail & Related papers (2023-05-23T09:03:19Z)
- InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
- Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer [13.098901971644656]
This paper proposes a zero-shot video stylization method named Style-A-Video.
It uses a generative pre-trained transformer with an image latent diffusion model to achieve concise text-controlled video stylization.
Tests show that it attains superior content preservation and stylistic performance while incurring lower computational cost than previous solutions.
arXiv Detail & Related papers (2023-05-09T14:03:27Z)