Bring Your Dreams to Life: Continual Text-to-Video Customization
- URL: http://arxiv.org/abs/2512.05802v2
- Date: Wed, 10 Dec 2025 08:51:47 GMT
- Title: Bring Your Dreams to Life: Continual Text-to-Video Customization
- Authors: Jiahua Dong, Xudong Wang, Wenqi Liang, Zongyan Han, Meng Cao, Duzhen Zhang, Hanbin Zhao, Zhi Han, Salman Khan, Fahad Shahbaz Khan
- Abstract summary: We develop a novel Continual Customized Video Diffusion (CCVD) model to tackle forgetting and concept neglect. To tackle concept neglect, we develop a controllable conditional synthesis that enhances regional features and aligns video contexts with user conditions. Our CCVD outperforms existing CTVG baselines on both the DreamVideo and Wan 2.1 backbones.
- Score: 76.70414091514704
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Customized text-to-video generation (CTVG) has recently witnessed great progress in generating tailored videos from user-specific text. However, most CTVG methods assume that personalized concepts remain static and do not expand incrementally over time. Additionally, they struggle with forgetting and concept neglect when continuously learning new concepts, including subjects and motions. To resolve the above challenges, we develop a novel Continual Customized Video Diffusion (CCVD) model, which can continuously learn new concepts to generate videos across various text-to-video generation tasks by tackling forgetting and concept neglect. To address catastrophic forgetting, we introduce a concept-specific attribute retention module and a task-aware concept aggregation strategy. They can capture the unique characteristics and identities of old concepts during training, while combining all subject and motion adapters of old concepts based on their relevance during testing. Besides, to tackle concept neglect, we develop a controllable conditional synthesis to enhance regional features and align video contexts with user conditions, by incorporating layer-specific region attention-guided noise estimation. Extensive experimental comparisons demonstrate that our CCVD outperforms existing CTVG baselines on both the DreamVideo and Wan 2.1 backbones. The code is available at https://github.com/JiahuaDong/CCVD.
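The abstract's task-aware concept aggregation strategy combines the subject and motion adapters of old concepts based on their relevance at test time. A minimal numpy sketch of that idea is shown below; this is a hypothetical illustration, not the authors' code, and the function names, embedding shapes, and softmax relevance scoring are all assumptions made for the example.

```python
# Hypothetical sketch of relevance-weighted adapter aggregation
# (illustrative only; not the CCVD implementation).
import numpy as np

def aggregate_adapters(prompt_emb, concept_embs, adapter_deltas, temperature=0.1):
    """Blend per-concept adapter weight deltas by softmax relevance to a prompt.

    prompt_emb:     (d,)       embedding of the user's test prompt
    concept_embs:   (k, d)     one embedding per previously learned concept
    adapter_deltas: (k, m, n)  one low-rank weight delta per concept
    """
    # cosine relevance between the prompt and each stored concept
    sims = concept_embs @ prompt_emb
    sims /= np.linalg.norm(concept_embs, axis=1) * np.linalg.norm(prompt_emb) + 1e-8
    # softmax turns relevance scores into mixing weights that sum to 1
    weights = np.exp(sims / temperature)
    weights /= weights.sum()
    # relevance-weighted sum over the k adapters -> one merged (m, n) delta
    return np.tensordot(weights, adapter_deltas, axes=1)

rng = np.random.default_rng(0)
merged = aggregate_adapters(rng.normal(size=8),
                            rng.normal(size=(3, 8)),
                            rng.normal(size=(3, 4, 4)))
print(merged.shape)  # (4, 4)
```

The temperature controls how sharply the mixture concentrates on the single most relevant concept versus blending several of them.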
Related papers
- Zero-Shot Dynamic Concept Personalization with Grid-Based LoRA [84.89284738178932]
We introduce a zero-shot framework for dynamic concept personalization in text-to-video models. Our method leverages structured 2x2 video grids that spatially organize input and output pairs. A dedicated Grid Fill module completes partially observed layouts, producing temporally coherent and identity-preserving outputs.
arXiv Detail & Related papers (2025-07-23T22:09:38Z) - SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction [65.15449703659772]
Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. We propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state of the art in concept-aware video object segmentation.
arXiv Detail & Related papers (2025-07-21T17:59:02Z) - T2VUnlearning: A Concept Erasing Method for Text-to-Video Diffusion Models [10.59080421751043]
Recent advances in text-to-video (T2V) diffusion models have significantly enhanced the quality of generated videos. However, their capability to produce explicit or harmful content introduces new challenges related to misuse and potential rights violations. We propose unlearning-based concept erasing as a solution.
arXiv Detail & Related papers (2025-05-23T06:56:32Z) - Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval [26.40393400497247]
Video retrieval requires aligning visual content with corresponding natural language descriptions. In this paper, we introduce Modality Auxiliary Concepts for Video Retrieval (MAC-VR). We propose to align modalities in a latent space, along with learning and aligning auxiliary latent concepts.
arXiv Detail & Related papers (2025-04-02T10:56:01Z) - ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning [40.70596166863986]
Multi-Concept Video Customization (MCVC) remains a significant challenge. We introduce ConceptMaster, a novel framework that effectively addresses identity decoupling issues. We show that ConceptMaster significantly outperforms previous methods on video customization tasks.
arXiv Detail & Related papers (2025-01-08T18:59:01Z) - How to Continually Adapt Text-to-Image Diffusion Models for Flexible Customization? [91.49559116493414]
We propose a novel Concept-Incremental text-to-image Diffusion Model (CIDM).
It can resolve catastrophic forgetting and concept neglect to learn new customization tasks in a concept-incremental manner.
Experiments validate that our CIDM surpasses existing custom diffusion models.
arXiv Detail & Related papers (2024-10-23T06:47:29Z) - CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities [56.5742116979914]
CustomCrafter preserves the model's motion generation and concept composition abilities without additional video data or fine-tuning for recovery. For motion generation, we observe that VDMs tend to recover the motion of the video in the early stage of denoising. In the later stage of denoising, we restore this module to repair the appearance details of the specified subject.
arXiv Detail & Related papers (2024-08-23T17:26:06Z) - Non-confusing Generation of Customized Concepts in Diffusion Models [135.4385383284657]
We tackle the common challenge of inter-concept visual confusion in compositional concept generation using text-guided diffusion models (TGDMs).
Existing customized generation methods only focus on fine-tuning the second stage while overlooking the first one.
We propose a simple yet effective solution called CLIF: contrastive image-language fine-tuning.
arXiv Detail & Related papers (2024-05-11T05:01:53Z)
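The CLIF entry above proposes contrastive image-language fine-tuning to reduce inter-concept confusion. A minimal sketch of a symmetric contrastive (InfoNCE-style) objective over matched image/text embedding pairs is given below; this is a generic illustration of the objective family, not the paper's implementation, and all function and variable names are assumptions.

```python
# Minimal sketch of a symmetric contrastive image-language objective
# (generic InfoNCE; illustrative only, not the CLIF implementation).
import numpy as np

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is a matched pair."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (n, n); diagonal = matched pairs

    def xent(l):
        # numerically stable log-softmax per row, then pick the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each concept's image embedding toward its own caption and pushes it away from the other concepts' captions, which is the mechanism that discourages inter-concept confusion.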
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.