Related papers: Zero-Shot Dynamic Concept Personalization with Grid-Based LoRA

Zero-Shot Dynamic Concept Personalization with Grid-Based LoRA

URL: http://arxiv.org/abs/2507.17963v1
Date: Wed, 23 Jul 2025 22:09:38 GMT
Title: Zero-Shot Dynamic Concept Personalization with Grid-Based LoRA
Authors: Rameen Abdal, Or Patashnik, Ekaterina Deyneka, Hao Chen, Aliaksandr Siarohin, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman,
Abstract summary: We introduce a zero-shot framework for dynamic concept personalization in text-to-video models.<n>Our method leverages structured 2x2 video grids that spatially organize input and output pairs.<n>A dedicated Grid Fill module completes partially observed layouts, producing temporally coherent and identity preserving outputs.
Score: 84.89284738178932
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in text-to-video generation have enabled high-quality synthesis from text and image prompts. While the personalization of dynamic concepts, which capture subject-specific appearance and motion from a single video, is now feasible, most existing methods require per-instance fine-tuning, limiting scalability. We introduce a fully zero-shot framework for dynamic concept personalization in text-to-video models. Our method leverages structured 2x2 video grids that spatially organize input and output pairs, enabling the training of lightweight Grid-LoRA adapters for editing and composition within these grids. At inference, a dedicated Grid Fill module completes partially observed layouts, producing temporally coherent and identity preserving outputs. Once trained, the entire system operates in a single forward pass, generalizing to previously unseen dynamic concepts without any test-time optimization. Extensive experiments demonstrate high-quality and consistent results across a wide range of subjects beyond trained concepts and editing scenarios.

Related papers

Compositional Video Synthesis by Temporal Object-Centric Learning [3.2228025627337864]
We present a novel framework for compositional video synthesis that leverages temporally consistent object-centric representations.<n>Our approach explicitly captures temporal dynamics by learning pose invariant object-centric slots and conditioning them on pretrained diffusion models.<n>This design enables high-quality, pixel-level video synthesis with superior temporal coherence.
arXiv Detail & Related papers (2025-07-28T14:11:04Z)
DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation [14.34140569565309]
DyST-XL is a framework that enhances off-the-shelf text-to-video models through frame-aware control.<n>The code is released in https://github.com/XiaoBuL/DyST-XL.
arXiv Detail & Related papers (2025-04-21T11:41:22Z)
SkyReels-V2: Infinite-length Film Generative Model [35.00453687783287]
We propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework.<n>We establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement.
arXiv Detail & Related papers (2025-04-17T16:37:27Z)
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding [48.12128042470839]
We propose an integrated Spatial-TempOral dynamic Prompting (STOP) model.<n>It consists of two complementary modules, the intra-frame spatial prompting and inter-frame temporal prompting.<n>STOP consistently achieves superior performance against state-of-the-art methods.
arXiv Detail & Related papers (2025-03-20T09:16:20Z)
Dynamic Concepts Personalization from Single Videos [92.62863918003575]
We introduce Set-and-Sequence, a novel framework for personalizing generative video models with dynamic concepts.<n>Our approach imposes a-temporal weight space within an architecture that does not explicitly separate spatial and temporal features.<n>Our framework embeds dynamic concepts into the video model's output domain, enabling unprecedented editability and compositionality.
arXiv Detail & Related papers (2025-02-20T18:53:39Z)
VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining. We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts. We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation [97.96178992465511]
We argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world videos as time progresses. To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics.
arXiv Detail & Related papers (2024-06-12T21:41:32Z)
Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users. The method is based on a general framework that bypasses the lengthy optimization required by previous approaches. We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
Condensing a Sequence to One Informative Frame for Video Recognition [113.3056598548736]
This paper studies a two-step alternative that first condenses the video sequence to an informative "frame" A valid question is how to define "useful information" and then distill from a sequence down to one synthetic frame. IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks.
arXiv Detail & Related papers (2022-01-11T16:13:43Z)
Editable Free-viewpoint Video Using a Layered Neural Representation [35.44420164057911]
We propose the first approach for editable free-viewpoint video generation for large-scale dynamic scenes using only sparse 16 cameras. The core of our approach is a new layered neural representation, where each dynamic entity including the environment itself is formulated into a space-time coherent neural layered radiance representation called ST-NeRF. Experiments demonstrate the effectiveness of our approach to achieve high-quality, photo-realistic, and editable free-viewpoint video generation for dynamic scenes.
arXiv Detail & Related papers (2021-04-30T06:50:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.