Frame In-N-Out: Unbounded Controllable Image-to-Video Generation
- URL: http://arxiv.org/abs/2505.21491v1
- Date: Tue, 27 May 2025 17:56:07 GMT
- Title: Frame In-N-Out: Unbounded Controllable Image-to-Video Generation
- Authors: Boyang Wang, Xuweiyi Chen, Matheus Gadelha, Zezhou Cheng
- Abstract summary: Controllability, temporal coherence, and detail synthesis remain the most critical challenges in video generation. We focus on a commonly used yet underexplored cinematic technique known as Frame In and Frame Out. We introduce a new dataset curated semi-automatically, a comprehensive evaluation protocol targeting this setting, and an efficient identity-preserving motion-controllable video Diffusion Transformer architecture.
- Score: 12.556320730925702
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Controllability, temporal coherence, and detail synthesis remain the most critical challenges in video generation. In this paper, we focus on a commonly used yet underexplored cinematic technique known as Frame In and Frame Out. Specifically, starting from image-to-video generation, users can control the objects in the image to naturally leave the scene or provide brand-new identity references to enter the scene, guided by a user-specified motion trajectory. To support this task, we introduce a new dataset curated semi-automatically, a comprehensive evaluation protocol targeting this setting, and an efficient identity-preserving motion-controllable video Diffusion Transformer architecture. Our evaluation shows that our proposed approach significantly outperforms existing baselines.
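The abstract names a user-specified motion trajectory as the control signal but does not say how it is encoded. Below is a minimal sketch of one common choice, assuming the trajectory is a list of pixel-space waypoints: rasterize it into per-frame Gaussian heatmaps that could be stacked with the conditioning image of a video diffusion model. The function name, resampling scheme, and sigma are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's implementation): rasterize a user-drawn
# trajectory into per-frame Gaussian heatmaps that a video diffusion model
# could take as an extra conditioning channel.
import numpy as np

def trajectory_to_heatmaps(points, num_frames, height, width, sigma=8.0):
    """points: list of (x, y) waypoints in pixel coordinates."""
    points = np.asarray(points, dtype=np.float32)        # (K, 2)
    # Resample the polyline so each frame gets one target location.
    t_key = np.linspace(0.0, 1.0, len(points))
    t_frame = np.linspace(0.0, 1.0, num_frames)
    xs = np.interp(t_frame, t_key, points[:, 0])
    ys = np.interp(t_frame, t_key, points[:, 1])

    yy, xx = np.mgrid[0:height, 0:width]
    heatmaps = np.zeros((num_frames, height, width), dtype=np.float32)
    for f in range(num_frames):
        d2 = (xx - xs[f]) ** 2 + (yy - ys[f]) ** 2
        heatmaps[f] = np.exp(-d2 / (2.0 * sigma ** 2))
    return heatmaps  # (T, H, W), values in [0, 1]

# Example: an object that exits to the right ("frame out").
maps = trajectory_to_heatmaps([(100, 200), (480, 210), (700, 220)],
                              num_frames=16, height=480, width=640)
```

A waypoint placed outside the canvas, as in the example, yields a near-empty heatmap, which loosely mirrors the frame-out case where the object leaves the scene.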
Related papers
- Compositional Video Synthesis by Temporal Object-Centric Learning [3.2228025627337864]
We present a novel framework for compositional video synthesis that leverages temporally consistent object-centric representations. Our approach explicitly captures temporal dynamics by learning pose invariant object-centric slots and conditioning them on pretrained diffusion models. This design enables high-quality, pixel-level video synthesis with superior temporal coherence.
arXiv Detail & Related papers (2025-07-28T14:11:04Z)
- Motion-Aware Concept Alignment for Consistent Video Editing [57.08108545219043]
We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a training-free framework bridging the gap between image-domain semantic mixing and video. Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object within the video. We evaluate MoCA's performance using the standard SSIM, image-level LPIPS, and temporal LPIPS, and introduce a novel metric, CASS (Conceptual Alignment Shift Score), to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames.
arXiv Detail & Related papers (2025-06-01T13:28:04Z)
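The MoCA-Video entry above evaluates with SSIM, image-level LPIPS, and temporal LPIPS. Here is a minimal sketch of a temporal-LPIPS-style score using the public lpips package, assuming it is defined as the mean perceptual distance between consecutive frames; the paper's exact formulation may differ.

```python
# Sketch of a "temporal LPIPS" style score: mean LPIPS between consecutive
# frames, a common proxy for temporal smoothness.
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net='alex')  # perceptual distance network

def temporal_lpips(video):
    """video: float tensor (T, 3, H, W) with values in [-1, 1]."""
    with torch.no_grad():
        d = loss_fn(video[:-1], video[1:])  # distance between frame t and t+1
    return d.mean().item()
```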
- Enhancing Self-Supervised Fine-Grained Video Object Tracking with Dynamic Memory Prediction [5.372301053935416]
We introduce a Dynamic Memory Prediction framework that utilizes multiple reference frames to concisely enhance frame reconstruction. Our algorithm outperforms the state-of-the-art self-supervised techniques on two fine-grained video object tracking tasks.
arXiv Detail & Related papers (2025-04-30T14:29:04Z)
- Subject-driven Video Generation via Disentangled Identity and Motion [52.54835936914813]
We propose to train a subject-driven customized video generation model by decoupling subject-specific learning from temporal dynamics, in a zero-shot manner and without additional tuning. Our method achieves strong subject consistency and scalability, outperforming existing video customization models in zero-shot settings.
arXiv Detail & Related papers (2025-04-23T06:48:31Z)
- Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion. We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
arXiv Detail & Related papers (2025-03-25T17:58:48Z)
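The Tracktention entry above attends along point tracks rather than across whole frames. A rough sketch of that idea, assuming precomputed tracks in normalized coordinates: sample frame features at each track's location with grid_sample, then run temporal self-attention over each track's sequence. This is not the authors' layer; names and shapes are assumptions.

```python
# Rough sketch: per-track temporal attention over features sampled along
# point tracks (illustrative, not the published Tracktention Layer).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrackTemporalAttention(nn.Module):
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, feats, tracks):
        """feats: (T, C, H, W) frame features; tracks: (T, N, 2) xy in [-1, 1]."""
        grid = tracks.unsqueeze(2)                                  # (T, N, 1, 2)
        sampled = F.grid_sample(feats, grid, align_corners=False)   # (T, C, N, 1)
        tokens = sampled.squeeze(-1).permute(2, 0, 1)               # (N, T, C): one sequence per track
        out, _ = self.attn(tokens, tokens, tokens)                  # temporal attention per track
        return out                                                  # (N, T, C) refined track features

# Toy usage: 8 frames, 64 channels, 50 tracks.
layer = TrackTemporalAttention(channels=64)
refined = layer(torch.randn(8, 64, 32, 32), torch.rand(8, 50, 2) * 2 - 1)
```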
- CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers [15.558659099600822]
CustomVideoX capitalizes on pre-trained video networks by exclusively training the LoRA parameters to extract reference features. We propose 3D Reference Attention, which enables direct and simultaneous engagement of reference image features. Experimental results show that CustomVideoX significantly outperforms existing methods in terms of video consistency and quality.
arXiv Detail & Related papers (2025-02-10T14:50:32Z)
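The CustomVideoX entry above trains only LoRA parameters on top of a frozen pretrained video network. Below is a minimal sketch of a LoRA-style linear adapter; the rank, scaling, and initialization are generic defaults, not the paper's configuration.

```python
# Minimal LoRA sketch: freeze a pretrained linear layer and train only a
# low-rank update (illustrative defaults).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # frozen pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                  # start as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```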
- Multi-subject Open-set Personalization in Video Generation [110.02124633005516]
We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt. Our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2025-01-10T18:59:54Z)
- MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing [90.06041718086317]
We propose a unified Multi-alignment Diffusion, dubbed as MagDiff, for both tasks of high-fidelity video generation and editing.
The proposed MagDiff introduces three types of alignments, including subject-driven alignment, adaptive prompts alignment, and high-fidelity alignment.
arXiv Detail & Related papers (2023-11-29T03:36:07Z)
- Aggregating Nearest Sharp Features via Hybrid Transformers for Video Deblurring [70.06559269075352]
We propose a video deblurring method that leverages both neighboring frames and existing sharp frames using hybrid Transformers for feature aggregation. To aggregate nearest sharp features from detected sharp frames, we utilize a global Transformer with multi-scale matching capability. Our proposed method outperforms state-of-the-art video deblurring methods as well as event-driven video deblurring methods in terms of quantitative metrics and visual quality.
arXiv Detail & Related papers (2023-09-13T16:12:11Z)
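The deblurring entry above aggregates features from detected sharp frames. The summary does not describe the detector; a common simple heuristic, sketched below with OpenCV, scores sharpness by the variance of the Laplacian and keeps frames above an illustrative threshold.

```python
# Simple sharp-frame heuristic (not necessarily the paper's detector):
# variance of the Laplacian as a sharpness score.
import cv2

def sharpness(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def detect_sharp_frames(frames, threshold=100.0):
    """frames: list of BGR images; returns indices of frames deemed sharp."""
    return [i for i, f in enumerate(frames) if sharpness(f) > threshold]
```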
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
- Unfolding a blurred image [36.519356428362286]
We learn motion representation from sharp videos in an unsupervised manner.
We then train a convolutional recurrent video autoencoder network that performs a surrogate task of video reconstruction.
It is employed for guided training of a motion encoder for blurred images.
This network extracts embedded motion information from the blurred image to generate a sharp video in conjunction with the trained recurrent video decoder.
arXiv Detail & Related papers (2022-01-28T09:39:55Z)
- Siamese Network with Interactive Transformer for Video Object Segmentation [34.202137199782804]
We propose a network with a specifically designed interactive transformer, called SITVOS, to enable effective context propagation from historical to current frames.
We employ a single shared backbone to extract features from both past and current frames, which enables feature reuse and is more efficient than existing methods.
arXiv Detail & Related papers (2021-12-28T03:38:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.