Resource-Efficient Motion Control for Video Generation via Dynamic Mask Guidance
- URL: http://arxiv.org/abs/2503.18386v1
- Date: Mon, 24 Mar 2025 06:53:08 GMT
- Title: Resource-Efficient Motion Control for Video Generation via Dynamic Mask Guidance
- Authors: Sicong Feng, Jielong Yang, Li Peng
- Abstract summary: Mask-guided video generation can control video generation through mask motion sequences. Our model enhances existing architectures by incorporating foreground masks for precise text-position matching and motion trajectory control. This approach excels in various video generation tasks, such as video editing and generating artistic videos, outperforming previous methods in terms of consistency and quality.
- Score: 2.5941932242768457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in diffusion models bring new vitality to visual content creation. However, current text-to-video generation models still face significant challenges, such as high training costs, substantial data requirements, and difficulty in maintaining consistency between the given text and the motion of the foreground object. To address these challenges, we propose mask-guided video generation, which controls video generation through mask motion sequences while requiring limited training data. Our model enhances existing architectures by incorporating foreground masks for precise text-position matching and motion trajectory control. Through mask motion sequences, we guide the video generation process to maintain consistent foreground objects throughout the sequence. Additionally, through a first-frame sharing strategy and an autoregressive extension approach, we achieve more stable and longer video generation. Extensive qualitative and quantitative experiments demonstrate that this approach excels in various video generation tasks, such as video editing and generating artistic videos, outperforming previous methods in terms of consistency and quality. Our generated results can be viewed in the supplementary materials.
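The abstract describes two mechanisms for longer, more stable generation: a shared first frame that anchors every generated segment, and autoregressive extension that conditions each new segment on the previous one, all steered by a mask motion sequence. Below is a minimal Python sketch of that control loop. The pipeline interface (`pipe` and its `first_frame`, `init_frame`, `masks`, and `num_frames` arguments) is a hypothetical stand-in for illustration, not the paper's actual API.

```python
from typing import Callable, List, Sequence

def generate_long_video(
    pipe: Callable,              # hypothetical mask-conditioned video diffusion pipeline
    prompt: str,
    first_frame,                 # shared first frame (image or latent)
    mask_sequence: Sequence,     # per-frame foreground masks (the mask motion sequence)
    chunk_len: int = 16,
    num_chunks: int = 4,
) -> List:
    """Sketch of first-frame sharing plus autoregressive extension.

    Assumed interface, for illustration only: `pipe` returns a list of
    `num_frames` frames conditioned on the prompt, anchor frame,
    previous frame, and masks.
    """
    frames: List = []
    prev_last = first_frame                      # autoregressive handoff between chunks
    for c in range(num_chunks):
        masks = mask_sequence[c * chunk_len:(c + 1) * chunk_len]
        chunk = pipe(
            prompt=prompt,
            first_frame=first_frame,             # first-frame sharing: every chunk sees the same anchor
            init_frame=prev_last,                # condition on the previous chunk's last frame
            masks=masks,                         # mask motion sequence steers the foreground object
            num_frames=chunk_len,
        )
        frames.extend(chunk)                     # append this chunk's frames to the video
        prev_last = chunk[-1]                    # extend autoregressively from the newest frame
    return frames
```

Conditioning every chunk on the same anchor frame is what keeps the foreground object's appearance from drifting as the video grows, while the per-chunk handoff keeps motion continuous across segment boundaries.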
Related papers
- RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism [73.38167494118746]
We propose a framework to improve the realism of motion in generated videos.
We advocate incorporating a retrieval mechanism during the generation phase.
Our pipeline is designed to apply to any text-to-video diffusion model.
arXiv Detail & Related papers (2025-04-09T08:14:05Z) - CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance [34.345125922868]
We propose CINEMA, a novel framework for coherent multi-subject video generation by leveraging a Multimodal Large Language Model (MLLM).
Our approach eliminates the need for explicit correspondences between subject images and text entities, mitigating ambiguity and reducing annotation effort.
Our framework can be conditioned on varying numbers of subjects, offering greater flexibility in personalized content creation.
arXiv Detail & Related papers (2025-03-13T14:07:58Z) - VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control [47.34885131252508]
Video inpainting aims to restore corrupted video content. We propose a novel dual-stream paradigm, VideoPainter, to process masked videos. We also introduce a novel target region ID resampling technique that enables any-length video inpainting.
arXiv Detail & Related papers (2025-03-07T17:59:46Z) - A Survey: Spatiotemporal Consistency in Video Generation [72.82267240482874]
Video generation, by leveraging dynamic visual generation methods, pushes the boundaries of Artificial Intelligence Generated Content (AIGC). Recent works have aimed at addressing the spatiotemporal consistency issue in video generation, yet few literature reviews have been organized from this perspective. We systematically review recent advances in video generation, covering five key aspects: foundation models, information representations, generation schemes, post-processing techniques, and evaluation metrics.
arXiv Detail & Related papers (2025-02-25T05:20:51Z) - VideoTetris: Towards Compositional Text-to-Video Generation [45.395598467837374]
VideoTetris is a framework that enables compositional T2V generation.
We show that VideoTetris achieves impressive qualitative and quantitative results in T2V generation.
arXiv Detail & Related papers (2024-06-06T17:25:33Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
Video sequences generated by our TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z) - VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning on Language-Video Foundation Models [43.46536102838717]
VideoDreamer is a novel framework for customized multi-subject text-to-video generation.
It can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects.
arXiv Detail & Related papers (2023-11-02T04:38:50Z) - Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z) - Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.