Layer-Aware Video Composition via Split-then-Merge
- URL: http://arxiv.org/abs/2511.20809v1
- Date: Tue, 25 Nov 2025 19:53:15 GMT
- Title: Layer-Aware Video Composition via Split-then-Merge
- Authors: Ozgur Kara, Yujia Chen, Ming-Hsuan Yang, James M. Rehg, Wen-Sheng Chu, Du Tran
- Abstract summary: Split-then-Merge (StM) is a framework designed to enhance control in generative video composition. StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes.
- Score: 55.12521724893102
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods relying on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that utilizes multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show StM outperforms SoTA methods on both quantitative benchmarks and in human- and VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io
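To make the split-then-merge idea concrete, below is a minimal, self-contained sketch of one training step in the spirit of the abstract: two unlabeled clips are split into foreground/background layers, a transformed foreground is composited onto the other clip's background, and the model is supervised with a reconstruction term plus an identity-preservation term on the foreground region. The frame-differencing splitter, the shift augmentation, the loss weighting, and all names here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a Split-then-Merge training step; none of this
# is the authors' code. Real splitting would use video segmentation or
# matting rather than frame differencing.
import torch
import torch.nn.functional as F

def split_layers(video: torch.Tensor):
    """Crude layer split via frame differencing.

    video: (T, C, H, W). Returns (foreground, alpha, background), where
    alpha is a soft motion mask in [0, 1].
    """
    motion = (video[1:] - video[:-1]).abs().mean(1, keepdim=True)
    motion = torch.cat([motion[:1], motion], dim=0)   # pad back to T frames
    alpha = (motion / (motion.amax() + 1e-6)).clamp(0, 1)
    return video * alpha, alpha, video * (1 - alpha)

def stm_training_step(model, clip_a, clip_b, shift=4):
    # Split: dynamic foreground from one clip, background from another.
    fg, alpha, _ = split_layers(clip_a)
    _, _, bg = split_layers(clip_b)

    # Transformation-aware augmentation (here just a spatial shift), so
    # the model must learn plausible placement rather than copy pixels.
    fg = torch.roll(fg, shift, dims=-1)
    alpha = torch.roll(alpha, shift, dims=-1)

    # Merge: alpha-composite the layers into a self-composed input.
    composed = alpha * fg + (1 - alpha) * bg

    out = model(composed)
    recon = F.mse_loss(out, composed)
    # Identity-preservation term: keep the foreground region faithful
    # to the original subject during blending.
    identity = F.mse_loss(out * alpha, fg * alpha)
    return recon + 0.1 * identity   # 0.1 is an arbitrary example weight

# Usage with dummy data: an identity "model" and two random 8-frame clips.
model = torch.nn.Identity()
loss = stm_training_step(model, torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64))
```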
Related papers
- BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration [56.98981194478512]
We propose a unified framework that handles a broad range of subject-to-video scenarios. We introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos.
arXiv Detail & Related papers (2025-10-01T02:41:11Z) - GenCompositor: Generative Video Compositing with Diffusion Transformer [68.00271033575736]
Traditional pipelines require intensive labor and expert collaboration, resulting in lengthy production cycles and high manpower costs. This new task strives to adaptively inject the identity and motion information of a foreground video into the target video in an interactive manner. Experiments demonstrate that our method effectively realizes generative video compositing, outperforming existing possible solutions in fidelity and consistency.
arXiv Detail & Related papers (2025-09-02T16:10:13Z) - OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions [77.04071342405055]
We develop an Image-Video Transfer Mixed (IVTM) training scheme with image editing data to enable instructive editing of the subject in the customized video. We also propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). Our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2025-06-29T18:43:00Z) - MAGREF: Masked Guidance for Any-Reference Video Generation with Subject Disentanglement [47.064467920954776]
We introduce MAGREF, a unified and effective framework for any-reference video generation. Our approach incorporates masked guidance and a subject disentanglement mechanism. Experiments on a comprehensive benchmark demonstrate that MAGREF consistently outperforms existing state-of-the-art approaches.
arXiv Detail & Related papers (2025-05-29T17:58:15Z) - Video Decomposition Prior: A Methodology to Decompose Videos into Layers [74.36790196133505]
This paper introduces a novel video decomposition prior (VDP) framework that derives inspiration from professional video editing practices. The VDP framework decomposes a video sequence into a set of multiple RGB layers with associated opacity levels (a minimal compositing sketch in this spirit appears after this list). We address tasks such as video object segmentation, dehazing, and relighting.
arXiv Detail & Related papers (2024-12-06T10:35:45Z) - Multi-entity Video Transformers for Fine-Grained Video Representation Learning [34.26732761916984]
We re-examine the design of transformer architectures for video representation learning. A key aspect of our approach is the improved sharing of scene information in the temporal pipeline. Our Multi-entity Video Transformer (MV-Former) processes the frames as groups of entities represented as tokens linked across time.
arXiv Detail & Related papers (2023-11-17T21:23:12Z) - WALDO: Future Video Synthesis using Object Layer Decomposition and
Parametric Flow Prediction [82.79642869586587]
WALDO is a novel approach to the prediction of future video frames from past ones.
Individual images are decomposed into multiple layers combining object masks and a small set of control points.
The layer structure is shared across all frames in each video to build dense inter-frame connections.
arXiv Detail & Related papers (2022-11-25T18:59:46Z)
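Several of the papers above, most directly the Video Decomposition Prior entry, model a video as a stack of RGB layers with per-pixel opacities. As referenced in that entry, here is a minimal sketch of the underlying operation: a frame is rebuilt by alpha-compositing the layers back-to-front with the standard "over" operator. The shapes and the two-layer example data are illustrative assumptions, not code from any of the listed papers.

```python
# Hypothetical layered-video compositing sketch; shapes are assumptions.
import torch

def composite(layers: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Composite RGB layers back-to-front with the "over" operator.

    layers: (L, C, H, W) RGB layers, index 0 = backmost.
    alphas: (L, 1, H, W) per-pixel opacities in [0, 1].
    """
    frame = layers[0]  # treat the backmost layer as fully opaque
    for rgb, a in zip(layers[1:], alphas[1:]):
        frame = a * rgb + (1 - a) * frame   # over operator
    return frame

# Two-layer example: a background plus one foreground layer.
layers = torch.rand(2, 3, 64, 64)
alphas = torch.rand(2, 1, 64, 64)
frame = composite(layers, alphas)   # (3, 64, 64)
```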
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences of its use.