LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation
- URL: http://arxiv.org/abs/2508.04228v1
- Date: Wed, 06 Aug 2025 09:03:16 GMT
- Title: LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation
- Authors: Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo, Guangtao Zhai, Xiaohong Liu
- Abstract summary: Controlling object motion trajectories in Text-to-Video (T2V) generation is a challenging and relatively under-explored area. We introduce LayerT2V, the first approach for generating video by compositing background and foreground objects layer by layer. Experiments demonstrate the superiority of LayerT2V in generating complex multi-object scenarios, showcasing 1.4x and 4.5x improvements in mIoU and AP50 metrics over state-of-the-art (SOTA) methods.
- Score: 33.26383352897258
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Controlling object motion trajectories in Text-to-Video (T2V) generation is a challenging and relatively under-explored area, particularly in scenarios involving multiple moving objects. Most community models and datasets in the T2V domain are designed for single-object motion, limiting the performance of current generative models in multi-object tasks. Additionally, existing motion control methods in T2V either lack support for multi-object motion scenes or experience severe performance degradation when object trajectories intersect, primarily due to the semantic conflicts in colliding regions. To address these limitations, we introduce LayerT2V, the first approach for generating video by compositing background and foreground objects layer by layer. This layered generation enables flexible integration of multiple independent elements within a video, positioning each element on a distinct "layer" and thus facilitating coherent multi-object synthesis while enhancing control over the generation process. Extensive experiments demonstrate the superiority of LayerT2V in generating complex multi-object scenarios, showcasing 1.4x and 4.5x improvements in mIoU and AP50 metrics over state-of-the-art (SOTA) methods. Project page and code are available at https://kr-panghu.github.io/LayerT2V/ .
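The abstract describes the layer-by-layer compositing idea but gives no pseudocode, so the following is only a minimal sketch of the compositing step, assuming each foreground element has already been generated as its own RGB clip with an alpha matte; the names `VideoLayer` and `compose_layers` are hypothetical and not from the paper.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class VideoLayer:
    """One independently generated element: RGB frames plus an alpha matte."""
    rgb: np.ndarray    # (T, H, W, 3), floats in [0, 1]
    alpha: np.ndarray  # (T, H, W, 1), floats in [0, 1]

def compose_layers(background: np.ndarray, layers: list[VideoLayer]) -> np.ndarray:
    """Alpha-composite foreground layers over the background, back to front.

    Because each object lives on its own layer, intersecting trajectories
    reduce to depth-ordered occlusion instead of a semantic conflict inside
    a single generation pass.
    """
    out = background.copy()  # (T, H, W, 3)
    for layer in layers:     # ordered back (index 0) to front
        out = layer.alpha * layer.rgb + (1.0 - layer.alpha) * out
    return out

# Toy usage: a 16-frame clip with two foreground layers.
T, H, W = 16, 64, 64
bg = np.zeros((T, H, W, 3))
fg = [VideoLayer(np.random.rand(T, H, W, 3), np.random.rand(T, H, W, 1))
      for _ in range(2)]
video = compose_layers(bg, fg)
print(video.shape)  # (16, 64, 64, 3)
```

The layering benefit is visible in the loop: when two trajectories intersect, the front layer simply occludes the back one, so no single generation pass has to resolve conflicting semantics in the colliding region.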
Related papers
- MOVi: Training-free Text-conditioned Multi-Object Video Generation [43.612899589093075]
We present a training-free approach for multi-object video generation that leverages the open-world knowledge of diffusion models and large language models (LLMs). We use an LLM as the "director" of object trajectories, and apply the trajectories through noise re-initialization to achieve precise control of realistic movements (see the sketch after this list). Experiments validate the effectiveness of our training-free approach in significantly enhancing the multi-object generation capabilities of existing video diffusion models.
arXiv Detail & Related papers (2025-05-29T01:41:10Z)
- Segment Any Motion in Videos [80.72424676419755]
We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support.
arXiv Detail & Related papers (2025-03-28T09:34:11Z)
- Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think [24.308538128761985]
Image-to-Video (I2V) generation aims to synthesize a video clip according to a given image and condition (e.g., text). The key challenge of this task lies in simultaneously generating natural motions while preserving the original appearance of the images. We propose a novel Extrapolating and Decoupling framework, which introduces model merging techniques to the I2V domain for the first time.
arXiv Detail & Related papers (2025-03-02T16:06:16Z)
- Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation [52.337472185022136]
We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. We propose a two-stage compositional framework that decomposes I2V generation into: (i) an explicit intermediate representation generation stage, followed by (ii) a video generation stage that is conditioned on this representation. We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art consistency.
arXiv Detail & Related papers (2025-01-06T14:49:26Z)
- InTraGen: Trajectory-controlled Video Generation for Object Interactions [100.79494904451246]
InTraGen is a pipeline for improved trajectory-based generation of object interaction scenarios.
Our results demonstrate improvements in both visual fidelity and quantitative performance.
arXiv Detail & Related papers (2024-11-25T14:27:50Z)
- Vivid-ZOO: Multi-View Video Generation with Diffusion Model [76.96449336578286]
New challenges lie in the lack of massive captioned multi-view videos and the complexity of modeling such multi-dimensional distribution.
We propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text.
arXiv Detail & Related papers (2024-06-12T21:44:04Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-Duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously, before the fusion and decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z)
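MOVi (listed above) controls trajectories through noise re-initialization, but its exact procedure is not given in the summary, so the sketch below illustrates the general idea only: one shared noise patch is pasted into each frame's initial latent at positions taken from a per-frame trajectory, correlating the starting noise along the path. The function name and parameters are hypothetical, not MOVi's API.

```python
import numpy as np

def reinit_noise_along_trajectory(
    noise: np.ndarray,                  # (T, C, H, W) initial latents, one per frame
    trajectory: list[tuple[int, int]],  # per-frame (y, x) patch centers, length T
    box: int = 8,                       # side length of the re-initialized patch
    seed: int = 0,
) -> np.ndarray:
    """Paste one shared noise patch at the trajectory position of every frame.

    Correlating the initial noise along the path is a simple way to bias a
    video diffusion model toward keeping an object on that path; MOVi's
    actual re-initialization is more involved, so treat this as a toy version.
    """
    T, C, H, W = noise.shape
    assert len(trajectory) == T
    rng = np.random.default_rng(seed)
    patch = rng.standard_normal((C, box, box))  # shared across all frames
    out = noise.copy()
    for t, (cy, cx) in enumerate(trajectory):
        y0 = int(np.clip(cy - box // 2, 0, H - box))
        x0 = int(np.clip(cx - box // 2, 0, W - box))
        out[t, :, y0:y0 + box, x0:x0 + box] = patch
    return out

# Toy usage: an object drifting left-to-right across 16 latent frames.
T, C, H, W = 16, 4, 32, 32
init = np.random.default_rng(1).standard_normal((T, C, H, W))
path = [(16, 2 + 2 * t) for t in range(T)]
latents = reinit_noise_along_trajectory(init, path, box=6)
```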