Related papers: TV-LiVE: Training-Free, Text-Guided Video Editing via Layer Informed Vitality Exploitation

URL: http://arxiv.org/abs/2506.07205v1
Date: Sun, 08 Jun 2025 16:12:13 GMT
Title: TV-LiVE: Training-Free, Text-Guided Video Editing via Layer Informed Vitality Exploitation
Authors: Min-Jung Kim, Dongjin Kim, Seokju Yun, Jaegul Choo
Abstract summary: We present TV-LiVE, a Training-free and text-guided Video editing framework via Layer-informed Vitality Exploitation. We empirically identify vital layers within the video generation model that significantly influence the quality of generated outputs. For object addition, we identify prominent layers to extract the mask regions corresponding to the newly added target prompt.
Score: 36.81368812919819
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Video editing has garnered increasing attention alongside the rapid progress of diffusion-based video generation models. As part of these advancements, there is a growing demand for more accessible and controllable forms of video editing, such as prompt-based editing. Previous studies have primarily focused on tasks such as style transfer, background replacement, object substitution, and attribute modification, while maintaining the content structure of the source video. However, more complex tasks, including the addition of novel objects and non-rigid transformations, remain relatively unexplored. In this paper, we present TV-LiVE, a Training-free and text-guided Video editing framework via Layer-informed Vitality Exploitation. We empirically identify vital layers within the video generation model that significantly influence the quality of generated outputs. Notably, these layers are closely associated with Rotary Position Embeddings (RoPE). Based on this observation, our method enables both object addition and non-rigid video editing by selectively injecting key and value features from the source model into the corresponding layers of the target model, guided by the layer vitality. For object addition, we further identify prominent layers to extract the mask regions corresponding to the newly added target prompt. We find that the masks extracted from the prominent layers faithfully indicate the region to be edited. Experimental results demonstrate that TV-LiVE outperforms existing approaches for both object addition and non-rigid video editing. Project Page: https://emjay73.github.io/TV_LiVE/
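Illustration (not from the paper): the key/value injection described in the abstract can be sketched as follows. This is a toy, single-head example under stated assumptions, not the authors' implementation. The names `attention`, `edit_layer`, and `vital_layers` are hypothetical, and how vital layers are actually selected, as well as the role of RoPE, is not modeled here.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Single-head scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v))
            for d in range(len(v[0]))
        ])
    return out


def edit_layer(layer_idx, target_qkv, source_kv, vital_layers):
    """Attention for one layer of the target (editing) pass.

    Assumption: in layers listed in `vital_layers`, the target's keys and
    values are replaced by those cached from the source pass; all other
    layers use the target's own keys and values.
    """
    q, k, v = target_qkv
    if layer_idx in vital_layers:
        k, v = source_kv  # inject source keys/values
    return attention(q, k, v)
```

In this sketch the target queries are always kept, so the edit prompt still decides where each token attends, while the injected source keys and values decide what content is available in the selected layers.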

Related papers

MotionV2V: Editing Motion in a Video [53.791975554391534]
We propose modifying video motion by editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a "motion edit". Our approach allows for edits that start at any timestamp and propagate naturally.
arXiv Detail & Related papers (2025-11-25T18:57:25Z)
GenCompositor: Generative Video Compositing with Diffusion Transformer [68.00271033575736]
Traditional pipelines require intensive labor efforts and expert collaboration, resulting in lengthy production cycles and high manpower costs. This new task strives to adaptively inject identity and motion information of foreground video to the target video in an interactive manner. Experiments demonstrate that our method effectively realizes generative video compositing, outperforming existing possible solutions in fidelity and consistency.
arXiv Detail & Related papers (2025-09-02T16:10:13Z)
VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation [67.31149310468801]
We introduce VEGGIE, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model.
arXiv Detail & Related papers (2025-03-18T15:31:12Z)
Video Decomposition Prior: A Methodology to Decompose Videos into Layers [74.36790196133505]
This paper introduces a novel video decomposition prior (VDP) framework that draws inspiration from professional video editing practices. The VDP framework decomposes a video sequence into a set of multiple RGB layers and associated opacity levels. We address tasks such as video object segmentation, dehazing, and relighting.
arXiv Detail & Related papers (2024-12-06T10:35:45Z)
RNA: Video Editing with ROI-based Neural Atlas [14.848279912686946]
We propose a novel region-of-interest (ROI)-based video editing framework: ROI-based Neural Atlas (RNA). Unlike prior work, RNA allows users to specify editing regions, simplifying the editing process by removing the need for foreground separation. We introduce a soft neural atlas model for video reconstruction to ensure high-quality editing results.
arXiv Detail & Related papers (2024-10-10T04:17:19Z)
Temporally Consistent Object Editing in Videos using Extended Attention [9.605596668263173]
We propose a method to edit videos using a pre-trained inpainting image diffusion model. We ensure that the edited information will be consistent across all the video frames.
arXiv Detail & Related papers (2024-06-01T02:31:16Z)
UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing [28.140945021777878]
We present UniEdit, a tuning-free framework that supports both video motion and appearance editing. To realize motion editing while preserving source video content, we introduce auxiliary motion-reference and reconstruction branches. The obtained features are then injected into the main editing path via temporal and spatial self-attention layers.
arXiv Detail & Related papers (2024-02-20T17:52:12Z)
MagicStick: Controllable Video Editing via Control Handle Transformations [49.29608051543133]
MagicStick is a controllable video editing method that edits video properties by applying transformations to the extracted internal control signals. We present experiments on numerous examples within our unified framework. We also compare with shape-aware text-based editing and handcrafted motion video generation, demonstrating superior temporal consistency and editing capability compared to previous works.
arXiv Detail & Related papers (2023-12-05T17:58:06Z)
VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models [96.55004961251889]
Video Instruction Diffusion (VIDiff) is a unified foundation model designed for a wide range of video tasks. Our model can edit and translate the desired results within seconds based on user instructions. We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-30T18:59:52Z)
LOVECon: Text-driven Training-Free Long Video Editing with ControlNet [9.762680144118061]
This paper aims to bridge the gap, establishing a simple and effective baseline for training-free diffusion model-based long video editing. We build the pipeline upon ControlNet, which excels at various image editing tasks based on text prompts. Our method manages to edit videos comprising hundreds of frames according to user requirements.
arXiv Detail & Related papers (2023-10-15T02:39:25Z)
Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing. It attains temporally consistent editing of input videos in a training-free manner. Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z)
StableVideo: Text-driven Consistency-aware Diffusion Video Editing [24.50933856309234]
Diffusion-based methods can generate realistic images and videos, but they struggle to edit existing objects in a video while preserving their appearance over time. This paper introduces temporal dependency to existing text-driven diffusion models, which allows them to generate consistent appearance for the edited objects. We build up a text-driven video editing framework based on this mechanism, namely StableVideo, which can achieve consistency-aware video editing.
arXiv Detail & Related papers (2023-08-18T14:39:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.