FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing
- URL: http://arxiv.org/abs/2601.01720v2
- Date: Tue, 06 Jan 2026 11:17:15 GMT
- Title: FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing
- Authors: Xijie Huang, Chengming Xu, Donghao Luo, Xiaobin Hu, Peng Tang, Xu Peng, Jiangning Zhang, Chengjie Wang, Yanwei Fu
- Abstract summary: We introduce FFP-300K, a new large-scale dataset of high-fidelity video pairs at 720p resolution and 81 frames in length. We propose a novel framework designed for true guidance-free FFP that resolves the tension between maintaining first-frame appearance and preserving source video motion.
- Score: 97.35186681023025
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: First-Frame Propagation (FFP) offers a promising paradigm for controllable video editing, but existing methods are hampered by a reliance on cumbersome run-time guidance. We identify the root cause of this limitation as the inadequacy of current training datasets, which are often too short, low-resolution, and lack the task diversity required to teach robust temporal priors. To address this foundational data gap, we first introduce FFP-300K, a new large-scale dataset comprising 300K high-fidelity video pairs at 720p resolution and 81 frames in length, constructed via a principled two-track pipeline for diverse local and global edits. Building on this dataset, we propose a novel framework designed for true guidance-free FFP that resolves the critical tension between maintaining first-frame appearance and preserving source video motion. Architecturally, we introduce Adaptive Spatio-Temporal RoPE (AST-RoPE), which dynamically remaps positional encodings to disentangle appearance and motion references. At the objective level, we employ a self-distillation strategy where an identity propagation task acts as a powerful regularizer, ensuring long-term temporal stability and preventing semantic drift. Comprehensive experiments on the EditVerseBench benchmark demonstrate that our method significantly outperforms existing academic and commercial models, improving PickScore by about 0.2 and VLM score by about 0.3 over these competitors.
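The abstract names AST-RoPE and the identity-propagation self-distillation objective without implementation detail, so the two Python sketches below are hedged readings, not the paper's actual method. The first assumes that "dynamically remapping positional encodings" means pinning the edited first frame to a reserved temporal index so it acts purely as an appearance anchor, while source-video tokens keep their true timestamps as the motion reference; all function names are hypothetical.

```python
# Minimal sketch of an AST-RoPE-style temporal index remapping. The concrete
# remapping scheme and every function name here are assumptions.
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard 1-D RoPE frequencies: one rotation angle per channel pair.
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return positions[..., None].float() * inv_freq  # shape (..., dim // 2)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    # Rotate interleaved channel pairs by the given angles (standard RoPE).
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

def ast_rope_temporal_indices(num_frames: int, pin_first_frame: bool) -> torch.Tensor:
    # Hypothetical remapping: pin the edited first frame to a reserved index (-1)
    # so it serves purely as an appearance anchor, while source-video tokens keep
    # their true timestamps and act as the motion reference.
    t = torch.arange(num_frames, dtype=torch.float32)
    if pin_first_frame:
        t[0] = -1.0  # detach the appearance reference from the motion timeline
    return t

# Illustrative usage: temporal RoPE over 81 frames with remapped indices.
T, dim = 81, 64
q = torch.randn(T, dim)
q_rot = apply_rope(q, rope_angles(ast_rope_temporal_indices(T, True), dim))
```

The second sketch illustrates one plausible form of the identity-propagation regularizer: with some probability the model is asked to propagate the source's own unedited first frame, so the correct output is the source video itself, discouraging semantic drift over long horizons. The `model` signature, the mixing probability, and the MSE objective are stand-ins for the paper's actual training loss.

```python
import torch
import torch.nn.functional as F

def training_step(model, src_video, edited_video, p_identity: float = 0.3):
    # With probability p_identity, switch to the identity-propagation task:
    # feed the source's own first frame and require the model to reproduce
    # the source video (a self-distillation regularizer, per the abstract).
    if torch.rand(()) < p_identity:
        first_frame, target = src_video[:, 0], src_video
    else:
        first_frame, target = edited_video[:, 0], edited_video
    pred = model(first_frame=first_frame, motion_reference=src_video)
    return F.mse_loss(pred, target)  # stand-in for the paper's actual objective
```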
Related papers
- UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models [54.564740558030245]
We present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. We also introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting.
arXiv Detail & Related papers (2026-02-26T12:54:46Z) - StableDPT: Temporal Stable Monocular Video Depth Estimation [14.453483279783908]
We propose a novel approach that adapts any state-of-the-art image-based depth estimation model for video processing. Our architecture builds upon an off-the-shelf Vision Transformer (ViT) encoder and enhances the Dense Prediction Transformer (DPT) head. Evaluations on multiple benchmark datasets demonstrate improved temporal consistency, competitive state-of-the-art performance, and 2x faster processing in real-world scenarios.
arXiv Detail & Related papers (2026-01-06T08:02:14Z) - Local2Global query Alignment for Video Instance Segmentation [6.422775545814375]
Video segmentation methods excel at handling long sequences and capturing gradual changes, making them ideal for real-world applications. This paper introduces Local2Global, an online framework for video instance segmentation that achieves state-of-the-art performance with a simple baseline, trained purely in an online fashion. We propose the L2G-aligner, a novel lightweight transformer decoder, to facilitate an early alignment between local and global queries.
arXiv Detail & Related papers (2025-07-27T04:04:01Z) - Semantic Frame Interpolation [66.81586538775366]
Traditional frame interpolation tasks primarily focus on scenarios with a small number of frames, no text control, and minimal differences between the first and last frames. Recent community developers have utilized large video models, represented by Wan, to endow them with frame-to-frame capabilities. We first propose a new practical Semantic Frame Interpolation (SFI) task with a formal academic definition, which covers both of the above settings and supports inference at multiple frame rates.
arXiv Detail & Related papers (2025-07-07T16:25:47Z) - Lightweight Multi-Frame Integration for Robust YOLO Object Detection in Videos [11.532574301455854]
We propose a highly effective strategy for multi-frame video object detection. Our method improves robustness, especially for lightweight models. We contribute the BOAT360 benchmark dataset to support future research in multi-frame video object detection in challenging real-world scenarios.
arXiv Detail & Related papers (2025-06-25T15:49:07Z) - Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition [6.168286187549952]
We propose a hybrid end-to-end framework that factorizes learning across three key concepts to reduce inference cost by 330x versus prior art. Experiments show that our method results in a lightweight architecture achieving state-of-the-art video recognition performance.
arXiv Detail & Related papers (2025-03-17T21:13:48Z) - Low-Light Video Enhancement via Spatial-Temporal Consistent Decomposition [52.89441679581216]
Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise. We present an innovative video decomposition strategy that incorporates view-independent and view-dependent components. Our framework consistently outperforms existing methods, establishing a new SOTA performance.
arXiv Detail & Related papers (2024-05-24T15:56:40Z) - Video Dynamics Prior: An Internal Learning Approach for Robust Video Enhancements [83.5820690348833]
We present a framework for low-level vision tasks that does not require any external training data corpus.
Our approach learns neural module weights by optimizing over the corrupted sequence at test time, leveraging spatio-temporal coherence and the internal statistics of the video.
arXiv Detail & Related papers (2023-12-13T01:57:11Z) - Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)