Related papers: VALA: Learning Latent Anchors for Training-Free and Temporally Consistent

VALA: Learning Latent Anchors for Training-Free and Temporally Consistent

URL: http://arxiv.org/abs/2510.22970v1
Date: Mon, 27 Oct 2025 03:44:11 GMT
Title: VALA: Learning Latent Anchors for Training-Free and Temporally Consistent
Authors: Zhangkai Wu, Xuhui Fan, Zhongyuan Xie, Kaize Shi, Longbing Cao,
Abstract summary: We propose VALA, a variational alignment module that adaptively selects key frames and compresses their latent features into semantic anchors for consistent video editing.<n>Our method can be fully integrated into training-free text-to-image based video editing models.
Score: 29.516179213427694
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in training-free video editing have enabled lightweight and precise cross-frame generation by leveraging pre-trained text-to-image diffusion models. However, existing methods often rely on heuristic frame selection to maintain temporal consistency during DDIM inversion, which introduces manual bias and reduces the scalability of end-to-end inference. In this paper, we propose~\textbf{VALA} (\textbf{V}ariational \textbf{A}lignment for \textbf{L}atent \textbf{A}nchors), a variational alignment module that adaptively selects key frames and compresses their latent features into semantic anchors for consistent video editing. To learn meaningful assignments, VALA propose a variational framework with a contrastive learning objective. Therefore, it can transform cross-frame latent representations into compressed latent anchors that preserve both content and temporal coherence. Our method can be fully integrated into training-free text-to-image based video editing models. Extensive experiments on real-world video editing benchmarks show that VALA achieves state-of-the-art performance in inversion fidelity, editing quality, and temporal consistency, while offering improved efficiency over prior methods.

Related papers

PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models [35.59605874012795]
PropFly is a training pipeline for propagation-based video editing.<n>PropFly relies on pre-trained video diffusion models (VDMs) instead of requiring off-the-shelf or precomputed paired video editing datasets.<n>Our pipeline enables an adapter attached to the pre-trained VDM to learn to propagate edits via Guidance-Modulated Flow Matching (GMFM) loss.
arXiv Detail & Related papers (2026-02-24T06:11:08Z)
Zero-Shot Video Translation and Editing with Frame Spatial-Temporal Correspondence [81.82643953694485]
We present FRESCO, which integrates intra-frame correspondence with inter-frame correspondence to formulate a more robust spatial-temporal constraint.<n>Our method goes beyond attention guidance to explicitly optimize features, achieving high spatial-temporal consistency with the input video.<n>We verify FRESCO adaptations on two zero-shot tasks of video-to-video translation and text-guided video editing.
arXiv Detail & Related papers (2025-12-03T15:51:11Z)
Low-Cost Test-Time Adaptation for Robust Video Editing [4.707015344498921]
Video editing is a critical component of content creation that transforms raw footage into coherent works aligned with specific visual and narrative objectives.<n>Existing approaches face two major challenges: temporal inconsistencies due to failure in capturing complex motion patterns, and overfitting to simple prompts arising from limitations in UNet backbone architectures.<n>We present Vid-TTA, a lightweight test-time adaptation framework that personalizes optimization for each test video during inference through self-supervised auxiliary tasks.
arXiv Detail & Related papers (2025-07-29T14:31:17Z)
STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing [35.50656689789427]
STR-Match is a training-free video editing system that produces visually appealing and coherent videos.<n> STR-Match consistently outperforms existing methods in both visual quality andtemporal consistency.
arXiv Detail & Related papers (2025-06-28T12:36:19Z)
FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing [2.7248421583285265]
FlowDirector is a novel inversion-free video editing framework.<n>Our framework models the editing process as a direct evolution in data space.<n>To achieve localized and controllable edits, we introduce an attention-guided masking mechanism.
arXiv Detail & Related papers (2025-06-05T13:54:40Z)
Motion-Aware Concept Alignment for Consistent Video Editing [57.08108545219043]
We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a training-free framework bridging the gap between image-domain semantic mixing and video.<n>Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object within the video.<n>We evaluate MoCA's performance using the standard SSIM, image-level LPIPS, temporal LPIPS, and introduce a novel metric CASS (Conceptual Alignment Shift Score) to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames
arXiv Detail & Related papers (2025-06-01T13:28:04Z)
Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing [60.730661748555214]
We introduce textbfTask-textbfOriented textbfDiffusion textbfInversion (textbfTODInv), a novel framework that inverts and edits real images tailored to specific editing tasks. ToDInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability.
arXiv Detail & Related papers (2024-08-23T22:16:34Z)
COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video.<n>We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing.<n>COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z)
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation [85.29772293776395]
We introduce FRESCO, intra-frame correspondence alongside inter-frame correspondence to establish a more robust spatial-temporal constraint. This enhancement ensures a more consistent transformation of semantically similar content across frames. Our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video.
arXiv Detail & Related papers (2024-03-19T17:59:18Z)
Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing. It attains temporally consistent editing of input videos in a training-free manner. Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z)
RIGID: Recurrent GAN Inversion and Editing of Real Face Videos [73.97520691413006]
GAN inversion is indispensable for applying the powerful editability of GAN to real images. Existing methods invert video frames individually often leading to undesired inconsistent results over time. We propose a unified recurrent framework, named textbfRecurrent vtextbfIdeo textbfGAN textbfInversion and etextbfDiting (RIGID) Our framework learns the inherent coherence between input frames in an end-to-end manner.
arXiv Detail & Related papers (2023-08-11T12:17:24Z)
Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video. Our motivation comes from that the temporal boundary of the query-guided activity should be consistently predicted. In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.