Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models
- URL: http://arxiv.org/abs/2506.09229v2
- Date: Wed, 25 Jun 2025 15:46:16 GMT
- Title: Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models
- Authors: Sungwon Hwang, Hyojin Jang, Kinam Kim, Minho Park, Jaegul Choo
- Abstract summary: Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges. Recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models. We introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns the hidden states of a frame with external features from neighboring frames.
- Score: 31.138079872368532
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges, yet remains underexplored despite its practical importance. Meanwhile, recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models by aligning, or assimilating, their internal hidden states with external pretrained visual features, suggesting its potential for VDM fine-tuning. In this work, we first propose a straightforward adaptation of REPA for VDMs and empirically show that, while effective for convergence, it is suboptimal in preserving semantic consistency across frames. To address this limitation, we introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns hidden states of a frame with external features from neighboring frames. Empirical evaluations on large-scale VDMs, including CogVideoX-5B and Hunyuan Video, demonstrate that CREPA improves both visual fidelity and cross-frame semantic coherence when fine-tuned with parameter-efficient methods such as LoRA. We further validate CREPA across diverse datasets with varying attributes, confirming its broad applicability.
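The abstract describes CREPA's mechanism only at a high level: the hidden states of each frame are aligned with external pretrained features of its neighboring frames. The sketch below shows one plausible form of such a regularizer, a patch-wise cosine-similarity loss against frozen features (e.g., DINOv2) at ±1 frame offsets; the tensor shapes, the `crepa_loss` name, the projection head, and the offset choice are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a CREPA-style cross-frame alignment loss (assumed shapes):
#   hidden:   [B, T, N, D]      hidden states from an intermediate DiT block
#   external: [B, T, N, D_ext]  frozen pretrained per-frame features (e.g., DINOv2)
#   proj:     maps D -> D_ext, e.g. a small MLP trained alongside the LoRA weights
import torch
import torch.nn.functional as F

def crepa_loss(hidden: torch.Tensor,
               external: torch.Tensor,
               proj: torch.nn.Module,
               offsets=(-1, 1)) -> torch.Tensor:
    """Align each frame's projected hidden states with the external
    features of its neighboring frames via patch-wise cosine similarity."""
    B, T, N, _ = hidden.shape
    z = proj(hidden)                          # [B, T, N, D_ext]
    loss = hidden.new_zeros(())
    for dt in offsets:
        src = torch.arange(T, device=hidden.device)
        tgt = (src + dt).clamp(0, T - 1)      # clamp at the video boundaries
        sim = F.cosine_similarity(z[:, src], external[:, tgt], dim=-1)
        loss = loss + (1.0 - sim).mean()      # push patch similarity toward 1
    return loss / len(offsets)
```

During parameter-efficient fine-tuning, such a term would be added to the usual denoising objective with a weighting coefficient, e.g. `total_loss = denoise_loss + lam * crepa_loss(hidden, dino_feats, proj)`, so that the alignment regularizes the LoRA updates rather than replacing the diffusion loss.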
Related papers
- DiffVQA: Video Quality Assessment Using Diffusion Feature Extractor [22.35724335601674]
Video Quality Assessment (VQA) aims to evaluate video quality based on perceptual distortions and human preferences. We introduce a novel VQA framework, DiffVQA, which harnesses the robust generalization capabilities of diffusion models pre-trained on extensive datasets.
arXiv Detail & Related papers (2025-05-06T07:42:24Z) - Saliency-Motion Guided Trunk-Collateral Network for Unsupervised Video Object Segmentation [8.912201177914858]
We propose the Saliency-Motion guided Trunk-Collateral Network (SMTC-Net), a novel trunk-collateral structure for motion-appearance unsupervised video object segmentation (UVOS). SMTC-Net achieves state-of-the-art performance on three UVOS datasets.
arXiv Detail & Related papers (2025-04-08T11:02:14Z) - Temporal-Consistent Video Restoration with Pre-trained Diffusion Models [51.47188802535954]
Video restoration (VR) aims to recover high-quality videos from degraded ones. Recent zero-shot VR methods using pre-trained diffusion models (DMs) suffer from approximation errors during reverse diffusion and insufficient temporal consistency. We present a novel maximum a posteriori (MAP) framework that directly parameterizes video frames in the seed space of DMs, eliminating approximation errors.
arXiv Detail & Related papers (2025-03-19T03:41:56Z) - RepVideo: Rethinking Cross-Layer Representation for Video Generation [53.701548524818534]
We propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information. Our experiments demonstrate that RepVideo not only significantly enhances the ability to generate accurate spatial appearances, but also improves temporal consistency in video generation.
arXiv Detail & Related papers (2025-01-15T18:20:37Z) - Multi-Modality Driven LoRA for Adverse Condition Depth Estimation [61.525312117638116]
We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation. It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL). It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
arXiv Detail & Related papers (2024-12-28T14:23:58Z) - Low-Light Video Enhancement via Spatial-Temporal Consistent Decomposition [52.89441679581216]
Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise. We present an innovative video decomposition strategy that incorporates view-independent and view-dependent components. Our framework consistently outperforms existing methods, establishing new state-of-the-art performance.
arXiv Detail & Related papers (2024-05-24T15:56:40Z) - Event-based Video Frame Interpolation with Edge Guided Motion Refinement [28.331148083668857]
We introduce an end-to-end E-VFI learning method to efficiently utilize edge features from event signals for motion flow and warping enhancement.
Our method incorporates an Edge Guided Attentive (EGA) module, which rectifies estimated video motion through attentive aggregation.
Experiments on both synthetic and real datasets show the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-04-28T12:13:34Z) - Motion-aware Latent Diffusion Models for Video Frame Interpolation [51.78737270917301]
Motion estimation between neighboring frames plays a crucial role in avoiding motion ambiguity.
We propose a novel diffusion framework, motion-aware latent diffusion models (MADiff).
Our method achieves state-of-the-art performance, significantly outperforming existing approaches.
arXiv Detail & Related papers (2024-04-21T05:09:56Z) - FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation [85.29772293776395]
We introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint.
This enhancement ensures a more consistent transformation of semantically similar content across frames.
Our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video.
arXiv Detail & Related papers (2024-03-19T17:59:18Z) - MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance [10.457759140533168]
This study introduces an efficient and effective method, MeDM, that utilizes pre-trained image Diffusion Models for video-to-video translation with consistent temporal flow.
We employ explicit optical flows to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores.
arXiv Detail & Related papers (2023-08-19T17:59:12Z) - Flow-Guided Sparse Transformer for Video Deblurring [124.11022871999423]
Flow-Guided Sparse Transformer (FGST) is a framework for video deblurring.
FGSW-MSA enjoys the guidance of the estimated optical flow to globally sample spatially sparse elements corresponding to the same scene patch in neighboring frames.
Our proposed FGST outperforms state-of-the-art methods on both DVD and GOPRO datasets and even yields more visually pleasing results in real video deblurring.
arXiv Detail & Related papers (2022-01-06T02:05:32Z)