DAPE: Dual-Stage Parameter-Efficient Fine-Tuning for Consistent Video Editing with Diffusion Models
- URL: http://arxiv.org/abs/2505.07057v1
- Date: Sun, 11 May 2025 17:08:50 GMT
- Title: DAPE: Dual-Stage Parameter-Efficient Fine-Tuning for Consistent Video Editing with Diffusion Models
- Authors: Junhao Xia, Chaoyang Zhang, Yecheng Zhang, Chengyang Zhou, Zhichang Wang, Bochun Liu, Dongshuo Yin
- Abstract summary: We propose DAPE, a high-quality yet cost-effective two-stage parameter-efficient fine-tuning framework for video editing. In the first stage, we design an efficient norm-tuning method to enhance temporal consistency in generated videos. The second stage introduces a vision-friendly adapter to improve visual quality.
- Score: 1.972901110298768
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video generation based on diffusion models presents a challenging multimodal task, with video editing emerging as a pivotal direction in this field. Recent video editing approaches primarily fall into two categories: training-required and training-free methods. While training-based methods incur high computational costs, training-free alternatives often yield suboptimal performance. To address these limitations, we propose DAPE, a high-quality yet cost-effective two-stage parameter-efficient fine-tuning (PEFT) framework for video editing. In the first stage, we design an efficient norm-tuning method to enhance temporal consistency in generated videos. The second stage introduces a vision-friendly adapter to improve visual quality. Additionally, we identify critical shortcomings in existing benchmarks, including limited category diversity, imbalanced object distribution, and inconsistent frame counts. To mitigate these issues, we curate a large dataset benchmark comprising 232 videos with rich annotations and 6 editing prompts, enabling objective and comprehensive evaluation of advanced methods. Extensive experiments on existing datasets (BalanceCC, LOVEU-TGVE, RAVE) and our proposed benchmark demonstrate that DAPE significantly improves temporal coherence and text-video alignment while outperforming previous state-of-the-art approaches.
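The abstract describes the two stages only at a high level. The following PyTorch sketch is a hypothetical illustration of that idea (norm-tuning a frozen backbone, then training a small residual adapter), not the authors' implementation; the `Adapter` design, the targeted `LayerNorm`/`GroupNorm` layers, the toy backbone, and all hyperparameters are assumptions made for clarity.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Residual bottleneck adapter (stage 2, hypothetical design).
    Zero-initialized output so the frozen backbone is unchanged at the start."""

    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))


def norm_tuning_params(model):
    """Stage 1 (sketch): freeze the whole backbone, then unfreeze only the
    affine parameters of normalization layers and return them for the optimizer."""
    for p in model.parameters():
        p.requires_grad_(False)
    params = []
    for m in model.modules():
        if isinstance(m, (nn.LayerNorm, nn.GroupNorm)):
            for p in m.parameters():
                p.requires_grad_(True)
                params.append(p)
    return params


# Usage sketch with a toy stand-in for the diffusion backbone:
backbone = nn.Sequential(nn.Linear(320, 320), nn.GroupNorm(32, 320), nn.Linear(320, 320))
stage1_opt = torch.optim.AdamW(norm_tuning_params(backbone), lr=1e-4)  # stage 1: norm layers only
adapter = Adapter(dim=320)                                             # stage 2: adapter module
stage2_opt = torch.optim.AdamW(adapter.parameters(), lr=1e-4)          # train only the adapter
```

In a real setup, both stages would presumably be trained with the standard diffusion denoising objective, with gradients flowing only into the small parameter sets returned above.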
Related papers
- Low-Cost Test-Time Adaptation for Robust Video Editing [4.707015344498921]
Video editing is a critical component of content creation that transforms raw footage into coherent works aligned with specific visual and narrative objectives. Existing approaches face two major challenges: temporal inconsistencies due to failure in capturing complex motion patterns, and overfitting to simple prompts arising from limitations in UNet backbone architectures. We present Vid-TTA, a lightweight test-time adaptation framework that personalizes optimization for each test video during inference through self-supervised auxiliary tasks.
arXiv Detail & Related papers (2025-07-29T14:31:17Z)
- FADE: Frequency-Aware Diffusion Model Factorization for Video Editing [34.887298437323295]
FADE is a training-free yet highly effective video editing approach. We propose a factorization strategy to optimize each component's specialized role. Experiments on real-world videos demonstrate that our method consistently delivers high-quality, realistic and temporally coherent editing results.
arXiv Detail & Related papers (2025-06-06T10:00:39Z)
- PMQ-VE: Progressive Multi-Frame Quantization for Video Enhancement [83.89668902758243]
Multi-frame video enhancement tasks aim to improve the spatial and temporal resolution and quality of video sequences. We propose Progressive Multi-Frame Quantization for Video Enhancement (PMQ-VE). This framework features a coarse-to-fine two-stage process: Backtracking-based Multi-Frame Quantization (BMFQ) and Progressive Multi-Teacher Distillation (PMTD).
arXiv Detail & Related papers (2025-05-18T07:10:40Z)
- DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos. These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z)
- Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM [54.2320450886902]
Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs. Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware. We introduce Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to specific video diffusion models.
arXiv Detail & Related papers (2024-12-19T18:32:21Z)
- Repurposing Pre-trained Video Diffusion Models for Event-based Video Interpolation [20.689304579898728]
Event-based Video Frame Interpolation (EVFI) uses sparse, high-temporal-resolution event measurements as motion guidance. We adapt pre-trained video diffusion models trained on internet-scale datasets to EVFI. Our method outperforms existing approaches and generalizes far better across cameras.
arXiv Detail & Related papers (2024-12-10T18:55:30Z)
- VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide [48.22321420680046]
VideoGuide is a novel framework that enhances the temporal consistency of pretrained text-to-video (T2V) models. It improves temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process. The proposed method brings about significant improvement in temporal consistency and image fidelity.
arXiv Detail & Related papers (2024-10-06T05:46:17Z)
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video. We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing. COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z)
- Zero-Shot Video Editing through Adaptive Sliding Score Distillation [51.57440923362033]
This study proposes a novel paradigm of video-based score distillation, facilitating direct manipulation of original video content.
We propose an Adaptive Sliding Score Distillation strategy, which incorporates both global and local video guidance to reduce the impact of editing errors.
arXiv Detail & Related papers (2024-06-07T12:33:59Z)
- InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.