PMQ-VE: Progressive Multi-Frame Quantization for Video Enhancement
- URL: http://arxiv.org/abs/2505.12266v2
- Date: Sat, 24 May 2025 07:31:10 GMT
- Title: PMQ-VE: Progressive Multi-Frame Quantization for Video Enhancement
- Authors: ZhanFeng Feng, Long Peng, Xin Di, Yong Guo, Wenbo Li, Yulun Zhang, Renjing Pei, Yang Wang, Yang Cao, Zheng-Jun Zha
- Abstract summary: Multi-frame video enhancement tasks aim to improve the spatial and temporal resolution and quality of video sequences. We propose Progressive Multi-Frame Quantization for Video Enhancement (PMQ-VE). This framework features a coarse-to-fine two-stage process: Backtracking-based Multi-Frame Quantization (BMFQ) and Progressive Multi-Teacher Distillation (PMTD).
- Score: 83.89668902758243
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-frame video enhancement tasks aim to improve the spatial and temporal resolution and quality of video sequences by leveraging temporal information from multiple frames, which are widely used in streaming video processing, surveillance, and generation. Although numerous Transformer-based enhancement methods have achieved impressive performance, their computational and memory demands hinder deployment on edge devices. Quantization offers a practical solution by reducing the bit-width of weights and activations to improve efficiency. However, directly applying existing quantization methods to video enhancement tasks often leads to significant performance degradation and loss of fine details. This stems from two limitations: (a) inability to allocate varying representational capacity across frames, which results in suboptimal dynamic range adaptation; (b) over-reliance on full-precision teachers, which limits the learning of low-bit student models. To tackle these challenges, we propose a novel quantization method for video enhancement: Progressive Multi-Frame Quantization for Video Enhancement (PMQ-VE). This framework features a coarse-to-fine two-stage process: Backtracking-based Multi-Frame Quantization (BMFQ) and Progressive Multi-Teacher Distillation (PMTD). BMFQ utilizes a percentile-based initialization and iterative search with pruning and backtracking for robust clipping bounds. PMTD employs a progressive distillation strategy with both full-precision and multiple high-bit (INT) teachers to enhance low-bit models' capacity and quality. Extensive experiments demonstrate that our method outperforms existing approaches, achieving state-of-the-art performance across multiple tasks and benchmarks. The code will be made publicly available at: https://github.com/xiaoBIGfeng/PMQ-VE.
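As a rough illustration of the BMFQ idea described in the abstract (percentile-based initialization of clipping bounds followed by an iterative search that retains the best bounds found), here is a minimal NumPy sketch. The function names, the shrinking candidate schedule, and the MSE criterion are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def fake_quantize(x, lo, hi, n_bits=4):
    """Uniform asymmetric fake-quantization of x to n_bits within [lo, hi]."""
    levels = 2 ** n_bits - 1
    scale = max((hi - lo) / levels, 1e-8)
    zero = np.round(-lo / scale)
    q = np.clip(np.round(x / scale) + zero, 0, levels)
    return (q - zero) * scale

def search_clipping_bounds(x, n_bits=4, percentile=99.9, shrink=0.95, steps=20):
    """Percentile-initialized search for clipping bounds that minimize
    quantization MSE; the best bounds seen so far are always kept, a crude
    stand-in for the paper's pruning-and-backtracking search."""
    lo, hi = np.percentile(x, 100.0 - percentile), np.percentile(x, percentile)
    best = (lo, hi, np.mean((x - fake_quantize(x, lo, hi, n_bits)) ** 2))
    for _ in range(steps):
        lo, hi = lo * shrink, hi * shrink          # candidate tighter bounds
        err = np.mean((x - fake_quantize(x, lo, hi, n_bits)) ** 2)
        if err < best[2]:
            best = (lo, hi, err)                   # keep the better candidate
    return best[0], best[1]

# Per-frame bounds: each frame of a clip gets its own clipping range,
# reflecting the abstract's point about frame-wise dynamic-range adaptation.
frames = [np.random.randn(64, 64).astype(np.float32) for _ in range(5)]
per_frame_bounds = [search_clipping_bounds(f) for f in frames]
```

The PMTD stage, as described, distills the low-bit student from a full-precision teacher together with several higher-bit (INT) teachers. A hedged sketch of what such a multi-teacher objective could look like follows; the teacher set, weights, and plain MSE terms are assumptions for illustration only.

```python
def multi_teacher_distill_loss(student_out, teacher_outs, weights):
    """Weighted sum of MSE terms against each teacher's output, e.g. a
    full-precision teacher plus INT8/INT6 intermediates (weights assumed)."""
    return sum(w * float(np.mean((student_out - t) ** 2))
               for w, t in zip(weights, teacher_outs))
```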
Related papers
- Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration [21.69452489173625]
"Less is more" phenomenon where excessive frames can paradoxically degrade performance due to context dilution.<n>"Visual echoes" yield significant temporal redundancy, which we term 'visual echoes'<n>"AFP" employs an adaptive hierarchical clustering algorithm on a fused ResNet-50 and CLIP feature space to identify and merge these echoes into single representatives.<n>Our full approach demonstrates a drastic reduction in required frames by up to 86.9% and total input tokens by up to 83.2%.
arXiv Detail & Related papers (2025-08-05T11:31:55Z) - Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers [31.95947876513405]
We present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token-aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. Our W3A6 Q-VDiT achieves a scene consistency of 23.40, setting a new benchmark and outperforming current state-of-the-art quantization methods by 1.9x.
arXiv Detail & Related papers (2025-05-28T09:33:52Z) - DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos. These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation [54.30327187663316]
DiTCtrl is the first training-free multi-prompt video generation method built on MM-DiT architectures. We analyze MM-DiT's attention mechanism, finding that its 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models. Based on this careful design, videos generated by DiTCtrl achieve smooth transitions and consistent object motion given multiple sequential prompts.
arXiv Detail & Related papers (2024-12-24T18:51:19Z) - VidTok: A Versatile and Open-Source Video Tokenizer [24.018360305535307]
VidTok is a versatile video tokenizer that delivers state-of-the-art performance in both continuous and discrete tokenizations. By integrating these advancements, VidTok achieves substantial improvements over existing methods.
arXiv Detail & Related papers (2024-12-17T16:27:11Z) - Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z) - QVD: Post-training Quantization for Video Diffusion Models [33.13078954859106]
Post-training quantization (PTQ) is an effective technique to reduce memory footprint and improve computational efficiency.
We introduce the first PTQ strategy tailored for video diffusion models, dubbed QVD.
We achieve near-lossless performance under W8A8, outperforming current methods by 205.12 in FVD.
arXiv Detail & Related papers (2024-07-16T10:47:27Z) - CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning. We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and a modality-sequential training strategy. We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z)