Related papers: DiffVQA: Video Quality Assessment Using Diffusion Feature Extractor

URL: http://arxiv.org/abs/2505.03261v1
Date: Tue, 06 May 2025 07:42:24 GMT
Title: DiffVQA: Video Quality Assessment Using Diffusion Feature Extractor
Authors: Wei-Ting Chen, Yu-Jiet Vong, Yi-Tsung Lee, Sy-Yen Kuo, Qiang Gao, Sizhuo Ma, Jian Wang
Abstract summary: Video Quality Assessment (VQA) aims to evaluate video quality based on perceptual distortions and human preferences. We introduce a novel VQA framework, DiffVQA, which harnesses the robust generalization capabilities of diffusion models pre-trained on extensive datasets.
Score: 22.35724335601674
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video Quality Assessment (VQA) aims to evaluate video quality based on perceptual distortions and human preferences. Despite the promising performance of existing methods using Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), they often struggle to align closely with human perceptions, particularly in diverse real-world scenarios. This challenge is exacerbated by the limited scale and diversity of available datasets. To address this limitation, we introduce a novel VQA framework, DiffVQA, which harnesses the robust generalization capabilities of diffusion models pre-trained on extensive datasets. Our framework adapts these models to reconstruct identical input frames through a control module. The adapted diffusion model is then used to extract semantic and distortion features from a resizing branch and a cropping branch, respectively. To enhance the model's ability to handle long-term temporal dynamics, a parallel Mamba module is introduced, which extracts temporal coherence augmented features that are merged with the diffusion features to predict the final score. Experiments across multiple datasets demonstrate DiffVQA's superior performance on intra-dataset evaluations and its exceptional generalization across datasets. These results confirm that leveraging a diffusion model as a feature extractor can offer enhanced VQA performance compared to CNN and ViT backbones.
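
The abstract's two input branches can be illustrated with a minimal sketch. It assumes grayscale frames stored as nested lists, nearest-neighbour resizing, and a centre crop; the function names (`resize_nearest`, `center_crop`, `two_branch_inputs`) are hypothetical and are not taken from the paper, which does not specify these details here. The resized view keeps the whole scene at reduced resolution (semantic content), while the cropped view keeps native-resolution pixels (distortion detail).

```python
from typing import List, Tuple

Frame = List[List[int]]  # one grayscale frame as rows of pixel values


def resize_nearest(frame: Frame, out_h: int, out_w: int) -> Frame:
    """Downsample the whole frame with nearest-neighbour sampling (assumed method)."""
    in_h, in_w = len(frame), len(frame[0])
    return [
        [frame[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def center_crop(frame: Frame, out_h: int, out_w: int) -> Frame:
    """Cut a native-resolution patch from the middle of the frame (assumed policy)."""
    in_h, in_w = len(frame), len(frame[0])
    top = (in_h - out_h) // 2
    left = (in_w - out_w) // 2
    return [row[left:left + out_w] for row in frame[top:top + out_h]]


def two_branch_inputs(frame: Frame, size: int) -> Tuple[Frame, Frame]:
    """Return (resized view, cropped view) to feed a shared feature extractor."""
    return resize_nearest(frame, size, size), center_crop(frame, size, size)


# Toy 4x4 frame whose pixel value encodes its position (row * 4 + column).
frame = [[r * 4 + c for c in range(4)] for r in range(4)]
resized, cropped = two_branch_inputs(frame, 2)
print(resized)  # [[0, 2], [8, 10]]
print(cropped)  # [[5, 6], [9, 10]]
```

In this toy case the resized view samples pixels from across the full frame, whereas the cropped view returns adjacent pixels untouched, which is the trade-off the two branches are described as exploiting.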

Related papers

Q-Save: Towards Scoring and Attribution for Generated Video Evaluation [65.83319736145869]
We present Q-Save, a new benchmark dataset and model for holistic evaluation of AI-generated video (AIGV) quality. The dataset contains nearly 10,000 videos, each annotated with a scalar mean opinion score (MOS) and fine-grained attribution labels. We propose a unified evaluation model that jointly performs quality scoring and attribution-based explanation.
arXiv Detail & Related papers (2025-11-24T07:00:21Z)
CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video [9.172799792564009]
We propose CAMP-VQA, a novel NR-VQA framework that exploits the semantic understanding capabilities of large models. Our approach introduces a quality-aware video metadata mechanism that integrates key fragments extracted from inter-frame variations. Our model consistently outperforms existing NR-VQA methods, achieving improved accuracy without the need for costly manual fine-grained annotations.
arXiv Detail & Related papers (2025-11-10T16:37:47Z)
Spiking Variational Graph Representation Inference for Video Summarization [37.324654104567436]
We propose a Spiking Variational Graph (SpiVG) Network, which enhances information density and reduces computational complexity. First, we design an extractor based on Spiking Neural Networks (SNNs), leveraging the event-driven mechanism of SNNs to learn autonomously. We present a Variational Inference Reconstruction Module to address uncertainty and noise arising during multi-channel feature fusion.
arXiv Detail & Related papers (2025-08-21T09:25:42Z)
Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models [31.138079872368532]
Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges. Recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models. We introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns hidden states of a frame with external features from neighboring frames.
arXiv Detail & Related papers (2025-06-10T20:34:47Z)
Breaking Annotation Barriers: Generalized Video Quality Assessment via Ranking-based Self-Supervision [49.46606936180063]
Video quality assessment (VQA) is essential for quantifying quality in various video processing systems. We introduce a self-supervised learning framework for VQA to learn quality assessment capabilities from large-scale, unlabeled web videos. By training on a dataset $10\times$ larger than the existing VQA benchmarks, our model achieves zero-shot performance.
arXiv Detail & Related papers (2025-05-06T15:29:32Z)
DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment [17.85550556489256]
This paper proposes Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment (DVLTA-VQA). A Video-Based Temporal CLIP module is proposed to explicitly model temporal dynamics and enhance motion perception, aligning with the dorsal stream. A Temporal Context Module is developed to refine inter-frame dependencies, further improving motion modeling. Finally, a text-guided adaptive fusion strategy is proposed to enable more effective integration of spatial and temporal information.
arXiv Detail & Related papers (2025-04-16T03:20:28Z)
PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild [27.195339506769457]
Video quality assessment (VQA) is a challenging problem due to the numerous factors that can affect the perceptual quality of a video. Annotating the Mean opinion score (MOS) for videos is expensive and time-consuming, which limits the scale of VQA datasets. We propose a VQA method named PTM-VQA, which leverages PreTrained Models to transfer knowledge from models pretrained on various pre-tasks.
arXiv Detail & Related papers (2024-05-28T02:37:29Z)
Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism. The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders. Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
Neighbourhood Representative Sampling for Efficient End-to-end Video Quality Assessment [60.57703721744873]
The increased resolution of real-world videos presents a dilemma between efficiency and accuracy for deep Video Quality Assessment (VQA). In this work, we propose a unified scheme, spatial-temporal grid mini-cube sampling (St-GMS), to get a novel type of sample, named fragments. With fragments and FANet, the proposed efficient end-to-end FAST-VQA and FasterVQA achieve significantly better performance than existing approaches on all VQA benchmarks.
arXiv Detail & Related papers (2022-10-11T11:38:07Z)
CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms. Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner. Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
DisCoVQA: Temporal Distortion-Content Transformers for Video Quality Assessment [56.42140467085586]
Some temporal variations cause temporal distortions and lead to additional quality degradation. The human visual system often pays different amounts of attention to frames with different content. We propose a novel and effective transformer-based VQA method to tackle these two issues.
arXiv Detail & Related papers (2022-06-20T15:31:27Z)
PeQuENet: Perceptual Quality Enhancement of Compressed Video with Adaptation- and Attention-based Network [27.375830262287163]
We propose a generative adversarial network (GAN) framework to enhance the perceptual quality of compressed videos. Our framework includes attention and adaptation to different quantization parameters (QPs) in a single model. Experimental results demonstrate the superior performance of the proposed PeQuENet compared with the state-of-the-art compressed video quality enhancement algorithms.
arXiv Detail & Related papers (2022-06-16T02:49:28Z)
Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models. Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings. We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.