Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation
- URL: http://arxiv.org/abs/2508.06092v2
- Date: Thu, 09 Oct 2025 11:58:11 GMT
- Title: Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation
- Authors: Yachun Mi, Yu Li, Yanting Li, Chen Hui, Tong Zhang, Zhixuan Li, Chenyue Song, Wei Yang Bryan Lim, Shaohui Liu
- Abstract summary: Vision-Language Models (VLMs) have shown remarkable generalization capabilities across a wide range of visual tasks. We propose Q-CLIP, the first fully VLM-based framework for Video Quality Assessment (VQA).
- Score: 33.51239538610773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate and efficient Video Quality Assessment (VQA) has long been a key research challenge. Current mainstream VQA methods typically improve performance by pretraining on large-scale classification datasets (e.g., ImageNet, Kinetics-400), followed by fine-tuning on VQA datasets. However, this strategy presents two significant challenges: (1) merely transferring semantic knowledge learned from pretraining is insufficient for VQA, as video quality depends on multiple factors (e.g., semantics, distortion, motion, aesthetics); (2) pretraining on large-scale datasets demands enormous computational resources, often dozens or even hundreds of times greater than training directly on VQA datasets. Recently, Vision-Language Models (VLMs) have shown remarkable generalization capabilities across a wide range of visual tasks, and have begun to demonstrate promising potential in quality assessment. In this work, we propose Q-CLIP, the first fully VLM-based framework for VQA. Q-CLIP enhances both visual and textual representations through a Shared Cross-Modal Adapter (SCMA), which contains only a minimal number of trainable parameters and is the only component that requires training. This design significantly reduces computational cost. In addition, we introduce a set of five learnable quality-level prompts to guide the VLM in perceiving subtle quality variations, thereby further enhancing the model's sensitivity to video quality. Furthermore, we investigate the impact of different frame sampling strategies on VQA performance, and find that frame-difference-based sampling leads to better generalization across datasets. Extensive experiments demonstrate that Q-CLIP exhibits excellent performance on several VQA datasets.
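The frame-difference-based sampling strategy mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the mean-absolute-difference change score, and the top-k selection rule are all assumptions; the paper only states that sampling frames by inter-frame difference generalizes better than alternatives.

```python
import numpy as np

def sample_frames_by_difference(frames: np.ndarray, num_samples: int) -> np.ndarray:
    """Pick the frames whose pixel-wise change from the previous frame
    is largest (a simple frame-difference-based sampler).

    frames: array of shape (T, H, W, C); returns selected indices in
    temporal order. If num_samples >= T, all frames are kept.
    """
    if num_samples >= len(frames):
        return np.arange(len(frames))
    # Mean absolute difference between consecutive frames; the first
    # frame has no predecessor, so its change score is zero.
    diffs = np.abs(frames[1:].astype(np.float64) - frames[:-1].astype(np.float64))
    scores = np.concatenate([[0.0], diffs.mean(axis=(1, 2, 3))])
    # Keep the num_samples frames with the highest change scores,
    # then restore temporal order for the downstream VLM.
    idx = np.argsort(scores)[-num_samples:]
    return np.sort(idx)
```

In a Q-CLIP-style pipeline, the selected frames would then be encoded by the (frozen) CLIP visual backbone; only the adapter parameters are trained.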
Related papers
- VQAThinker: Exploring Generalizable and Explainable Video Quality Assessment via Reinforcement Learning [50.34205095371895]
Video quality assessment aims to objectively quantify perceptual quality degradation. Existing VQA models suffer from two critical limitations. We propose VQAThinker, a reasoning-based VQA framework.
arXiv Detail & Related papers (2025-08-08T06:16:23Z) - Breaking Annotation Barriers: Generalized Video Quality Assessment via Ranking-based Self-Supervision [49.46606936180063]
Video quality assessment (VQA) is essential for quantifying quality in various video processing systems. We introduce a self-supervised learning framework for VQA to learn quality assessment capabilities from large-scale, unlabeled web videos. By training on a dataset 10× larger than the existing VQA benchmarks, our model achieves zero-shot performance.
arXiv Detail & Related papers (2025-05-06T15:29:32Z) - VQA$^2$: Visual Question Answering for Video Quality Assessment [76.81110038738699]
Video Quality Assessment (VQA) is a classic field in low-level visual perception. Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation. We introduce the VQA$^2$ Instruction dataset - the first visual question answering instruction dataset that focuses on video quality assessment. The VQA$^2$ series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos.
arXiv Detail & Related papers (2024-11-06T09:39:52Z) - LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models [53.64461404882853]
Video quality assessment (VQA) algorithms are needed to monitor and optimize the quality of streaming videos.
Here, we propose the first Large Multi-Modal Video Quality Assessment (LMM-VQA) model, which introduces a novel visual modeling strategy for quality-aware feature extraction.
arXiv Detail & Related papers (2024-08-26T04:29:52Z) - CLIPVQA:Video Quality Assessment via CLIP [56.94085651315878]
We propose an efficient CLIP-based Transformer method for the VQA problem (CLIPVQA).
The proposed CLIPVQA achieves new state-of-the-art VQA performance and up to 37% better generalizability than existing benchmark VQA methods.
arXiv Detail & Related papers (2024-07-06T02:32:28Z) - Enhancing Blind Video Quality Assessment with Rich Quality-aware Features [79.18772373737724]
We present a simple but effective method to enhance blind video quality assessment (BVQA) models for social media videos.
We explore rich quality-aware features from pre-trained blind image quality assessment (BIQA) and BVQA models as auxiliary features.
Experimental results demonstrate that the proposed model achieves the best performance on three public social media VQA datasets.
arXiv Detail & Related papers (2024-05-14T16:32:11Z) - Ada-DQA: Adaptive Diverse Quality-aware Feature Acquisition for Video Quality Assessment [25.5501280406614]
Video quality assessment (VQA) has attracted growing attention in recent years.
The great expense of annotating large-scale VQA datasets has become the main obstacle for current deep-learning methods.
An Adaptive Diverse Quality-aware feature Acquisition (Ada-DQA) framework is proposed to capture desired quality-related features.
arXiv Detail & Related papers (2023-08-01T16:04:42Z) - Analysis of Video Quality Datasets via Design of Minimalistic Video Quality Models [71.06007696593704]
Blind video quality assessment (BVQA) plays an indispensable role in monitoring and improving the end-users' viewing experience in real-world video-enabled media applications.
As an experimental field, the improvements of BVQA models have been measured primarily on a few human-rated VQA datasets.
We conduct a first-of-its-kind computational analysis of VQA datasets via minimalistic BVQA models.
arXiv Detail & Related papers (2023-07-26T06:38:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.