Related papers: CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video

CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video

URL: http://arxiv.org/abs/2511.07290v1
Date: Mon, 10 Nov 2025 16:37:47 GMT
Title: CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video
Authors: Xinyi Wang, Angeliki Katsenou, Junxiao Shen, David Bull,
Abstract summary: We propose CAMP-VQA, a novel NR-VQA framework that exploits the semantic understanding capabilities of large models.<n>Our approach introduces a quality-aware video metadata mechanism that integrates key fragments extracted from inter-frame variations.<n>Our model consistently outperforms existing NR-VQA methods, achieving improved accuracy without the need for costly manual fine-grained annotations.
Score: 9.172799792564009
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The prevalence of user-generated content (UGC) on platforms such as YouTube and TikTok has rendered no-reference (NR) perceptual video quality assessment (VQA) vital for optimizing video delivery. Nonetheless, the characteristics of non-professional acquisition and the subsequent transcoding of UGC video on sharing platforms present significant challenges for NR-VQA. Although NR-VQA models attempt to infer mean opinion scores (MOS), their modeling of subjective scores for compressed content remains limited due to the absence of fine-grained perceptual annotations of artifact types. To address these challenges, we propose CAMP-VQA, a novel NR-VQA framework that exploits the semantic understanding capabilities of large vision-language models. Our approach introduces a quality-aware prompting mechanism that integrates video metadata (e.g., resolution, frame rate, bitrate) with key fragments extracted from inter-frame variations to guide the BLIP-2 pretraining approach in generating fine-grained quality captions. A unified architecture has been designed to model perceptual quality across three dimensions: semantic alignment, temporal characteristics, and spatial characteristics. These multimodal features are extracted and fused, then regressed to video quality scores. Extensive experiments on a wide variety of UGC datasets demonstrate that our model consistently outperforms existing NR-VQA methods, achieving improved accuracy without the need for costly manual fine-grained annotations. Our method achieves the best performance in terms of average rank and linear correlation (SRCC: 0.928, PLCC: 0.938) compared to state-of-the-art methods. The source code and trained models, along with a user-friendly demo, are available at: https://github.com/xinyiW915/CAMP-VQA.

Related papers

Q-Save: Towards Scoring and Attribution for Generated Video Evaluation [65.83319736145869]
We present Q-Save, a new benchmark dataset and model for holistic evaluation of AI-generated video (AIGV) quality.<n>The dataset contains near 10000 videos, each annotated with a scalar mean opinion score (MOS) and fine-grained attribution labels.<n>We propose a unified evaluation model that jointly performs quality scoring and attribution-based explanation.
arXiv Detail & Related papers (2025-11-24T07:00:21Z)
Towards Generalized Video Quality Assessment: A Weak-to-Strong Learning Paradigm [76.63001244080313]
Video quality assessment (VQA) seeks to predict the perceptual quality of a video in alignment with human visual perception.<n>The dominant VQA paradigm relies on supervised training with human-labeled datasets.<n>We explore weak-to-strong (W2S) learning as a new paradigm for advancing VQA without reliance on large-scale human-labeled datasets.
arXiv Detail & Related papers (2025-05-06T15:29:32Z)
DiffVQA: Video Quality Assessment Using Diffusion Feature Extractor [22.35724335601674]
Video Quality Assessment (VQA) aims to evaluate video quality based on perceptual distortions and human preferences.<n>We introduce a novel VQA framework, DiffVQA, which harnesses the robust generalization capabilities of diffusion models pre-trained on extensive datasets.
arXiv Detail & Related papers (2025-05-06T07:42:24Z)
ReLaX-VQA: Residual Fragment and Layer Stack Extraction for Enhancing Video Quality Assessment [35.00766551093652]
ReLaX-VQA is a novel No-Reference Video Quality Assessment (NRVQA) model.<n>It aims to address the challenges of evaluating the quality of diverse video content without reference to the original uncompressed videos.<n>It consistently outperforms existing NR-VQA methods, achieving an average S-Score of 0.8658 and PLCC of 0.8873.
arXiv Detail & Related papers (2024-07-16T08:33:55Z)
CLIPVQA:Video Quality Assessment via CLIP [56.94085651315878]
We propose an efficient CLIP-based Transformer method for the VQA problem ( CLIPVQA) The proposed CLIPVQA achieves new state-of-the-art VQA performance and up to 37% better generalizability than existing benchmark VQA methods.
arXiv Detail & Related papers (2024-07-06T02:32:28Z)
PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild [27.195339506769457]
Video quality assessment (VQA) is a challenging problem due to the numerous factors that can affect the perceptual quality of a video. Annotating the Mean opinion score (MOS) for videos is expensive and time-consuming, which limits the scale of VQA datasets. We propose a VQA method named PTM-VQA, which leverages PreTrained Models to transfer knowledge from models pretrained on various pre-tasks.
arXiv Detail & Related papers (2024-05-28T02:37:29Z)
Enhancing Blind Video Quality Assessment with Rich Quality-aware Features [79.18772373737724]
We present a simple but effective method to enhance blind video quality assessment (BVQA) models for social media videos. We explore rich quality-aware features from pre-trained blind image quality assessment (BIQA) and BVQA models as auxiliary features. Experimental results demonstrate that the proposed model achieves the best performance on three public social media VQA datasets.
arXiv Detail & Related papers (2024-05-14T16:32:11Z)
Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly. Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness. Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings. This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms. Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner. Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
A Deep Learning based No-reference Quality Assessment Model for UGC Videos [44.00578772367465]
Previous video quality assessment (VQA) studies either use the image recognition model or the image quality assessment (IQA) models to extract frame-level features of videos for quality regression. We propose a very simple but effective VQA model, which trains an end-to-end spatial feature extraction network to learn the quality-aware spatial feature representation from raw pixels of the video frames. With the better quality-aware features, we only use the simple multilayer perception layer (MLP) network to regress them into the chunk-level quality scores, and then the temporal average pooling strategy is adopted to obtain the video
arXiv Detail & Related papers (2022-04-29T12:45:21Z)
UGC-VQA: Benchmarking Blind Video Quality Assessment for User Generated Content [59.13821614689478]
Blind quality prediction of in-the-wild videos is quite challenging, since the quality degradations of content are unpredictable, complicated, and often commingled. Here we contribute to advancing the problem by conducting a comprehensive evaluation of leading VQA models. By employing a feature selection strategy on top of leading VQA model features, we are able to extract 60 of the 763 statistical features used by the leading models. Our experimental results show that VIDEVAL achieves state-of-theart performance at considerably lower computational cost than other leading models.
arXiv Detail & Related papers (2020-05-29T00:39:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.