Breaking Annotation Barriers: Generalized Video Quality Assessment via Ranking-based Self-Supervision
- URL: http://arxiv.org/abs/2505.03631v2
- Date: Wed, 07 May 2025 10:07:00 GMT
- Title: Breaking Annotation Barriers: Generalized Video Quality Assessment via Ranking-based Self-Supervision
- Authors: Linhan Cao, Wei Sun, Kaiwei Zhang, Yicong Peng, Guangtao Zhai, Xiongkuo Min
- Abstract summary: Video quality assessment (VQA) is essential for quantifying quality in various video processing systems. We introduce a self-supervised learning framework for VQA that learns quality assessment capabilities from large-scale, unlabeled web videos. By training on a dataset $10\times$ larger than existing VQA benchmarks, our model achieves zero-shot performance that matches or surpasses supervised models.
- Score: 49.46606936180063
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video quality assessment (VQA) is essential for quantifying perceptual quality in various video processing workflows, spanning from camera capture systems to over-the-top streaming platforms. While recent supervised VQA models have made substantial progress, the reliance on manually annotated datasets -- a process that is labor-intensive, costly, and difficult to scale up -- has hindered further optimization of their generalization to unseen video content and distortions. To bridge this gap, we introduce a self-supervised learning framework for VQA that learns quality assessment capabilities from large-scale, unlabeled web videos. Our approach leverages a learning-to-rank paradigm to train a large multimodal model (LMM) on video pairs automatically labeled in two ways: quality pseudo-labeling by existing VQA models and relative quality ranking based on synthetic distortion simulations. Furthermore, we introduce a novel iterative self-improvement training strategy, where the trained model acts as an improved annotator that iteratively refines the annotation quality of the training data. By training on a dataset $10\times$ larger than existing VQA benchmarks, our model: (1) achieves zero-shot performance on in-domain VQA benchmarks that matches or surpasses supervised models; (2) demonstrates superior out-of-distribution (OOD) generalization across diverse video content and distortions; and (3) sets a new state-of-the-art when fine-tuned on human-labeled datasets. Extensive experimental results validate the effectiveness of our self-supervised approach in training generalized VQA models. The datasets and code will be publicly released to facilitate future research.
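The learning-to-rank objective described above can be illustrated with a minimal pairwise sketch. The code below is a generic margin-based ranking loss over pseudo-labeled video pairs, written under the assumption that `model` maps a video tensor to a scalar quality score; it is not the authors' released implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the pairwise learning-to-rank idea (illustrative, not the
# authors' released code). Assume `model` maps a video tensor to a scalar
# quality score, and each pair (video_a, video_b) is pseudo-labeled so that
# video_a is the higher-quality one, either by an existing VQA model or by
# construction (video_b is a synthetically distorted version of video_a).

def pairwise_ranking_loss(score_a: torch.Tensor,
                          score_b: torch.Tensor,
                          margin: float = 0.1) -> torch.Tensor:
    """Hinge-style ranking loss: penalized unless the known-better video
    scores at least `margin` above the worse one."""
    return torch.clamp(margin - (score_a - score_b), min=0.0).mean()

def train_step(model: nn.Module,
               optimizer: torch.optim.Optimizer,
               video_a: torch.Tensor,
               video_b: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = pairwise_ranking_loss(model(video_a), model(video_b))
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under the paper's iterative self-improvement strategy, the trained scorer would then serve as an improved annotator, re-labeling the unlabeled pairs to produce cleaner training data for the next round.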
Related papers
- Vision Large Language Models Are Good Noise Handlers in Engagement Analysis [54.397912827957164]
We propose a framework leveraging Vision Large Language Models (VLMs) to refine annotations and guide the training process. Our framework uses a questionnaire to extract behavioral cues and splits the data into high- and low-reliability subsets. We demonstrate that classical computer vision models trained on the refined high-reliability subsets and enhanced with our curriculum strategy show improvements.
arXiv Detail & Related papers (2025-11-18T18:50:26Z)
- Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation [33.51239538610773]
Vision-Language Models (VLMs) have shown remarkable generalization capabilities across a wide range of visual tasks. We propose Q-CLIP, the first fully VLM-based framework for Video Quality Assessment (VQA).
arXiv Detail & Related papers (2025-08-08T07:36:01Z)
- VQAThinker: Exploring Generalizable and Explainable Video Quality Assessment via Reinforcement Learning [50.34205095371895]
Video quality assessment aims to objectively quantify perceptual quality degradation. Existing VQA models suffer from two critical limitations. We propose VQAThinker, a reasoning-based VQA framework.
arXiv Detail & Related papers (2025-08-08T06:16:23Z)
- VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning [21.35520258725298]
VQ-Insight is a novel reasoning-style framework for AIGC video quality assessment. It combines image quality warm-up, general task-specific temporal learning, and joint optimization with the video generation model. It consistently outperforms state-of-the-art baselines in preference comparison, multi-dimension scoring, and natural video scoring.
arXiv Detail & Related papers (2025-06-23T12:20:14Z)
- Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment [10.701522670464463]
Multimodal large language models (MLLMs) can proficiently evaluate visual quality through interpretable assessments. We propose a unified two-stage training framework comprising a cold-start stage and a reinforcement-learning-based fine-tuning stage. We designate the models derived from these two stages as Q-Ponder-CI and Q-Ponder.
arXiv Detail & Related papers (2025-06-03T10:11:51Z)
- Scaling-up Perceptual Video Quality Assessment [54.691252495691955]
We show how to efficiently build high-quality, human-in-the-loop VQA multi-modal instruction databases. Our focus is on the technical and aesthetic quality dimensions, with abundant in-context instruction data to provide fine-grained VQA knowledge. Our results demonstrate that our models achieve state-of-the-art performance in both quality understanding and rating tasks.
arXiv Detail & Related papers (2025-05-28T16:24:52Z)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- Video Quality Assessment: A Comprehensive Survey [55.734935003021576]
Video quality assessment (VQA) is an important processing task, aiming at predicting the quality of videos in a manner consistent with human judgments of perceived quality. We present a survey of recent progress in the development of VQA algorithms and the benchmarking studies and databases that make them possible.
arXiv Detail & Related papers (2024-12-04T05:25:17Z)
- VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model [22.188795668927586]
Video Foundation Models (VFMs) have made significant progress recently.
Existing benchmarks and evaluation protocols are often limited by relatively poor diversity, high evaluation costs, and saturated performance metrics.
To address these issues, we build a comprehensive benchmark suite named VideoEval.
arXiv Detail & Related papers (2024-07-09T01:49:08Z)
- GAIA: Rethinking Action Quality Assessment for AI-Generated Videos [56.047773400426486]
Action quality assessment (AQA) algorithms predominantly focus on actions from specific real-world scenarios and are pre-trained with normative action features.
We construct GAIA, a Generic AI-generated Action dataset, by conducting a large-scale subjective evaluation from a novel causal reasoning-based perspective.
Results show that traditional AQA methods, action-related metrics in recent T2V benchmarks, and mainstream video quality methods perform poorly with an average SRCC of 0.454, 0.191, and 0.519, respectively.
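SRCC here denotes the Spearman rank correlation coefficient, the standard measure of agreement between predicted quality scores and human ratings. A quick illustration with made-up values (not the GAIA data):

```python
from scipy.stats import spearmanr

# Illustrative only: SRCC between model-predicted quality scores and human
# mean opinion scores (MOS). These values are made up for demonstration.
predicted = [3.1, 2.4, 4.8, 1.9, 3.7]
human_mos = [3.2, 2.5, 4.4, 1.8, 3.0]

srcc, p_value = spearmanr(predicted, human_mos)
print(f"SRCC = {srcc:.3f}")  # 0.900 here; 1.0 means identical rankings
```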
arXiv Detail & Related papers (2024-06-10T08:18:07Z)
- PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild [27.195339506769457]
Video quality assessment (VQA) is a challenging problem due to the numerous factors that can affect the perceptual quality of a video.
Annotating the mean opinion score (MOS) for videos is expensive and time-consuming, which limits the scale of VQA datasets.
We propose a VQA method named PTM-VQA, which leverages PreTrained Models to transfer knowledge from models pretrained on various pre-tasks.
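The general pattern of transferring diverse pretrained features to VQA can be sketched as follows; the frozen-backbone fusion and linear quality head here are illustrative assumptions, not PTM-VQA's exact design.

```python
import torch
import torch.nn as nn

# Illustrative sketch: combine frozen features from several pretrained
# backbones and regress a scalar quality score. The fusion scheme and
# quality head are assumptions for illustration.
class MultiBackboneVQA(nn.Module):
    def __init__(self, backbones: list[nn.Module], feat_dims: list[int]):
        super().__init__()
        self.backbones = nn.ModuleList(backbones)
        for b in self.backbones:            # keep pretrained features frozen
            for p in b.parameters():
                p.requires_grad = False
        self.head = nn.Linear(sum(feat_dims), 1)  # lightweight quality head

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, channels, height, width) for a sampled frame;
        # each backbone is assumed to return a (batch, feat_dim) embedding.
        feats = [b(frames) for b in self.backbones]
        return self.head(torch.cat(feats, dim=-1)).squeeze(-1)
```

For instance, `backbones` could be torchvision models with their classification heads removed, so only the small quality head is trained.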
arXiv Detail & Related papers (2024-05-28T02:37:29Z)
- Ada-DQA: Adaptive Diverse Quality-aware Feature Acquisition for Video Quality Assessment [25.5501280406614]
Video quality assessment (VQA) has attracted growing attention in recent years.
The great expense of annotating large-scale VQA datasets has become the main obstacle for current deep-learning methods.
An Adaptive Diverse Quality-aware feature Acquisition (Ada-DQA) framework is proposed to capture desired quality-related features.
arXiv Detail & Related papers (2023-08-01T16:04:42Z)
- Analysis of Video Quality Datasets via Design of Minimalistic Video Quality Models [71.06007696593704]
Blind video quality assessment (BVQA) plays an indispensable role in monitoring and improving the end-users' viewing experience in real-world video-enabled media applications.
As an experimental field, the improvements of BVQA models have been measured primarily on a few human-rated VQA datasets.
We conduct a first-of-its-kind computational analysis of VQA datasets via minimalistic BVQA models.
arXiv Detail & Related papers (2023-07-26T06:38:33Z)
- CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
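Self-supervised quality representation learning of this kind typically rests on a contrastive objective. Below is a generic InfoNCE-style sketch (not CONVIQT's actual training code), where two views of the same video form a positive pair:

```python
import torch
import torch.nn.functional as F

# Generic InfoNCE contrastive loss sketch (not CONVIQT's actual code):
# embeddings of two "views" of the same video (e.g., different clips or
# augmentations) should agree, while differing from other videos in batch.
def info_nce(z1: torch.Tensor, z2: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature              # (batch, batch) similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on diagonal
    return F.cross_entropy(logits, targets)
```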
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
- Unified Quality Assessment of In-the-Wild Videos with Mixed Datasets Training [20.288424566444224]
We focus on automatically assessing the quality of in-the-wild videos in computer vision applications.
To improve the performance of quality assessment models, we borrow intuitions from human perception.
We propose a mixed datasets training strategy for training a single VQA model with multiple datasets.
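The core difficulty when mixing VQA datasets is that their MOS scales are not directly comparable. A common workaround, sketched below as an assumption rather than this paper's exact loss, is to optimize a pairwise ranking objective only within each dataset, so that only relative order matters:

```python
import torch

# Sketch: within-dataset pairwise ranking so that incompatible MOS scales
# across datasets are never compared directly. `scores`, `mos`, and
# `dataset_ids` are per-sample tensors for one mixed batch (illustrative).
def mixed_dataset_ranking_loss(scores: torch.Tensor,
                               mos: torch.Tensor,
                               dataset_ids: torch.Tensor,
                               margin: float = 0.05) -> torch.Tensor:
    losses = []
    for d in dataset_ids.unique():
        sel = dataset_ids == d
        s, m = scores[sel], mos[sel]
        if s.numel() < 2:
            continue
        diff_s = s.unsqueeze(1) - s.unsqueeze(0)   # diff_s[i, j] = s_i - s_j
        diff_m = m.unsqueeze(1) - m.unsqueeze(0)
        mask = diff_m > 0                          # pairs where MOS says i > j
        if mask.any():                             # hinge on predicted order
            losses.append(torch.clamp(margin - diff_s[mask], min=0).mean())
    if not losses:
        return scores.new_zeros(())
    return torch.stack(losses).mean()
```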
arXiv Detail & Related papers (2020-11-09T09:22:57Z)
- UGC-VQA: Benchmarking Blind Video Quality Assessment for User Generated Content [59.13821614689478]
Blind quality prediction of in-the-wild videos is quite challenging, since the quality degradations of content are unpredictable, complicated, and often commingled.
Here we contribute to advancing the problem by conducting a comprehensive evaluation of leading VQA models.
By employing a feature selection strategy on top of leading VQA model features, we are able to extract 60 of the 763 statistical features used by the leading models, creating a new fusion-based model, VIDEVAL.
Our experimental results show that VIDEVAL achieves state-of-the-art performance at considerably lower computational cost than other leading models.
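As a generic illustration of that kind of feature selection (not the paper's exact procedure), one could rank the stacked statistical features by importance and keep the top 60:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Illustrative sketch of selecting a compact feature subset for a fusion
# BVQA model (not the paper's exact procedure). X: (videos, 763 statistical
# features) stacked from several BVQA models; y: human MOS labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 763))           # placeholder features
y = rng.uniform(1, 5, size=200)           # placeholder MOS in [1, 5]

selector = SelectFromModel(
    RandomForestRegressor(n_estimators=100, random_state=0),
    threshold=-np.inf,                    # rank purely by importance
    max_features=60,                      # keep the 60 most important
).fit(X, y)
X_selected = selector.transform(X)        # shape: (200, 60)
print(X_selected.shape)
```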
arXiv Detail & Related papers (2020-05-29T00:39:20Z)