Related papers: Scaling-up Perceptual Video Quality Assessment

URL: http://arxiv.org/abs/2505.22543v1
Date: Wed, 28 May 2025 16:24:52 GMT
Title: Scaling-up Perceptual Video Quality Assessment
Authors: Ziheng Jia, Zicheng Zhang, Zeyu Zhang, Yingji Liang, Xiaorong Zhu, Chunyi Li, Jinliang Han, Haoning Wu, Bin Wang, Haoran Zhang, Guanyu Zhu, Qiyong Zhao, Xiaohong Liu, Guangtao Zhai, Xiongkuo Min
Abstract summary: We show how to efficiently build high-quality, human-in-the-loop VQA multi-modal instruction databases. Our focus is on the technical and aesthetic quality dimensions, with abundant in-context instruction data to provide fine-grained VQA knowledge. Our results demonstrate that our models achieve state-of-the-art performance in both quality understanding and rating tasks.
Score: 54.691252495691955
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The data scaling law has been shown to significantly enhance the performance of large multi-modal models (LMMs) across various downstream tasks. However, in the domain of perceptual video quality assessment (VQA), the potential of scaling law remains unprecedented due to the scarcity of labeled resources and the insufficient scale of datasets. To address this, we propose OmniVQA, an efficient framework designed to efficiently build high-quality, human-in-the-loop VQA multi-modal instruction databases (MIDBs). We then scale up to create OmniVQA-Chat-400K, the largest MIDB in the VQA field concurrently. Our focus is on the technical and aesthetic quality dimensions, with abundant in-context instruction data to provide fine-grained VQA knowledge. Additionally, we have built the OmniVQA-MOS-20K dataset to enhance the model's quantitative quality rating capabilities. We then introduce a complementary training strategy that effectively leverages the knowledge from datasets for quality understanding and quality rating tasks. Furthermore, we propose the OmniVQA-FG (fine-grain)-Benchmark to evaluate the fine-grained performance of the models. Our results demonstrate that our models achieve state-of-the-art performance in both quality understanding and rating tasks.

Related papers

QualiRAG: Retrieval-Augmented Generation for Visual Quality Understanding [80.66379018208568]
Visual quality assessment is shifting from prediction to interpretable quality understanding. Current approaches rely on supervised fine-tuning or reinforcement learning on curated instruction. We propose QualiRAG, a framework that systematically leverages latent perceptual knowledge of large multimodal models for visual quality perception.
arXiv Detail & Related papers (2026-01-26T06:27:03Z)
iDETEX: Empowering MLLMs for Intelligent DETailed EXplainable IQA [10.857047397246598]
iDETEX is a unified multimodal large language model (MLLM) capable of simultaneously performing three key tasks: quality grounding, perception, and description. We validate our approach on the large-scale ViDA-UGC benchmark, where iDETEX achieves state-of-the-art performance across all subtasks.
arXiv Detail & Related papers (2025-10-20T09:26:12Z)
Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation [33.51239538610773]
Vision-Language Models (VLMs) have shown remarkable generalization capabilities across a wide range of visual tasks. We propose Q-CLIP, the first fully VLMs-based framework for Video Quality Assessment (VQA).
arXiv Detail & Related papers (2025-08-08T07:36:01Z)
Breaking Annotation Barriers: Generalized Video Quality Assessment via Ranking-based Self-Supervision [49.46606936180063]
Video quality assessment (VQA) is essential for quantifying quality in various video processing systems. We introduce a self-supervised learning framework for VQA to learn quality assessment capabilities from large-scale, unlabeled web videos. By training on a dataset $10\times$ larger than the existing VQA benchmarks, our model achieves zero-shot performance.
arXiv Detail & Related papers (2025-05-06T15:29:32Z)
VQA$^2$: Visual Question Answering for Video Quality Assessment [76.81110038738699]
Video Quality Assessment (VQA) is a classic field in low-level visual perception. Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation. We introduce the VQA2 Instruction dataset - the first visual question answering instruction dataset that focuses on video quality assessment. The VQA2 series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos.
arXiv Detail & Related papers (2024-11-06T09:39:52Z)
Q-Ground: Image Quality Grounding with Large Multi-modality Models [61.72022069880346]
We introduce Q-Ground, the first framework aimed at tackling fine-scale visual quality grounding. Q-Ground combines large multi-modality models with detailed visual quality analysis. Central to our contribution is the introduction of the QGround-100K dataset.
arXiv Detail & Related papers (2024-07-24T06:42:46Z)
ATTIQA: Generalizable Image Quality Feature Extractor using Attribute-aware Pretraining [25.680035174334886]
In no-reference image quality assessment (NR-IQA), the challenge of limited dataset sizes hampers the development of robust and generalizable models. We propose a novel pretraining framework that constructs a generalizable representation for IQA by selectively extracting quality-related knowledge. Our approach achieves state-of-the-art performance on multiple IQA datasets and exhibits remarkable generalization capabilities.
arXiv Detail & Related papers (2024-06-03T06:03:57Z)
Descriptive Image Quality Assessment in the Wild [25.503311093471076]
VLM-based Image Quality Assessment (IQA) seeks to describe image quality linguistically to align with human expression. We introduce Depicted image Quality Assessment in the Wild (DepictQA-Wild). Our method includes a multi-functional IQA task paradigm that encompasses both assessment and comparison tasks, brief and detailed responses, full-reference and non-reference scenarios.
arXiv Detail & Related papers (2024-05-29T07:49:15Z)
Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly. Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness. Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings. This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
Ada-DQA: Adaptive Diverse Quality-aware Feature Acquisition for Video Quality Assessment [25.5501280406614]
Video quality assessment (VQA) has attracted growing attention in recent years. The great expense of annotating large-scale VQA datasets has become the main obstacle for current deep-learning methods. An Adaptive Diverse Quality-aware feature Acquisition (Ada-DQA) framework is proposed to capture desired quality-related features.
arXiv Detail & Related papers (2023-08-01T16:04:42Z)
Analysis of Video Quality Datasets via Design of Minimalistic Video Quality Models [71.06007696593704]
Blind video quality assessment (BVQA) plays an indispensable role in monitoring and improving the end-users' viewing experience in real-world video-enabled media applications. As an experimental field, the improvements of BVQA models have been measured primarily on a few human-rated VQA datasets. We conduct a first-of-its-kind computational analysis of VQA datasets via minimalistic BVQA models.
arXiv Detail & Related papers (2023-07-26T06:38:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.