Bridging Video Quality Scoring and Justification via Large Multimodal Models
- URL: http://arxiv.org/abs/2506.21011v1
- Date: Thu, 26 Jun 2025 05:02:25 GMT
- Title: Bridging Video Quality Scoring and Justification via Large Multimodal Models
- Authors: Qizhi Xie, Kun Yuan, Yunpeng Qu, Jiachao Gong, Mingda Wu, Ming Sun, Chao Zhou, Jihong Zhu
- Abstract summary: Classical video quality assessment (VQA) methods generate a numerical score to judge a video's perceived visual fidelity and clarity. But a score fails to describe the video's complex quality dimensions, restricting its applicability. Benefiting from the linguistic output, adapting video large multimodal models (LMMs) to VQA via instruction tuning has the potential to address this issue.
- Score: 14.166920184033463
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Classical video quality assessment (VQA) methods generate a numerical score to judge a video's perceived visual fidelity and clarity. Yet, a score fails to describe the video's complex quality dimensions, restricting its applicability. Benefiting from the linguistic output, adapting video large multimodal models (LMMs) to VQA via instruction tuning has the potential to address this issue. The core of the approach lies in the video quality-centric instruction data. Previous explorations mainly focus on the image domain, and their data generation processes heavily rely on human quality annotations and proprietary systems, limiting data scalability and effectiveness. To address these challenges, we propose the Score-based Instruction Generation (SIG) pipeline. Specifically, SIG first scores multiple quality dimensions of an unlabeled video and maps scores to text-defined levels. It then explicitly incorporates a hierarchical Chain-of-Thought (CoT) to model the correlation between specific dimensions and overall quality, mimicking the human visual system's reasoning process. The automated pipeline eliminates the reliance on expert-written quality descriptions and proprietary systems, ensuring data scalability and generation efficiency. The resulting Score2Instruct (S2I) dataset contains over 320K diverse instruction-response pairs, laying the basis for instruction tuning. Moreover, to advance video LMMs' quality scoring and justification abilities simultaneously, we devise a progressive tuning strategy to fully unleash the power of S2I. Built upon SIG, we further curate a benchmark termed S2I-Bench with 400 open-ended questions to better evaluate the quality justification capacity of video LMMs. Experimental results on the S2I-Bench and existing benchmarks indicate that our method consistently improves quality scoring and justification capabilities across multiple video LMMs.
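To make the abstract's score-to-level mapping step concrete, here is a minimal Python sketch. The dimension names, thresholds, and five-level labels are illustrative assumptions, not values from the paper; only the overall shape (per-dimension score, mapped to a text level, with specific dimensions stated before the overall judgment) follows the abstract's description of SIG and its hierarchical CoT.

```python
# Illustrative sketch of SIG's score-to-text-level mapping; level names,
# thresholds, and dimension names are hypothetical, not from the paper.

LEVELS = ["bad", "poor", "fair", "good", "excellent"]

def score_to_level(score: float) -> str:
    """Map a quality score in [0, 1] to a text-defined level."""
    idx = min(int(score * len(LEVELS)), len(LEVELS) - 1)
    return LEVELS[idx]

def build_response(dim_scores: dict) -> str:
    """State specific dimensions before the overall judgment, mimicking
    the hierarchical chain-of-thought described in the abstract."""
    parts = [f"The {dim} of the video is {score_to_level(s)}."
             for dim, s in dim_scores.items() if dim != "overall"]
    parts.append(f"Overall, the video quality is "
                 f"{score_to_level(dim_scores['overall'])}.")
    return " ".join(parts)

print(build_response({"sharpness": 0.82, "stability": 0.35, "overall": 0.60}))
```

In a pipeline like this, instruction-response pairs can be generated from any automatically scored video, which is what removes the dependence on expert-written descriptions.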
Related papers
- VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation [23.701884816475403]
Video captions play a crucial role in text-to-video generation tasks. Existing benchmarks inadequately address fine-grained evaluation. We introduce the Fine-grained Video Caption Evaluation Benchmark (VCapsBench).
arXiv Detail & Related papers (2025-05-29T14:34:25Z)
- Breaking Annotation Barriers: Generalized Video Quality Assessment via Ranking-based Self-Supervision [49.46606936180063]
Video quality assessment (VQA) is essential for quantifying quality in various video processing systems. We introduce a self-supervised learning framework for VQA to learn quality assessment capabilities from large-scale, unlabeled web videos. By training on a dataset $10\times$ larger than existing VQA benchmarks, our model achieves strong zero-shot performance.
arXiv Detail & Related papers (2025-05-06T15:29:32Z)
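Since Breaking Annotation Barriers learns from unlabeled videos via ranking-based self-supervision, a generic pairwise ranking objective helps illustrate the idea. The PyTorch sketch below is a common formulation of such a loss, not the paper's exact objective; the margin value and the assumption that one clip in each pair is known to be better (e.g., a source clip versus a degraded copy of it) are illustrative.

```python
# Generic pairwise ranking loss for self-supervised VQA, in PyTorch.
# A common formulation used here to illustrate ranking-based
# supervision; it is not the paper's exact objective.
import torch
import torch.nn.functional as F

def ranking_loss(score_better: torch.Tensor,
                 score_worse: torch.Tensor,
                 margin: float = 0.1) -> torch.Tensor:
    """Hinge loss pushing score_better above score_worse by >= margin.

    The 'better' label comes for free when pairs are built by degrading
    a source clip (an assumed pair-construction strategy).
    """
    return F.relu(margin - (score_better - score_worse)).mean()

# Toy usage with made-up predictions for two video pairs.
pred_better = torch.tensor([0.8, 0.6], requires_grad=True)
pred_worse = torch.tensor([0.5, 0.7], requires_grad=True)
loss = ranking_loss(pred_better, pred_worse)
loss.backward()
print(loss.item())  # 0.1 here: only the second pair violates the margin
```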
- AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM [54.44479359918971]
We first present AIGVQA-DB, a large-scale dataset comprising 36,576 AIGVs generated by 15 advanced text-to-video models using 1,048 prompts.
We then introduce AIGV-Assessor, a novel VQA model that leverages intricate quality attributes to capture precise video quality scores and pairwise video preferences.
arXiv Detail & Related papers (2024-11-26T08:43:15Z)
- VQA$^2$: Visual Question Answering for Video Quality Assessment [76.81110038738699]
Video Quality Assessment (VQA) is a classic field in low-level visual perception. Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation. We introduce the VQA$^2$ Instruction dataset, the first visual question answering instruction dataset that focuses on video quality assessment. The VQA$^2$ series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos.
arXiv Detail & Related papers (2024-11-06T09:39:52Z)
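The VQA$^2$ summary mentions interleaving visual and motion tokens; the toy sketch below shows one way such per-frame interleaving could be arranged. The tensor shapes and the frame-wise block pattern are assumptions for illustration, not details from the paper.

```python
# Toy illustration of interleaving visual and motion tokens per frame;
# shapes and the block-wise pattern are assumptions, not paper details.
import torch

def interleave_tokens(visual: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
    """visual, motion: (frames, tokens_per_frame, dim).

    Returns one sequence ordered as: frame-0 visual tokens, frame-0
    motion tokens, frame-1 visual tokens, frame-1 motion tokens, ...
    """
    f, t, d = visual.shape
    stacked = torch.stack([visual, motion], dim=1)  # (f, 2, t, d)
    return stacked.reshape(f * 2 * t, d)

seq = interleave_tokens(torch.randn(4, 16, 256), torch.randn(4, 16, 256))
print(seq.shape)  # torch.Size([128, 256])
```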
- Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs [76.15356325947731]
We introduce Q-Bench-Video, a new benchmark specifically designed to evaluate LMMs' proficiency in discerning video quality. We collect a total of 2,378 question-answer pairs and test them on 12 open-source and 5 proprietary LMMs. Our findings indicate that while LMMs have a foundational understanding of video quality, their performance remains incomplete and imprecise, with a notable discrepancy compared to human performance.
arXiv Detail & Related papers (2024-09-30T08:05:00Z)
- Benchmarking Multi-dimensional AIGC Video Quality Assessment: A Dataset and Unified Model [56.03592388332793]
We investigate the AIGC-VQA problem, considering both subjective and objective quality assessment perspectives. For the subjective perspective, we construct the Large-scale Generated Video Quality assessment (LGVQ) dataset, consisting of 2,808 AIGC videos. We evaluate the perceptual quality of AIGC videos from three critical dimensions: spatial quality, temporal quality, and text-video alignment. We propose the Unify Generated Video Quality assessment (UGVQ) model, designed to accurately evaluate the multi-dimensional quality of AIGC videos.
arXiv Detail & Related papers (2024-07-31T07:54:26Z)
- Q-Ground: Image Quality Grounding with Large Multi-modality Models [61.72022069880346]
We introduce Q-Ground, the first framework aimed at tackling fine-scale visual quality grounding.
Q-Ground combines large multi-modality models with detailed visual quality analysis.
Central to our contribution is the introduction of the QGround-100K dataset.
arXiv Detail & Related papers (2024-07-24T06:42:46Z)
- KVQ: Kwai Video Quality Assessment for Short-form Videos [24.5291786508361]
We establish the first large-scale Kaleidoscope short Video database for Quality assessment, KVQ, which comprises 600 user-uploaded short videos and 3,600 processed videos.
We propose the first short-form video quality evaluator, KSVQE, which identifies quality-determining semantics by leveraging the content understanding of large vision-language models.
arXiv Detail & Related papers (2024-02-11T14:37:54Z)
- Knowledge Guided Semi-Supervised Learning for Quality Assessment of User Generated Videos [9.681456357957819]
We design a self-supervised framework to generate robust quality-aware features for videos.
We then propose a dual-model based Semi Supervised Learning (SSL) method specifically designed for the Video Quality Assessment task.
Our SSL-VQA method uses the ST-VQRL backbone to produce robust performance across various VQA datasets.
Our model improves on state-of-the-art performance by around 10% when trained with only limited data, and by around 15% when unlabelled data is also used in SSL.
arXiv Detail & Related papers (2023-12-24T07:32:03Z)
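To illustrate the dual-model semi-supervised idea behind SSL-VQA, here is a minimal sketch combining a supervised regression loss on scarce labels with a cross-model consistency term on unlabeled videos. The toy feature size, architectures, and the weight `lam` are assumptions; the paper's actual ST-VQRL backbone and loss design are not reproduced here.

```python
# Minimal dual-model semi-supervised sketch: supervised regression on
# scarce labels plus a cross-model consistency term on unlabeled data.
# Architectures, feature size, and the weight `lam` are assumptions.
import torch
import torch.nn as nn

model_a = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
model_b = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
mse = nn.MSELoss()

def ssl_step(labeled_x, labels, unlabeled_x, lam: float = 0.5):
    """One training step over a labeled batch and an unlabeled batch."""
    sup = mse(model_a(labeled_x).squeeze(-1), labels)
    # Each model supplies a detached target for the other on unlabeled data.
    pa, pb = model_a(unlabeled_x), model_b(unlabeled_x)
    cons = mse(pa, pb.detach()) + mse(pb, pa.detach())
    return sup + lam * cons

# Toy batches: 8 labeled and 32 unlabeled 128-d video features.
loss = ssl_step(torch.randn(8, 128), torch.rand(8), torch.randn(32, 128))
print(loss.item())
```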