Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment
- URL: http://arxiv.org/abs/2501.02706v1
- Date: Mon, 06 Jan 2025 01:18:11 GMT
- Title: Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment
- Authors: Jiaze Li, Haoran Xu, Shiding Zhu, Junwei He, Haozhao Wang,
- Abstract summary: We introduce MSA-VQA, a Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment.
Our hierarchical framework analyzes video content at three levels: frame, segment, and video.
We propose a Prompt Semantic Supervision Module using the text encoder of CLIP to ensure semantic consistency between videos and conditional prompts.
- Score: 15.529169236891532
- License:
- Abstract: The rapid development of diffusion models has recently advanced AI-generated videos greatly in terms of length and consistency, yet assessing AI-generated videos remains challenging. Previous approaches have often focused on User-Generated Content (UGC), but few have targeted AI-Generated Video Quality Assessment. In this work, we introduce MSA-VQA, a Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment, which leverages CLIP-based semantic supervision and cross-attention mechanisms. Our hierarchical framework analyzes video content at three levels: frame, segment, and video. We propose a Prompt Semantic Supervision Module using the text encoder of CLIP to ensure semantic consistency between videos and conditional prompts. Additionally, we propose the Semantic Mutation-aware Module to capture subtle variations between frames. Extensive experiments demonstrate that our method achieves state-of-the-art results.
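The three-level analysis and the prompt consistency check described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the mean pooling, the fixed `segment_len`, and the toy 2-D vectors are all assumptions made for illustration, and real inputs would be embeddings from CLIP's image and text encoders. The sketch scores prompt similarity at frame, segment, and video level, and measures the change between adjacent frames as a rough stand-in for the kind of frame-to-frame variation the Semantic Mutation-aware Module targets.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean_pool(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def multilevel_consistency(frame_embs, prompt_emb, segment_len=2):
    """Prompt similarity at frame, segment, and video level (hypothetical helper)."""
    frame_scores = [cosine(f, prompt_emb) for f in frame_embs]
    # Segments are consecutive, non-overlapping groups of frames (an assumption).
    segments = [frame_embs[i:i + segment_len]
                for i in range(0, len(frame_embs), segment_len)]
    segment_scores = [cosine(mean_pool(s), prompt_emb) for s in segments]
    video_score = cosine(mean_pool(frame_embs), prompt_emb)
    return frame_scores, segment_scores, video_score

def adjacent_frame_change(frame_embs):
    """1 - cosine similarity between consecutive frames; larger means a bigger jump."""
    return [1.0 - cosine(a, b) for a, b in zip(frame_embs, frame_embs[1:])]

# Toy placeholder embeddings, not real CLIP outputs.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
prompt = [1.0, 0.0]
frame_scores, segment_scores, video_score = multilevel_consistency(frames, prompt)
changes = adjacent_frame_change(frames)
```

In this toy input the second half of the video drifts away from the prompt, which shows up as low frame and segment scores there and as a single large value in `changes` at the boundary between the second and third frames.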
Related papers
- Generative Ghost: Investigating Ranking Bias Hidden in AI-Generated Videos [106.5804660736763]
Video information retrieval remains a fundamental approach for accessing video content.
We build on the observation that retrieval models often favor AI-generated content in ad-hoc and image retrieval tasks.
We investigate whether similar biases emerge in the context of challenging video retrieval.
arXiv Detail & Related papers (2025-02-11T07:43:47Z)
- Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency [4.922783970210658]
We propose CRAVE (Content-Rich AIGC Video Evaluator) for the evaluation of Sora-era AIGC videos.
arXiv Detail & Related papers (2025-02-06T13:41:24Z)
- Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z)
- Benchmarking Multi-dimensional AIGC Video Quality Assessment: A Dataset and Unified Model [56.03592388332793]
We investigate the AIGC-VQA problem, considering both subjective and objective quality assessment perspectives.
For the subjective perspective, we construct the Large-scale Generated Video Quality assessment (LGVQ) dataset, consisting of 2,808 AIGC videos.
We evaluate the perceptual quality of AIGC videos from three critical dimensions: spatial quality, temporal quality, and text-video alignment.
We propose the Unify Generated Video Quality assessment (UGVQ) model, designed to accurately evaluate the multi-dimensional quality of AIGC videos.
arXiv Detail & Related papers (2024-07-31T07:54:26Z)
- CLIPVQA: Video Quality Assessment via CLIP [56.94085651315878]
We propose an efficient CLIP-based Transformer method for the VQA problem (CLIPVQA).
The proposed CLIPVQA achieves new state-of-the-art VQA performance and up to 37% better generalizability than existing benchmark VQA methods.
arXiv Detail & Related papers (2024-07-06T02:32:28Z)
- Exploring AIGC Video Quality: A Focus on Visual Harmony, Video-Text Consistency and Domain Distribution Gap [4.922783970210658]
We categorize the assessment of AIGC video quality into three dimensions: visual harmony, video-text consistency, and domain distribution gap.
For each dimension, we design specific modules to provide a comprehensive quality assessment of AIGC videos.
Our research identifies significant variations in visual quality, fluidity, and style among videos generated by different text-to-video models.
arXiv Detail & Related papers (2024-04-21T08:27:20Z)
- Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the deficient volume and quality of multimodal instruction-tune data.
We present a novel alignment strategy, called Reinforcement Learning from AI Feedback (RLAIF), in which a multimodal AI system oversees itself.
Specifically, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Exploring Opinion-unaware Video Quality Assessment with Semantic Affinity Criterion [52.07084862209754]
We introduce an explicit semantic affinity index for opinion-unaware VQA using text-prompts in the contrastive language-image pre-training model.
We also aggregate it with different traditional low-level naturalness indexes through Gaussian normalization and sigmoid rescaling strategies.
The proposed Blind Unified Opinion-Unaware Video Quality Index via Semantic and Technical Metric Aggregation (BUONA-VISTA) outperforms existing opinion-unaware VQA methods by at least 20%.
arXiv Detail & Related papers (2023-02-26T08:46:07Z)
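The Gaussian normalization and sigmoid rescaling mentioned in the BUONA-VISTA entry above can be sketched generically as follows. This is a minimal sketch under stated assumptions, not the paper's method: it treats Gaussian normalization as a per-index z-score, applies a logistic sigmoid to each normalized value, and averages the rescaled indexes per video. The function names and the order of the steps are hypothetical, and the input numbers are made up.

```python
import math

def gaussian_normalize(scores):
    """Z-score a list of raw index values (zero mean, unit variance)."""
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    if std == 0:
        return [0.0] * len(scores)
    return [(s - mean) / std for s in scores]

def sigmoid(z):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def aggregate(index_lists):
    """Average the normalized, sigmoid-rescaled indexes per video (assumed scheme)."""
    rescaled = [[sigmoid(z) for z in gaussian_normalize(idx)] for idx in index_lists]
    return [sum(vals) / len(vals) for vals in zip(*rescaled)]

# Two hypothetical indexes on very different scales, three videos each.
combined = aggregate([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
```

Normalizing first puts indexes with different units on a common scale, so neither one dominates the average simply because its raw values are larger.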
This list is automatically generated from the titles and abstracts of the papers in this site.