Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment
- URL: http://arxiv.org/abs/2501.02706v1
- Date: Mon, 06 Jan 2025 01:18:11 GMT
- Title: Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment
- Authors: Jiaze Li, Haoran Xu, Shiding Zhu, Junwei He, Haozhao Wang,
- Abstract summary: We introduce MSA-VQA, a Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment. Our hierarchical framework analyzes video content at three levels: frame, segment, and video. We propose a Prompt Semantic Supervision Module that uses the text encoder of CLIP to ensure semantic consistency between videos and conditional prompts.
- Score: 15.529169236891532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid development of diffusion models has greatly advanced AI-generated videos in length and consistency, yet assessing AI-generated videos remains challenging. Previous approaches have largely focused on User-Generated Content (UGC), and few methods target quality assessment of AI-generated videos. In this work, we introduce MSA-VQA, a Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment, which leverages CLIP-based semantic supervision and cross-attention mechanisms. Our hierarchical framework analyzes video content at three levels: frame, segment, and video. We propose a Prompt Semantic Supervision Module that uses the text encoder of CLIP to ensure semantic consistency between videos and their conditional prompts, and a Semantic Mutation-aware Module to capture subtle variations between frames. Extensive experiments demonstrate that our method achieves state-of-the-art results.
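To make the described design easier to picture, the following is a minimal, hypothetical sketch of a CLIP-text-supervised quality head with cross-attention over frame-level features; the class name, feature dimensions, fusion strategy, and loss term are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' released code): a CLIP-text-supervised
# quality head that fuses frame-, segment- and video-level features via
# cross-attention with the prompt embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMSAVQAHead(nn.Module):
    def __init__(self, feat_dim=512, n_heads=8):
        super().__init__()
        # Prompt embedding (query) attends to per-frame features (keys/values).
        self.cross_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        # Fuse the three levels of description into a single quality score.
        self.regressor = nn.Sequential(
            nn.Linear(3 * feat_dim, feat_dim), nn.GELU(), nn.Linear(feat_dim, 1)
        )

    def forward(self, frame_feats, segment_feats, video_feat, prompt_emb):
        # frame_feats: (B, T, D), segment_feats: (B, S, D),
        # video_feat: (B, D), prompt_emb: (B, D) from the CLIP text encoder.
        attended, _ = self.cross_attn(prompt_emb.unsqueeze(1), frame_feats, frame_feats)
        fused = torch.cat([attended.squeeze(1), segment_feats.mean(dim=1), video_feat], dim=-1)
        score = self.regressor(fused).squeeze(-1)  # predicted quality score
        # Semantic-consistency term: align the video-level feature with the prompt.
        sem_loss = 1.0 - F.cosine_similarity(video_feat, prompt_emb, dim=-1).mean()
        return score, sem_loss

# Example with random tensors standing in for CLIP features.
head = ToyMSAVQAHead()
score, sem_loss = head(torch.randn(2, 16, 512), torch.randn(2, 4, 512),
                       torch.randn(2, 512), torch.randn(2, 512))
```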
Related papers
- Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation [94.23160400824969]
We propose a two-stage framework that leverages "shots" as the fundamental units of video understanding.
This includes extending temporal context to neighbouring shots and incorporating film grammar devices, such as shot scales and thread structures.
Our method is compatible with both open-source and proprietary Visual-Language Models.
arXiv Detail & Related papers (2025-04-01T17:59:57Z)
- AIGVE-Tool: AI-Generated Video Evaluation Toolkit with Multifaceted Benchmark [8.827755848017578]
Existing metrics lack a unified framework for systematically categorizing methodologies.
We introduce AIGVE-Tool, a unified framework that provides a structured taxonomy and evaluation pipeline for AI-generated video evaluation.
A large-scale benchmark dataset is created with five SOTA video generation models based on hand-crafted instructions and prompts.
arXiv Detail & Related papers (2025-03-18T09:36:33Z)
- M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment [65.3860007085689]
M3-AGIQA is a comprehensive framework for AGI quality assessment.
It includes a structured multi-round evaluation mechanism, where intermediate image descriptions are generated.
Experiments conducted on multiple benchmark datasets demonstrate that M3-AGIQA achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-02-21T03:05:45Z)
- Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency [4.922783970210658]
We propose CRAVE (Content-Rich AIGC Video Evaluator) for the evaluation of Sora-era AIGC videos.
arXiv Detail & Related papers (2025-02-06T13:41:24Z)
- Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z)
- Benchmarking Multi-dimensional AIGC Video Quality Assessment: A Dataset and Unified Model [56.03592388332793]
We investigate the AIGC-VQA problem from both subjective and objective quality assessment perspectives. For the subjective perspective, we construct the Large-scale Generated Video Quality assessment (LGVQ) dataset, consisting of 2,808 AIGC videos, and evaluate their perceptual quality along three critical dimensions: spatial quality, temporal quality, and text-video alignment. We also propose the Unify Generated Video Quality assessment (UGVQ) model, designed to accurately evaluate the multi-dimensional quality of AIGC videos.
arXiv Detail & Related papers (2024-07-31T07:54:26Z)
- CLIPVQA: Video Quality Assessment via CLIP [56.94085651315878]
We propose an efficient CLIP-based Transformer method for the VQA problem (CLIPVQA).
The proposed CLIPVQA achieves new state-of-the-art VQA performance and up to 37% better generalizability than existing benchmark VQA methods.
arXiv Detail & Related papers (2024-07-06T02:32:28Z)
- Exploring AIGC Video Quality: A Focus on Visual Harmony, Video-Text Consistency and Domain Distribution Gap [4.922783970210658]
We categorize the assessment of AIGC video quality into three dimensions: visual harmony, video-text consistency, and domain distribution gap.
For each dimension, we design specific modules to provide a comprehensive quality assessment of AIGC videos.
Our research identifies significant variations in visual quality, fluidity, and style among videos generated by different text-to-video models.
arXiv Detail & Related papers (2024-04-21T08:27:20Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the limited volume and quality of multimodal instruction-tuning data.
We present a novel alignment strategy, Reinforcement Learning from AI Feedback (RLAIF), in which a multimodal AI system is employed to oversee itself.
Specifically, we propose context-aware reward modeling, providing detailed video descriptions as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings (a sketch follows this entry).
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
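The CLIP-score-guided frame sampling mentioned in the VaQuitA summary above can be illustrated with the short, hypothetical sketch below; the checkpoint name, top-k policy, and query text are assumptions rather than the paper's exact procedure.

```python
# Hypothetical illustration of CLIP-score-guided frame selection; the VaQuitA
# paper's exact ranking procedure, checkpoint and top-k policy may differ.
import torch
from transformers import CLIPModel, CLIPProcessor

def select_frames_by_clip_score(frames, query_text, k=8,
                                model_name="openai/clip-vit-base-patch32"):
    """Return the k frames (PIL images) most similar to `query_text` under CLIP."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(text=[query_text], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image[:, 0]  # one score per frame
    top_idx = scores.topk(min(k, len(frames))).indices.sort().values.tolist()
    return [frames[i] for i in top_idx]  # keep the selected frames in temporal order

# Usage (hypothetical): key_frames = select_frames_by_clip_score(pil_frames, prompt, k=8)
```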
- Exploring Opinion-unaware Video Quality Assessment with Semantic Affinity Criterion [52.07084862209754]
We introduce an explicit semantic affinity index for opinion-unaware VQA using text prompts in the contrastive language-image pre-training (CLIP) model.
We also aggregate it with traditional low-level naturalness indices through Gaussian normalization and sigmoid rescaling strategies (sketched after this entry).
The proposed Blind Unified Opinion-Unaware Video Quality Index via Semantic and Technical Metric Aggregation (BUONA-VISTA) outperforms existing opinion-unaware VQA methods by at least 20%.
arXiv Detail & Related papers (2023-02-26T08:46:07Z)
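A minimal sketch of the aggregation step named in the BUONA-VISTA summary above (Gaussian normalization followed by sigmoid rescaling, then averaging); the index names and equal weighting are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of "Gaussian normalization + sigmoid rescaling" aggregation;
# the index names and the equal weighting are illustrative assumptions.
import numpy as np

def aggregate_quality_indices(indices):
    """indices: dict mapping index name -> 1D array of raw per-video scores."""
    rescaled = []
    for scores in indices.values():
        z = (scores - scores.mean()) / (scores.std() + 1e-8)  # Gaussian normalization
        rescaled.append(1.0 / (1.0 + np.exp(-z)))             # sigmoid rescaling to (0, 1)
    return np.mean(rescaled, axis=0)                          # simple unweighted average

overall = aggregate_quality_indices({
    "semantic_affinity": np.random.rand(100),
    "spatial_naturalness": np.random.rand(100),
    "temporal_naturalness": np.random.rand(100),
})
```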