Research on Audio-Visual Quality Assessment Dataset and Method for User-Generated Omnidirectional Video
- URL: http://arxiv.org/abs/2506.10331v1
- Date: Thu, 12 Jun 2025 03:40:30 GMT
- Title: Research on Audio-Visual Quality Assessment Dataset and Method for User-Generated Omnidirectional Video
- Authors: Fei Zhao, Da Pan, Zelu Qi, Ping Shi
- Abstract summary: We construct a dataset of omnidirectional audio and video (A/V) content. A subjective AVQA experiment is conducted on the dataset to obtain the Mean Opinion Scores. We construct an effective AVQA baseline model on the proposed dataset.
- Score: 6.117081165682988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In response to the rising prominence of the Metaverse, omnidirectional videos (ODVs) have garnered notable interest, gradually shifting from professional-generated content (PGC) to user-generated content (UGC). However, the study of audio-visual quality assessment (AVQA) within ODVs remains limited. To address this, we construct a dataset of UGC omnidirectional audio and video (A/V) content. The videos are captured by five individuals using two different types of omnidirectional cameras, yielding 300 videos that cover 10 different scene types. A subjective AVQA experiment is conducted on the dataset to obtain the Mean Opinion Scores (MOSs) of the A/V sequences. After that, to facilitate the development of the UGC-ODV AVQA field, we construct an effective AVQA baseline model on the proposed dataset; the baseline model consists of a video feature extraction module, an audio feature extraction module, and an audio-visual fusion module. The experimental results demonstrate that our model achieves optimal performance on the proposed dataset.
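The abstract names a three-module design (video feature extraction, audio feature extraction, audio-visual fusion) that regresses a MOS. Below is a minimal PyTorch-style sketch of that structure; the encoder choices, layer sizes, and feature dimensions are placeholder assumptions, not the paper's actual backbones.

```python
# Minimal sketch of a three-module AVQA baseline predicting a MOS.
# All dimensions and encoders are hypothetical stand-ins.
import torch
import torch.nn as nn

class AVQABaseline(nn.Module):
    def __init__(self, video_dim=1024, audio_dim=128, hidden_dim=256):
        super().__init__()
        # Video feature extraction: stand-in for a pretrained visual backbone.
        self.video_encoder = nn.Sequential(nn.Linear(video_dim, hidden_dim), nn.ReLU())
        # Audio feature extraction: stand-in for a spectrogram/audio encoder.
        self.audio_encoder = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        # Audio-visual fusion plus a quality regression head.
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))

    def forward(self, video_feats, audio_feats):
        v = self.video_encoder(video_feats)   # (batch, hidden_dim)
        a = self.audio_encoder(audio_feats)   # (batch, hidden_dim)
        return self.fusion(torch.cat([v, a], dim=-1)).squeeze(-1)  # predicted MOS

model = AVQABaseline()
scores = model(torch.randn(4, 1024), torch.randn(4, 128))  # 4 A/V clips
print(scores.shape)  # torch.Size([4])
```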
Related papers
- DiffVQA: Video Quality Assessment Using Diffusion Feature Extractor [22.35724335601674]
Video Quality Assessment (VQA) aims to evaluate video quality based on perceptual distortions and human preferences. We introduce a novel VQA framework, DiffVQA, which harnesses the robust generalization capabilities of diffusion models pre-trained on extensive datasets.
arXiv Detail & Related papers (2025-05-06T07:42:24Z)
- How Does Audio Influence Visual Attention in Omnidirectional Videos? Database and Model [50.15552768350462]
This paper comprehensively investigates audio-visual attention in omnidirectional videos (ODVs) from both subjective and objective perspectives. To advance the research on audio-visual saliency prediction for ODVs, we establish a new benchmark based on the AVS-ODV database.
arXiv Detail & Related papers (2024-08-10T02:45:46Z)
- CLIPVQA: Video Quality Assessment via CLIP [56.94085651315878]
We propose an efficient CLIP-based Transformer method for the VQA problem (CLIPVQA).
The proposed CLIPVQA achieves new state-of-the-art VQA performance and up to 37% better generalizability than existing benchmark VQA methods.
arXiv Detail & Related papers (2024-07-06T02:32:28Z)
- Audio-visual Saliency for Omnidirectional Videos [58.086575606742116]
We first establish the largest audio-visual saliency dataset for omnidirectional videos (AVS-ODV).
We analyze the visual attention behavior of the observers under various omnidirectional audio modalities and visual scenes based on the AVS-ODV dataset.
arXiv Detail & Related papers (2023-11-09T08:03:40Z)
- AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection [49.81915942821647]
This study introduces the audio-visual transformer-based ensemble network (AVTENet) to detect deepfake videos. For evaluation, we use the recently released benchmark multimodal audio-video FakeAVCeleb dataset. For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
- Multimodal Variational Auto-encoder based Audio-Visual Segmentation [46.67599800471001]
ECMVAE factorizes the representation of each modality into a modality-shared representation and a modality-specific representation, as sketched below.
Our approach leads to a new state-of-the-art for audio-visual segmentation, with a 3.84 mIoU performance leap.
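A minimal sketch of that shared/specific factorization follows; the encoder, dimensions, and alignment loss are illustrative assumptions, not the ECMVAE authors' implementation.

```python
# Sketch of factorizing each modality into a shared and a specific latent.
# Hypothetical dimensions; not the authors' code.
import torch
import torch.nn as nn

class FactorizedEncoder(nn.Module):
    """Encodes one modality into a modality-shared and a modality-specific latent."""
    def __init__(self, in_dim=512, shared_dim=64, specific_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.to_shared = nn.Linear(256, shared_dim)      # aligned across modalities
        self.to_specific = nn.Linear(256, specific_dim)  # private to this modality

    def forward(self, x):
        h = self.backbone(x)
        return self.to_shared(h), self.to_specific(h)

audio_enc, visual_enc = FactorizedEncoder(), FactorizedEncoder()
xa, xv = torch.randn(4, 512), torch.randn(4, 512)
za_shared, za_specific = audio_enc(xa)
zv_shared, zv_specific = visual_enc(xv)
# A cross-modal alignment loss would pull the shared codes together.
align_loss = nn.functional.mse_loss(za_shared, zv_shared)
```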
arXiv Detail & Related papers (2023-10-12T13:09:40Z)
- Perceptual Quality Assessment of Omnidirectional Audio-visual Signals [37.73157112698111]
Most existing quality assessment studies for omnidirectional videos (ODVs) only focus on the visual distortions of videos.
In this paper, we first establish a large-scale audio-visual quality assessment dataset for ODVs.
Then, we design three baseline methods for full-reference omnidirectional audio-visual quality assessment (OAVQA).
arXiv Detail & Related papers (2023-07-20T12:21:26Z)
- Audio-Visual Quality Assessment for User Generated Content: Database and Method [61.970768267688086]
Most existing VQA studies only focus on the visual distortions of videos, ignoring that the user's QoE also depends on the accompanying audio signals.
We construct the first AVQA database named the SJTU-UAV database, which includes 520 in-the-wild audio and video (A/V) sequences.
We also design a family of AVQA models, which fuse popular VQA methods with audio features via a support vector regressor (SVR), as sketched below.
The experimental results show that with the help of audio signals, the VQA models can evaluate the quality more accurately.
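A minimal sketch of such SVR fusion follows; the feature extractors and data are random stand-ins, not the SJTU-UAV authors' pipeline.

```python
# Sketch of SVR-based fusion of a VQA method's scores with audio features.
# All features and labels here are synthetic placeholders.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n_clips = 100
vqa_scores = rng.random((n_clips, 1))    # stand-in for a VQA method's output
audio_feats = rng.random((n_clips, 8))   # stand-in for extracted audio features
mos = rng.random(n_clips) * 4 + 1        # stand-in MOS labels in [1, 5]

X = np.hstack([vqa_scores, audio_feats])  # concatenate modalities before regression
model = SVR(kernel="rbf").fit(X[:80], mos[:80])
predicted_mos = model.predict(X[80:])     # fused audio-visual quality estimate
```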
arXiv Detail & Related papers (2023-03-04T11:49:42Z)
- A study on joint modeling and data augmentation of multi-modalities for audio-visual scene classification [64.59834310846516]
We propose two techniques, namely joint modeling and data augmentation, to improve system performance for audio-visual scene classification (AVSC).
Our final system achieves an accuracy of 94.2%, the best among all single AVSC systems submitted to DCASE 2021 Task 1b.
arXiv Detail & Related papers (2022-03-07T07:29:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.