Scaling Audio-Visual Quality Assessment Dataset via Crowdsourcing
- URL: http://arxiv.org/abs/2602.22659v1
- Date: Thu, 26 Feb 2026 06:18:11 GMT
- Title: Scaling Audio-Visual Quality Assessment Dataset via Crowdsourcing
- Authors: Renyu Yang, Jian Jin, Lili Meng, Meiqin Liu, Yilin Wang, Balu Adsumilli, Weisi Lin
- Abstract summary: We propose a practical approach for AVQA dataset construction. We design a crowdsourced subjective experiment framework for AVQA that breaks the constraints of in-lab settings and achieves reliable annotation across varied environments. We validate this approach through YT-NTU-AVQ, the largest and most diverse AVQA dataset to date, consisting of 1,620 user-generated audio and video sequences.
- Score: 62.250874651622574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual quality assessment (AVQA) research has been stalled by limitations of existing datasets: they are typically small in scale, with insufficient diversity in content and quality, and annotated only with overall scores. These shortcomings provide limited support for model development and multimodal perception research. We propose a practical approach for AVQA dataset construction. First, we design a crowdsourced subjective experiment framework for AVQA that breaks the constraints of in-lab settings and achieves reliable annotation across varied environments. Second, a systematic data preparation strategy is employed to ensure broad coverage of both quality levels and semantic scenarios. Third, we extend the dataset with additional annotations, enabling research on multimodal perception mechanisms and their relation to content. Finally, we validate this approach through YT-NTU-AVQ, the largest and most diverse AVQA dataset to date, consisting of 1,620 user-generated audio and video (A/V) sequences. The dataset and platform code are available at https://github.com/renyu12/YT-NTU-AVQ
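The abstract does not detail how crowd ratings become labels; as a minimal sketch (assuming per-subject rating matrices and a z-score screening rule, both illustrative rather than the paper's protocol), per-clip Mean Opinion Scores could be aggregated like this:

```python
import numpy as np

def mean_opinion_scores(ratings, z_thresh=2.0):
    """Aggregate crowdsourced ratings into per-clip MOS.

    ratings: array of shape (num_subjects, num_clips), with np.nan
    marking clips a subject did not rate. Subjects whose scores
    deviate strongly from the crowd consensus are screened out
    before averaging, a rough crowdsourced stand-in for in-lab
    subject rejection (e.g. ITU-R BT.500-style screening).
    """
    ratings = np.asarray(ratings, dtype=float)
    clip_mean = np.nanmean(ratings, axis=0)
    clip_std = np.nanstd(ratings, axis=0) + 1e-8
    z = np.abs((ratings - clip_mean) / clip_std)   # per-rating z-scores
    subject_dev = np.nanmean(z, axis=1)            # per-subject deviation
    kept = subject_dev < z_thresh                  # drop outlier subjects
    mos = np.nanmean(ratings[kept], axis=0)
    return mos, kept
```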
Related papers
- Q-Save: Towards Scoring and Attribution for Generated Video Evaluation [65.83319736145869]
We present Q-Save, a new benchmark dataset and model for holistic evaluation of AI-generated video (AIGV) quality. The dataset contains nearly 10,000 videos, each annotated with a scalar mean opinion score (MOS) and fine-grained attribution labels. We propose a unified evaluation model that jointly performs quality scoring and attribution-based explanation.
arXiv Detail & Related papers (2025-11-24T07:00:21Z)
- CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video [9.172799792564009]
We propose CAMP-VQA, a novel NR-VQA framework that exploits the semantic understanding capabilities of large models. Our approach introduces a quality-aware video metadata mechanism that integrates key fragments extracted from inter-frame variations. Our model consistently outperforms existing NR-VQA methods, achieving improved accuracy without the need for costly manual fine-grained annotations.
arXiv Detail & Related papers (2025-11-10T16:37:47Z)
- Research on Audio-Visual Quality Assessment Dataset and Method for User-Generated Omnidirectional Video [6.117081165682988]
We construct a dataset of omnidirectional audio and video (A/V) content. A subjective AVQA experiment is conducted on the dataset to obtain Mean Opinion Scores (MOS). We then build an effective AVQA baseline model on the proposed dataset.
arXiv Detail & Related papers (2025-06-12T03:40:30Z)
- Towards Generalized Video Quality Assessment: A Weak-to-Strong Learning Paradigm [76.63001244080313]
Video quality assessment (VQA) seeks to predict the perceptual quality of a video in alignment with human visual perception. The dominant VQA paradigm relies on supervised training with human-labeled datasets. We explore weak-to-strong (W2S) learning as a new paradigm for advancing VQA without reliance on large-scale human-labeled datasets.
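As a hedged illustration of the W2S idea (the model names, data loader, and MSE objective are assumptions for this sketch, not the paper's recipe), a frozen weak quality model can pseudo-label unlabeled videos for a stronger student:

```python
import torch

def weak_to_strong_epoch(weak_model, strong_model, unlabeled_loader, optimizer):
    """One epoch of weak-to-strong VQA training: a frozen weak quality
    model pseudo-labels unlabeled videos, and a stronger model regresses
    onto those pseudo-scores (illustrative sketch, not the paper's method)."""
    weak_model.eval()
    strong_model.train()
    loss_fn = torch.nn.MSELoss()
    for videos in unlabeled_loader:
        with torch.no_grad():
            pseudo_scores = weak_model(videos)  # weak supervision signal
        pred = strong_model(videos)             # strong model prediction
        loss = loss_fn(pred, pseudo_scores)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```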
arXiv Detail & Related papers (2025-05-06T15:29:32Z)
- FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning [31.61978841892981]
We introduce a novel dataset, FortisAVQA, constructed in two stages. The first stage expands the test space with greater diversity, while the second enables a refined robustness evaluation. Our architecture achieves state-of-the-art performance on FortisAVQA, with a notable improvement of 7.81%.
arXiv Detail & Related papers (2025-04-01T07:23:50Z)
- Video Quality Assessment: A Comprehensive Survey [55.734935003021576]
Video quality assessment (VQA) is an important processing task, aimed at predicting the quality of videos in a manner consistent with human judgments of perceived quality. We present a survey of recent progress in the development of VQA algorithms and the benchmarking studies and databases that make them possible.
arXiv Detail & Related papers (2024-12-04T05:25:17Z)
- Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering [25.577314828249897]
We propose a novel dataset, MUSIC-AVQA-R, crafted in two steps: rephrasing questions within the test split of a public dataset (MUSIC-AVQA) and introducing distribution shifts to split questions. Experimental results show that this architecture achieves state-of-the-art performance on MUSIC-AVQA-R, notably obtaining a significant improvement of 9.32%.
arXiv Detail & Related papers (2024-04-18T09:16:02Z)
- AQUALLM: Audio Question Answering Data Generation Using Large Language Models [2.2232550112727267]
We introduce a scalable AQA data generation pipeline that relies on Large Language Models (LLMs).
We present three extensive and high-quality benchmark datasets for AQA.
Models trained on our datasets demonstrate enhanced generalizability when compared to models trained using human-annotated AQA data.
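As a rough sketch of such a pipeline (the prompt wording and the `llm` callable are assumptions, not the AQUALLM implementation), an LLM can be prompted to turn audio captions into question-answer pairs:

```python
def make_aqa_pairs(captions, llm):
    """Hypothetical LLM-driven AQA data generation: turn each audio
    caption into a question-answer pair via a text-completion callable."""
    pairs = []
    for caption in captions:
        prompt = (
            f'Audio caption: "{caption}"\n'
            "Write one question a listener could answer from this audio, "
            "then its answer, formatted as 'Q: ... A: ...'."
        )
        pairs.append(llm(prompt))  # llm: any prompt-in, text-out function
    return pairs
```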
arXiv Detail & Related papers (2023-12-28T20:01:27Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multimodal clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Analysis of Video Quality Datasets via Design of Minimalistic Video Quality Models [71.06007696593704]
Blind video quality assessment (BVQA) plays an indispensable role in monitoring and improving end-users' viewing experience in real-world video-enabled media applications.
As in other experimental fields, improvements to BVQA models have been measured primarily on a few human-rated VQA datasets.
We conduct a first-of-its-kind computational analysis of VQA datasets via minimalistic BVQA models.
arXiv Detail & Related papers (2023-07-26T06:38:33Z)
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence, and that self-supervised models trained on our data, despite it being automatically constructed, achieve downstream performance similar to models trained on existing video datasets of similar scale.
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
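As a hedged sketch of correspondence-based curation (a greedy top-k rule standing in for the paper's subset optimization; the embeddings and names are illustrative assumptions):

```python
import numpy as np

def select_high_correspondence(audio_emb, video_emb, budget):
    """Rank clips by cosine similarity between their audio and video
    embeddings and keep the `budget` most audio-visually consistent
    ones (illustrative stand-in for subset optimization)."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    scores = (a * v).sum(axis=1)                # per-clip A/V correspondence
    return np.argsort(scores)[::-1][:budget]    # indices of selected clips
```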