Telepresence Video Quality Assessment
- URL: http://arxiv.org/abs/2207.09956v1
- Date: Wed, 20 Jul 2022 15:02:55 GMT
- Title: Telepresence Video Quality Assessment
- Authors: Zhenqiang Ying and Deepti Ghadiyaram and Alan Bovik
- Abstract summary: We create an online video quality prediction framework for live streaming, using a multi-modal learning framework with separate pathways to compute visual and audio quality predictions.
Our all-in-one model is able to provide accurate quality predictions at the patch, frame, clip, and audiovisual levels.
- Score: 13.417089780219326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video conferencing, which includes both video and audio content, has
contributed to dramatic increases in Internet traffic, as the COVID-19 pandemic
forced millions of people to work and learn from home. Because of this, efficient and
accurate video quality tools are needed to monitor and perceptually optimize
telepresence traffic streamed via Zoom, Webex, Meet, etc. However, existing
models are limited in their prediction capabilities on multi-modal, live
streaming telepresence content. Here we address the significant challenges of
Telepresence Video Quality Assessment (TVQA) in several ways. First, we
mitigated the dearth of subjectively labeled data by collecting ~2k
telepresence videos from different countries, on which we crowdsourced ~80k
subjective quality labels. Using this new resource, we created a
first-of-a-kind online video quality prediction framework for live streaming,
using a multi-modal learning framework with separate pathways to compute visual
and audio quality predictions. Our all-in-one model is able to provide accurate
quality predictions at the patch, frame, clip, and audiovisual levels. Our
model achieves state-of-the-art performance on both existing quality databases
and our new TVQA database, at a considerably lower computational expense,
making it an attractive solution for mobile and embedded systems.
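The abstract gives no implementation details, so the following is only a minimal PyTorch-style sketch of the kind of two-pathway, multi-level design it describes; the module names, layer choices, and dimensions (e.g. TwoPathwayAVQualityModel, the 128-d audio features) are illustrative assumptions, not the authors' architecture, and the patch level is omitted for brevity.

```python
# Minimal sketch (not the authors' code): a two-pathway audiovisual quality
# predictor that pools frame-level scores into a clip-level score and fuses it
# with an audio-quality estimate. All architecture choices and dimensions are
# illustrative assumptions.
import torch
import torch.nn as nn

class TwoPathwayAVQualityModel(nn.Module):
    def __init__(self, audio_feat_dim=128):
        super().__init__()
        # Visual pathway: per-frame feature extractor + frame-quality head.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.frame_head = nn.Linear(32, 1)
        # Audio pathway: maps precomputed audio features (e.g., spectrogram
        # statistics) to a single audio-quality score.
        self.audio_head = nn.Sequential(
            nn.Linear(audio_feat_dim, 64), nn.ReLU(), nn.Linear(64, 1),
        )
        # Fusion of temporally pooled visual quality and audio quality.
        self.fusion = nn.Linear(2, 1)

    def forward(self, frames, audio_feats):
        # frames: (batch, time, 3, H, W); audio_feats: (batch, audio_feat_dim)
        b, t, c, h, w = frames.shape
        feats = self.frame_encoder(frames.reshape(b * t, c, h, w))
        frame_scores = self.frame_head(feats).reshape(b, t)   # frame level
        clip_score = frame_scores.mean(dim=1, keepdim=True)   # clip level
        audio_score = self.audio_head(audio_feats)            # audio quality
        av_score = self.fusion(torch.cat([clip_score, audio_score], dim=1))
        return frame_scores, clip_score, audio_score, av_score

# Usage: quality predictions for two 8-frame clips with 128-d audio features.
model = TwoPathwayAVQualityModel()
frame_q, clip_q, audio_q, av_q = model(torch.randn(2, 8, 3, 64, 64),
                                        torch.randn(2, 128))
```

The point of the sketch is only the structure the abstract names: per-frame scores are pooled into a clip-level score, and a separate audio pathway is fused with it to yield an audiovisual prediction.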
Related papers
- Satellite Streaming Video QoE Prediction: A Real-World Subjective Database and Network-Level Prediction Models [59.061552498630874]
We introduce the LIVE-Viasat Real-World Satellite QoE Database.
This database consists of 179 videos recorded from real-world streaming services affected by various authentic distortion patterns.
We demonstrate the usefulness of this unique new resource by evaluating the efficacy of QoE-prediction models on it.
We also created a new model that maps the network parameters to predicted human perception scores, which can be used by ISPs to optimize the video streaming quality of their networks.
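As a loose illustration of that general idea, mapping network-level parameters to a predicted human opinion score, here is a small regression sketch; it is not the paper's model, and the feature names and toy data are hypothetical placeholders.

```python
# Illustrative sketch only (not the paper's model): regressing a mean opinion
# score (MOS) from network-level parameters. Features and data are made up.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical network-level features per streaming session:
# [throughput_mbps, rtt_ms, packet_loss_pct, rebuffer_ratio]
X_train = np.array([
    [12.0, 600.0, 0.1, 0.00],
    [ 3.5, 750.0, 1.2, 0.08],
    [ 8.0, 650.0, 0.4, 0.02],
    [ 1.8, 900.0, 2.5, 0.15],
])
mos_train = np.array([4.3, 2.9, 3.8, 2.1])  # subjective quality labels

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, mos_train)

# Predict perceived quality for a new session's network measurements.
print(model.predict([[6.0, 700.0, 0.8, 0.05]]))
```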
arXiv Detail & Related papers (2024-10-17T18:22:50Z)
- CLIPVQA: Video Quality Assessment via CLIP [56.94085651315878]
We propose an efficient CLIP-based Transformer method for the VQA problem (CLIPVQA)
The proposed CLIPVQA achieves new state-of-the-art VQA performance and up to 37% better generalizability than existing benchmark VQA methods.
arXiv Detail & Related papers (2024-07-06T02:32:28Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA)
arXiv Detail & Related papers (2024-03-18T16:52:49Z)
- Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling [125.95527079960725]
Transformer-based models have achieved top performance on major video recognition benchmarks.
Video Mobile-Former is the first Transformer-based video model which constrains the computational budget within 1G FLOPs.
arXiv Detail & Related papers (2022-08-25T17:59:00Z)
- A Multimodal Framework for Video Ads Understanding [64.70769354696019]
We develop a multimodal system to improve the structured analysis of advertising video content.
Our solution achieved a score of 0.2470 measured in consideration of localization and prediction accuracy, ranking fourth in the 2021 TAAC final leaderboard.
arXiv Detail & Related papers (2021-08-29T16:06:00Z)
- RAPIQUE: Rapid and Accurate Video Quality Prediction of User Generated Content [44.03188436272383]
We introduce an effective and efficient video quality model for user-generated content, which we dub the Rapid and Accurate Video Quality Evaluator (RAPIQUE)
RAPIQUE combines and leverages the advantages of both quality-aware scene statistics features and semantics-aware deep convolutional features.
Our experimental results on recent large-scale video quality databases show that RAPIQUE delivers top performances on all the datasets at a considerably lower computational expense.
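As a rough, hedged sketch of that kind of feature fusion (not the RAPIQUE implementation; the simplified statistics, the ResNet-18 backbone, and the SVR regressor below are stand-in assumptions):

```python
# Sketch of the general idea: concatenate quality-aware scene statistics with
# semantics-aware deep features, then regress a quality score.
import numpy as np
import torch
import torchvision.models as models
from sklearn.svm import SVR

resnet = models.resnet18(weights=None)  # stand-in deep feature backbone
resnet.fc = torch.nn.Identity()         # expose the 512-d pooled features
resnet.eval()

def scene_statistics(frame):
    # Stand-in for quality-aware natural scene statistics: moments of
    # mean-subtracted, contrast-normalized luminance coefficients.
    gray = frame.mean(axis=2)
    mscn = (gray - gray.mean()) / (gray.std() + 1e-6)
    return np.array([mscn.mean(), mscn.std(),
                     (mscn ** 3).mean(), (mscn ** 4).mean()])

def fused_video_features(frames):
    # frames: list of HxWx3 float arrays in [0, 1]
    nss = np.mean([scene_statistics(f) for f in frames], axis=0)
    batch = torch.from_numpy(np.stack(frames)).permute(0, 3, 1, 2).float()
    with torch.no_grad():
        deep = resnet(batch).mean(dim=0).numpy()
    return np.concatenate([nss, deep])  # 4 + 512 fused features

# A shallow regressor maps the fused features to a quality score; it would be
# fit on feature vectors paired with subjective opinion scores.
regressor = SVR(kernel="rbf")
```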
arXiv Detail & Related papers (2021-01-26T17:23:46Z)
- Patch-VQ: 'Patching Up' the Video Quality Problem [0.9786690381850356]
No-reference (NR) perceptual video quality assessment (VQA) is a complex, unsolved, and important problem for social and streaming media applications.
Current NR models are limited in their prediction capabilities on real-world, "in-the-wild" video data.
We create the largest (by far) subjective video quality dataset, containing 39,000 real-world distorted videos and 117,000 space-time localized video patches.
arXiv Detail & Related papers (2020-11-27T03:46:44Z)
- Sound2Sight: Generating Visual Dynamics from Sound and Context [36.38300120482868]
We present Sound2Sight, a deep variational framework that is trained to learn a per-frame prior conditioned on a joint embedding of audio and past frames.
To improve the quality and coherence of the generated frames, we propose a multimodal discriminator.
Our experiments demonstrate that Sound2Sight significantly outperforms the state of the art in the generated video quality.
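A very loose sketch of the conditioning idea described above, i.e. a per-frame Gaussian prior parameterized by a joint embedding of audio and past-frame features (not the authors' architecture; all dimensions and names are assumptions):

```python
# Illustrative sketch only: a conditional per-frame prior whose mean and
# variance are predicted from a joint embedding of audio and past frames.
import torch
import torch.nn as nn

class ConditionalFramePrior(nn.Module):
    def __init__(self, frame_dim=256, audio_dim=64, latent_dim=32):
        super().__init__()
        self.joint = nn.Sequential(
            nn.Linear(frame_dim + audio_dim, 128), nn.ReLU(),
        )
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)

    def forward(self, past_frame_emb, audio_emb):
        h = self.joint(torch.cat([past_frame_emb, audio_emb], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterized sample would drive a next-frame decoder (not shown).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

prior = ConditionalFramePrior()
z, mu, logvar = prior(torch.randn(4, 256), torch.randn(4, 64))
```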
arXiv Detail & Related papers (2020-07-23T16:57:44Z)
- NAViDAd: A No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder [0.0]
We propose a No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder (NAViDAd)
The model is formed by a 2-layer framework that includes a deep autoencoder layer and a classification layer.
The model performed well when tested against the UnB-AV and the LiveNetflix-II databases.
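As a hedged illustration of such a two-stage design, i.e. an autoencoder bottleneck feeding a classification head (not the NAViDAd code; feature dimensions and the five quality levels are assumptions):

```python
# Illustrative sketch: an autoencoder compresses audio-visual features, and a
# classifier on the bottleneck predicts a quality class.
import torch
import torch.nn as nn

class AutoencoderQualityModel(nn.Module):
    def __init__(self, feat_dim=512, bottleneck=32, num_quality_levels=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                     nn.Linear(128, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 128), nn.ReLU(),
                                     nn.Linear(128, feat_dim))
        self.classifier = nn.Linear(bottleneck, num_quality_levels)

    def forward(self, av_features):
        code = self.encoder(av_features)
        reconstruction = self.decoder(code)     # trained with reconstruction loss
        quality_logits = self.classifier(code)  # trained with classification loss
        return reconstruction, quality_logits

model = AutoencoderQualityModel()
recon, logits = model(torch.randn(8, 512))
```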
arXiv Detail & Related papers (2020-01-30T15:40:08Z)