LSTM-based Video Quality Prediction Accounting for Temporal Distortions
in Videoconferencing Calls
- URL: http://arxiv.org/abs/2303.12761v1
- Date: Wed, 22 Mar 2023 17:14:38 GMT
- Title: LSTM-based Video Quality Prediction Accounting for Temporal Distortions
in Videoconferencing Calls
- Authors: Gabriel Mittag, Babak Naderi, Vishak Gopal, Ross Cutler
- Abstract summary: We present a data-driven approach for modeling such distortions automatically by training an LSTM with subjective quality ratings labeled via crowdsourcing.
We applied QR codes as markers on the source videos to create aligned references and compute temporal features based on the alignment vectors.
Our proposed model achieves a PCC of 0.99 on the validation set and gives detailed insight into the cause of video quality impairments.
- Score: 22.579711841384764
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current state-of-the-art video quality models, such as VMAF, give excellent
prediction results by comparing the degraded video with its reference video.
However, they do not consider temporal distortions (e.g., frame freezes or
skips) that occur during videoconferencing calls. In this paper, we present a
data-driven approach for modeling such distortions automatically by training an
LSTM with subjective quality ratings labeled via crowdsourcing. The videos were
collected from live videoconferencing calls in 83 different network conditions.
We applied QR codes as markers on the source videos to create aligned
references and compute temporal features based on the alignment vectors. Using
these features together with VMAF core features, our proposed model achieves a
PCC of 0.99 on the validation set. Furthermore, our model outputs per-frame
quality that gives detailed insight into the cause of video quality
impairments. The VCM model and dataset are open-sourced at
https://github.com/microsoft/Video_Call_MOS.
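As a rough, hedged illustration of the approach the abstract describes, the PyTorch sketch below derives simple freeze/skip features from a frame-alignment vector and feeds them, together with per-frame VMAF core features, into an LSTM that outputs per-frame quality and an overall score. The feature set, dimensions, network size, and pooling scheme are illustrative assumptions, not the released VCM implementation.

```python
# Minimal sketch (PyTorch), assuming per-frame VMAF core features and a
# frame-alignment vector are already available. Names, dimensions, and the
# pooling scheme are illustrative assumptions, not the released VCM model.
import torch
import torch.nn as nn


def temporal_features(alignment, fps=30.0):
    """Derive simple temporal-distortion features from an alignment vector.

    `alignment[i]` is the reference frame index matched to degraded frame i
    (e.g., recovered from QR-code markers). A repeated index suggests a
    freeze; a jump larger than one suggests skipped frames.
    """
    alignment = torch.as_tensor(alignment, dtype=torch.float32)
    diff = torch.diff(alignment, prepend=alignment[:1] - 1.0)
    freeze = (diff == 0).float()               # repeated frame -> freeze flag
    skip = torch.clamp(diff - 1.0, min=0.0)    # frames jumped over -> skip count
    return torch.stack([freeze, skip / fps], dim=-1)   # (T, 2)


class VideoCallQualityLSTM(nn.Module):
    """LSTM mapping per-frame features to per-frame quality and an overall score."""

    def __init__(self, vmaf_dim=6, temporal_dim=2, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(vmaf_dim + temporal_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.frame_head = nn.Linear(2 * hidden, 1)   # per-frame quality score

    def forward(self, vmaf_feats, temp_feats):
        x = torch.cat([vmaf_feats, temp_feats], dim=-1)   # (B, T, D)
        h, _ = self.lstm(x)
        per_frame = self.frame_head(h).squeeze(-1)        # (B, T)
        overall = per_frame.mean(dim=1)                   # simple temporal pooling
        return per_frame, overall


# Example: one 10-second clip at 30 fps with 6 VMAF core features per frame.
T = 300
vmaf_feats = torch.randn(1, T, 6)
align = torch.arange(T)                        # identity alignment = no freezes/skips
temp_feats = temporal_features(align).unsqueeze(0)
per_frame, overall = VideoCallQualityLSTM()(vmaf_feats, temp_feats)
```

In the setting the abstract describes, the alignment vector would be recovered by decoding the QR-code markers in each degraded frame, and the model would be trained against the crowdsourced subjective quality ratings.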
Related papers
- Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors [54.8852848659663]
Buffer Anytime is a framework for estimation of depth and normal maps (which we call geometric buffers) from video.
We demonstrate high-quality video buffer estimation by leveraging single-image priors with temporal consistency constraints.
arXiv Detail & Related papers (2024-11-26T09:28:32Z) - xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
arXiv Detail & Related papers (2024-08-22T17:55:22Z) - VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos.
Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models.
We evaluate nine existing Video-LMMs, both open-source and closed-source, on this benchmark task and find that most models struggle to identify these subtle anomalies effectively.
arXiv Detail & Related papers (2024-06-14T17:59:01Z) - CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z) - Video Demoireing with Relation-Based Temporal Consistency [68.20281109859998]
Moire patterns, which appear as color distortions, severely degrade image and video quality when a screen is filmed with a digital camera.
We study how to remove such undesirable moire patterns in videos, namely video demoireing.
arXiv Detail & Related papers (2022-04-06T17:45:38Z) - ChipQA: No-Reference Video Quality Prediction via Space-Time Chips [33.12375264668551]
We propose a new model for no-reference video quality assessment (VQA).
Our approach uses a new idea of highly localized space-time slices called Space-Time Chips (ST Chips).
We show that our model achieves state-of-the-art performance at reduced cost, without requiring motion computation.
arXiv Detail & Related papers (2021-09-17T19:16:31Z) - Speech Prediction in Silent Videos using Variational Autoencoders [29.423462898526605]
We present a model for generating speech in a silent video.
The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the auditory features given the silent video.
We demonstrate the performance of our model on the GRID dataset based on standard benchmarks.
arXiv Detail & Related papers (2020-11-14T17:09:03Z) - Noisy-LSTM: Improving Temporal Awareness for Video Semantic Segmentation [29.00635219317848]
This paper presents a new model named Noisy-LSTM, which is trainable in an end-to-end manner.
We also present a simple yet effective training strategy that replaces frames in the video sequence with noise.
arXiv Detail & Related papers (2020-10-19T13:08:15Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass the more relevant information to the classifier.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z) - NAViDAd: A No-Reference Audio-Visual Quality Metric Based on a Deep
Autoencoder [0.0]
We propose a No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder (NAViDAd).
The model consists of a two-layer framework comprising a deep autoencoder layer and a classification layer.
The model performed well when tested against the UnB-AV and the LiveNetflix-II databases.
arXiv Detail & Related papers (2020-01-30T15:40:08Z)