DisCoVQA: Temporal Distortion-Content Transformers for Video Quality
Assessment
- URL: http://arxiv.org/abs/2206.09853v1
- Date: Mon, 20 Jun 2022 15:31:27 GMT
- Title: DisCoVQA: Temporal Distortion-Content Transformers for Video Quality
Assessment
- Authors: Haoning Wu, Chaofeng Chen, Liang Liao, Jingwen Hou, Wenxiu Sun, Qiong
Yan, Weisi Lin
- Abstract summary: Some temporal variations cause temporal distortions and lead to extra quality degradation.
The human visual system often pays different attention to frames with different contents.
We propose a novel and effective transformer-based VQA method to tackle these two issues.
- Score: 56.42140467085586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The temporal relationships between frames and their influence on video
quality assessment (VQA) are still under-studied in existing works. These
relationships lead to two important types of effects on video quality.
Firstly, some temporal variations (such as shaking, flicker, and abrupt scene
transitions) cause temporal distortions and lead to extra quality
degradation, while other variations (e.g. those related to meaningful
happenings) do not. Secondly, the human visual system often pays different
attention to frames with different contents, giving them different
importance to the overall video quality. Based on the prominent time-series
modeling ability of transformers, we propose a novel and effective
transformer-based VQA method to tackle these two issues. To better
differentiate temporal variations and thus capture the temporal distortions, we
design a transformer-based Spatial-Temporal Distortion Extraction (STDE)
module. To model temporal quality attention, we propose the
encoder-decoder-like Temporal Content Transformer (TCT). We also introduce
temporal sampling on features to reduce the input length for the TCT, so as to
improve the learning effectiveness and efficiency of this module. Comprising
the STDE and the TCT, the proposed Temporal Distortion-Content Transformers for
Video Quality Assessment (DisCoVQA) reaches state-of-the-art performance on
several VQA benchmarks without any extra pre-training datasets and achieves up
to 10% better generalization ability than existing methods. We also conduct
extensive ablation experiments to demonstrate the effectiveness of each part of
the proposed model, and provide visualizations showing that the proposed
modules model these temporal issues as intended. We will publish our code and
pretrained weights later.
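The abstract above outlines a pipeline: an STDE module extracts distortion-aware spatial-temporal features, temporal sampling shortens the feature sequence, and an encoder-decoder TCT weights frame contents before a single quality score is regressed. The sketch below is only an illustrative reading of that description under stated assumptions, not the authors' released implementation: the frame-difference encoding, the use of stock nn.TransformerEncoder/nn.TransformerDecoder layers, the layer counts, the feature dimension, and the learnable quality query are all assumptions.

```python
# Illustrative sketch of a DisCoVQA-style pipeline (NOT the authors' code).
# Assumptions: per-frame features come from any 2D backbone; the STDE is
# approximated by a temporal transformer encoder over frame-difference
# features; the TCT is approximated by a transformer decoder that attends
# over temporally sub-sampled content features.
import torch
import torch.nn as nn


class STDELike(nn.Module):
    """Spatial-Temporal Distortion Extraction stand-in: encodes how frame
    features change between neighbouring frames."""
    def __init__(self, dim=768, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):                        # (B, T, dim)
        diffs = frame_feats[:, 1:] - frame_feats[:, :-1]   # temporal variation
        diffs = torch.cat([diffs[:, :1], diffs], dim=1)    # keep length T
        return self.encoder(diffs)                         # distortion-aware features


class TCTLike(nn.Module):
    """Temporal Content Transformer stand-in: encoder-decoder attention that
    re-weights frames by content before regressing one quality score."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads,
                                               batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # learnable quality query
        self.head = nn.Linear(dim, 1)

    def forward(self, content_feats):                      # (B, T', dim)
        q = self.query.expand(content_feats.size(0), -1, -1)
        pooled = self.decoder(q, content_feats)            # (B, 1, dim)
        return self.head(pooled).squeeze(-1).squeeze(-1)   # (B,) quality scores


def temporal_sample(feats, stride=4):
    """Hypothetical temporal sampling: keep every `stride`-th frame feature
    to shorten the sequence fed to the TCT."""
    return feats[:, ::stride]


if __name__ == "__main__":
    B, T, D = 2, 64, 768
    frame_feats = torch.randn(B, T, D)      # stand-in for backbone features
    distortion = STDELike(D)(frame_feats)
    score = TCTLike(D)(temporal_sample(distortion))
    print(score.shape)                      # torch.Size([2])
```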
Related papers
- Video Quality Assessment Based on Swin TransformerV2 and Coarse to Fine
Strategy [16.436012370209845]
The objective of non-reference video quality assessment is to evaluate the quality of a distorted video without access to high-definition references.
In this study, we introduce an enhanced spatial perception module, pre-trained on multiple image quality assessment datasets, and a lightweight temporal fusion module.
arXiv Detail & Related papers (2024-01-16T17:33:54Z) - Capturing Co-existing Distortions in User-Generated Content for
No-reference Video Quality Assessment [9.883856205077022]
Video Quality Assessment (VQA) aims to predict the perceptual quality of a video.
VQA faces two under-estimated challenges unresolved in User Generated Content (UGC) videos.
We propose the Visual Quality Transformer (VQT) to extract quality-related sparse features more efficiently.
arXiv Detail & Related papers (2023-07-31T16:29:29Z) - Saliency-Aware Spatio-Temporal Artifact Detection for Compressed Video
Quality Assessment [16.49357671290058]
Compressed videos often exhibit visually annoying artifacts, known as Perceivable Encoding Artifacts (PEAs).
In this paper, we investigate the influence of four spatial PEAs (i.e. blurring, blocking, bleeding, and ringing) and two temporal PEAs (i.e. flickering and floating) on video quality.
Based on the six types of PEAs, a quality metric called Saliency-Aware Spatio-Temporal Artifacts Measurement (SSTAM) is proposed.
arXiv Detail & Related papers (2023-01-03T12:48:27Z) - End-to-end Transformer for Compressed Video Quality Enhancement [21.967066471073462]
We propose a transformer-based compressed video quality enhancement (TVQE) method, consisting of a Swin-AutoEncoder-based Spatio-Temporal feature Fusion (SSTF) module and a Channel-wise Attention-based Quality Enhancement (CAQE) module.
Our proposed method outperforms existing ones in terms of both inference speed and GPU consumption.
arXiv Detail & Related papers (2022-10-25T08:12:05Z) - Neighbourhood Representative Sampling for Efficient End-to-end Video
Quality Assessment [60.57703721744873]
The increased resolution of real-world videos presents a dilemma between efficiency and accuracy for deep Video Quality Assessment (VQA).
In this work, we propose a unified scheme, spatial-temporal grid mini-cube sampling (St-GMS), to obtain a novel type of sample, named fragments (a rough sampling sketch based on this description appears after this list).
With fragments and FANet, the proposed efficient end-to-end FAST-VQA and FasterVQA achieve significantly better performance than existing approaches on all VQA benchmarks.
arXiv Detail & Related papers (2022-10-11T11:38:07Z) - Exploring the Effectiveness of Video Perceptual Representation in Blind
Video Quality Assessment [55.65173181828863]
We propose a temporal perceptual quality index (TPQI) to measure the temporal distortion by describing the graphic morphology of the representation.
Experiments show that TPQI is an effective way of predicting subjective temporal quality.
arXiv Detail & Related papers (2022-07-08T07:30:51Z) - Making Video Quality Assessment Models Sensitive to Frame Rate
Distortions [63.749184706461826]
We consider the problem of capturing distortions arising from changes in frame rate as part of Video Quality Assessment (VQA).
We propose a simple fusion framework, whereby temporal features from GREED are combined with existing VQA models.
Our results suggest that employing efficient temporal representations can result in much more robust and accurate VQA models.
arXiv Detail & Related papers (2022-05-21T04:13:57Z) - Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z) - ST-GREED: Space-Time Generalized Entropic Differences for Frame Rate
Dependent Video Quality Prediction [63.749184706461826]
We study how perceptual quality is affected by frame rate, and how frame rate and compression combine to affect perceived quality.
We devise an objective VQA model called Space-Time GeneRalized Entropic Difference (GREED) which analyzes the statistics of spatial and temporal band-pass video coefficients.
GREED achieves state-of-the-art performance on the LIVE-YT-HFR Database when compared with existing VQA models.
arXiv Detail & Related papers (2020-10-26T16:54:33Z)