StarVQA: Space-Time Attention for Video Quality Assessment
- URL: http://arxiv.org/abs/2108.09635v1
- Date: Sun, 22 Aug 2021 04:53:02 GMT
- Title: StarVQA: Space-Time Attention for Video Quality Assessment
- Authors: Fengchuang Xing, Yuan-Gen Wang, Hanpin Wang, Leida Li, and Guopu Zhu
- Abstract summary: Evaluating the quality of in-the-wild videos is challenging due to the absence of a pristine reference and unknown shooting distortions.
This paper presents a novel space-time attention network for the VQA problem, named StarVQA.
- Score: 28.3487798060932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The attention mechanism is blooming in computer vision nowadays. However, its
application to video quality assessment (VQA) has not been reported. Evaluating
the quality of in-the-wild videos is challenging due to the absence of a pristine
reference and unknown shooting distortions. This paper presents a novel
\underline{s}pace-\underline{t}ime \underline{a}ttention network
fo\underline{r} the \underline{VQA} problem, named StarVQA. StarVQA builds a
Transformer by alternately concatenating the divided space-time attention. To
adapt the Transformer architecture for training, StarVQA designs a vectorized
regression loss by encoding the mean opinion score (MOS) to the probability
vector and embedding a special vectorized label token as the learnable
variable. To capture the long-range spatiotemporal dependencies of a video
sequence, StarVQA encodes the space-time position information of each patch to
the input of the Transformer. Various experiments are conducted on the de-facto
in-the-wild video datasets, including LIVE-VQC, KoNViD-1k, LSVQ, and
LSVQ-1080p. Experimental results demonstrate the superiority of the proposed
StarVQA over the state-of-the-art. Code and model will be available at:
https://github.com/DVL/StarVQA.
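
The vectorized regression design described in the abstract can be illustrated with a small sketch. The Python/PyTorch snippet below encodes a MOS label into a probability vector over a fixed set of anchor quality levels, trains against it with a soft cross-entropy on the label token's output, and decodes a scalar score back as an expectation. The anchor levels, the softmax-of-negative-distance encoding, and the decoding by expectation are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a vectorized regression loss for VQA (assumptions noted above).
import torch
import torch.nn.functional as F

NUM_ANCHORS = 5                                  # e.g. quality levels 1..5 (assumption)
anchors = torch.linspace(1.0, 5.0, NUM_ANCHORS)  # anchor quality levels

def encode_mos(mos: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Map a batch of MOS values (B,) to probability vectors (B, NUM_ANCHORS)."""
    dist = (mos.unsqueeze(1) - anchors.unsqueeze(0)).abs()   # distance to each anchor
    return F.softmax(-dist / temperature, dim=1)             # closer anchors get more mass

def decode_score(p: torch.Tensor) -> torch.Tensor:
    """Recover a scalar quality score as the expectation over anchor levels."""
    return (p * anchors).sum(dim=1)

def vectorized_regression_loss(pred_logits: torch.Tensor, mos: torch.Tensor) -> torch.Tensor:
    """Soft cross-entropy between the label token's prediction and the encoded MOS."""
    target = encode_mos(mos)
    log_p = F.log_softmax(pred_logits, dim=1)
    return -(target * log_p).sum(dim=1).mean()

# Usage: `pred_logits` would come from a linear head on the special label token
# that the Transformer carries alongside the space-time patch embeddings.
mos = torch.tensor([2.7, 4.1])
pred_logits = torch.randn(2, NUM_ANCHORS, requires_grad=True)
loss = vectorized_regression_loss(pred_logits, mos)
print(loss.item(), decode_score(F.softmax(pred_logits, dim=1)))
```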
Related papers
- Can I Trust Your Answer? Visually Grounded Video Question Answering [88.11169242115416]
We study visually grounded VideoQA in response to the emerging trends of utilizing pretraining techniques for video-language understanding.
We construct NExT-GQA -- an extension of NExT-QA with 10.5K temporal grounding labels tied to the original QA pairs.
arXiv Detail & Related papers (2023-09-04T03:06:04Z)
- Capturing Co-existing Distortions in User-Generated Content for No-reference Video Quality Assessment [9.883856205077022]
Video Quality Assessment (VQA) aims to predict the perceptual quality of a video.
VQA faces two underestimated challenges that remain unresolved for User-Generated Content (UGC) videos.
We propose Visual Quality Transformer (VQT) to extract quality-related sparse features more efficiently.
arXiv Detail & Related papers (2023-07-31T16:29:29Z)
- StarVQA+: Co-training Space-Time Attention for Video Quality Assessment [56.548364244708715]
Self-attention based Transformer has achieved great success in many computer vision tasks.
However, its application to video quality assessment (VQA) has not been satisfactory so far.
This paper presents a co-trained Space-Time Attention network for the VQA problem, termed StarVQA+.
arXiv Detail & Related papers (2023-06-21T14:27:31Z)
- Contrastive Video Question Answering via Video Graph Transformer [184.3679515511028]
We propose a Video Graph Transformer model (CoVGT) to perform video question answering (VideoQA) in a contrastive manner.
CoVGT's uniqueness and superiority are three-fold.
We show that CoVGT achieves much better performance than previous arts on video reasoning tasks.
arXiv Detail & Related papers (2023-02-27T11:09:13Z)
- DCVQE: A Hierarchical Transformer for Video Quality Assessment [3.700565386929641]
We propose a Divide and Conquer Video Quality Estimator (DCVQE) for NR-VQA.
We call this hierarchical combination of Transformers a Divide and Conquer Transformer (DCTr) layer.
Taking the order relationship among the annotated data into account, we also propose a novel correlation loss term for model training (one possible form is sketched below).
arXiv Detail & Related papers (2022-10-10T00:22:16Z)
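
The correlation loss above is only named in the summary; DCVQE's exact formulation is not given here. As one plausible, hedged instance of an order-aware training term, the sketch below uses a differentiable Pearson-correlation (PLCC) loss computed over a batch of predictions and MOS labels.

```python
# Illustrative order-aware correlation loss (an assumption, not DCVQE's exact term).
import torch

def correlation_loss(pred: torch.Tensor, mos: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """1 - PLCC between predictions and labels; lower is better, sensitive to ordering."""
    p = pred - pred.mean()
    m = mos - mos.mean()
    plcc = (p * m).sum() / (p.norm() * m.norm() + eps)
    return 1.0 - plcc

# Typically combined with a pointwise term, e.g. L1, so the absolute scale is kept.
pred = torch.tensor([3.1, 2.4, 4.0], requires_grad=True)
mos = torch.tensor([3.5, 2.0, 4.2])
loss = correlation_loss(pred, mos) + torch.nn.functional.l1_loss(pred, mos)
loss.backward()
```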
- FAST-VQA: Efficient End-to-end Video Quality Assessment with Fragment Sampling [54.31355080688127]
Current deep video quality assessment (VQA) methods usually incur high computational costs when evaluating high-resolution videos.
We propose Grid Mini-patch Sampling (GMS), which allows consideration of local quality by sampling patches at their raw resolution.
We build the Fragment Attention Network (FANet) specially designed to accommodate fragments as inputs.
FAST-VQA improves state-of-the-art accuracy by around 10% while reducing FLOPs by 99.5% on 1080p high-resolution videos (a rough sketch of GMS follows below).
arXiv Detail & Related papers (2022-07-06T11:11:43Z)
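
As a rough illustration of the Grid Mini-patch Sampling idea summarized above, the sketch below cuts one raw-resolution patch from each cell of a uniform grid and splices the patches into a single small "fragment" image, so local quality cues survive instead of being averaged away by full-frame resizing. The grid size, patch size, and random offsets are illustrative choices, not FAST-VQA's actual settings.

```python
# Illustrative grid mini-patch sampling for one frame (assumptions noted above).
import numpy as np

def grid_mini_patch_sampling(frame: np.ndarray, grid: int = 7, patch: int = 32, rng=None) -> np.ndarray:
    """frame: (H, W, C) array. Returns a (grid*patch, grid*patch, C) fragment image."""
    rng = rng or np.random.default_rng()
    H, W, C = frame.shape
    cell_h, cell_w = H // grid, W // grid
    assert cell_h >= patch and cell_w >= patch, "each grid cell must be at least patch pixels"
    out = np.zeros((grid * patch, grid * patch, C), dtype=frame.dtype)
    for i in range(grid):
        for j in range(grid):
            # random top-left corner of a raw-resolution patch inside grid cell (i, j)
            y = i * cell_h + rng.integers(0, max(cell_h - patch, 1))
            x = j * cell_w + rng.integers(0, max(cell_w - patch, 1))
            out[i*patch:(i+1)*patch, j*patch:(j+1)*patch] = frame[y:y+patch, x:x+patch]
    return out

# Usage: fragments from several frames would then be fed to a fragment-aware network.
fragment = grid_mini_patch_sampling(np.zeros((1080, 1920, 3), dtype=np.uint8))
print(fragment.shape)  # (224, 224, 3) with the defaults above
```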
- Video Super-Resolution Transformer [85.11270760456826]
Video super-resolution (VSR), which aims to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem.
Recently, Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling.
In this paper, we present a spatial-temporal convolutional self-attention layer with a theoretical understanding to exploit the locality information.
arXiv Detail & Related papers (2021-06-12T20:00:32Z)
- End-to-End Video Question-Answer Generation with Generator-Pretester Network [27.31969951281815]
We study a novel task, Video Question-Answer Generation (VQAG), for the challenging Video Question Answering (Video QA) task in multimedia.
As captions neither fully represent a video nor are always available in practice, it is crucial to generate question-answer pairs from a video via VQAG.
We evaluate our system on the only two available large-scale human-annotated Video QA datasets and achieve state-of-the-art question generation performance.
arXiv Detail & Related papers (2021-01-05T10:46:06Z)
- UGC-VQA: Benchmarking Blind Video Quality Assessment for User Generated Content [59.13821614689478]
Blind quality prediction of in-the-wild videos is quite challenging, since the quality degradations of content are unpredictable, complicated, and often commingled.
Here we contribute to advancing the problem by conducting a comprehensive evaluation of leading VQA models.
By employing a feature selection strategy on top of leading VQA model features, we are able to extract 60 of the 763 statistical features used by the leading models.
Our experimental results show that VIDEVAL achieves state-of-the-art performance at considerably lower computational cost than other leading models (a rough sketch of this recipe follows below).
arXiv Detail & Related papers (2020-05-29T00:39:20Z)
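
The select-then-regress recipe summarized above can be illustrated as follows: pool a large set of statistical features from existing blind VQA models, keep a small subset, and fit a regressor mapping the selected features to MOS. The use of scikit-learn's SelectKBest with an SVR regressor is an assumption for illustration, not necessarily VIDEVAL's actual selection procedure or regressor.

```python
# Illustrative feature-selection-plus-regression pipeline (assumptions noted above).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X = np.random.rand(200, 763)     # 200 videos x 763 pooled statistical features (toy data)
y = np.random.rand(200) * 4 + 1  # toy MOS labels in [1, 5]

model = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=60),  # keep the 60 most predictive features
    SVR(kernel="rbf", C=10.0),                   # regress the selected features onto MOS
)
model.fit(X, y)
print(model.predict(X[:3]))
```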
This list is automatically generated from the titles and abstracts of the papers on this site.