Video Quality Assessment Based on Swin TransformerV2 and Coarse to Fine Strategy
- URL: http://arxiv.org/abs/2401.08522v1
- Date: Tue, 16 Jan 2024 17:33:54 GMT
- Title: Video Quality Assessment Based on Swin TransformerV2 and Coarse to Fine Strategy
- Authors: Zihao Yu, Fengbin Guan, Yiting Lu, Xin Li, Zhibo Chen
- Abstract summary: The objective of non-reference video quality assessment is to evaluate the quality of distorted video without access to high-definition references.
In this study, we introduce an enhanced spatial perception module, pre-trained on multiple image quality assessment datasets, and a lightweight temporal fusion module.
- Score: 16.436012370209845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The objective of non-reference video quality assessment is to evaluate the quality of distorted video without access to high-definition references. In this study, we introduce an enhanced spatial perception module,
pre-trained on multiple image quality assessment datasets, and a lightweight
temporal fusion module to address the no-reference visual quality assessment
(NR-VQA) task. This model implements Swin Transformer V2 as a local-level
spatial feature extractor and fuses these multi-stage representations through a
series of transformer layers. Furthermore, a temporal transformer is utilized
for spatiotemporal feature fusion across the video. To accommodate compressed
videos of varying bitrates, we incorporate a coarse-to-fine contrastive
strategy to enrich the model's capability to discriminate features from videos
of different bitrates. This is an expanded version of the one-page abstract.
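
No code accompanies this abstract; the sketch below is only an illustrative reading of the pipeline it describes, with a tiny convolutional stand-in where the paper uses a pre-trained Swin Transformer V2, and a generic InfoNCE-style term standing in for the coarse-to-fine contrastive strategy. All module names, layer counts, and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinySpatialBackbone(nn.Module):
    """Placeholder for the Swin Transformer V2 multi-stage feature extractor."""

    def __init__(self, dim=64, stages=3):
        super().__init__()
        blocks, in_ch = [], 3
        for _ in range(stages):
            blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, dim, 3, stride=2, padding=1), nn.GELU()))
            in_ch = dim
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):                        # x: (B*T, 3, H, W)
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x.mean(dim=(2, 3)))     # global-pool each stage
        return feats                             # list of (B*T, dim) tensors


class NRVQASketch(nn.Module):
    """Per-frame multi-stage features -> stage fusion -> temporal fusion -> score."""

    def __init__(self, dim=64, stages=3, heads=4):
        super().__init__()
        self.backbone = TinySpatialBackbone(dim, stages)
        stage_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        temp_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.stage_fusion = nn.TransformerEncoder(stage_layer, num_layers=2)
        self.temporal_fusion = nn.TransformerEncoder(temp_layer, num_layers=2)
        self.head = nn.Linear(dim, 1)

    def forward(self, video):                    # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)
        stage_feats = torch.stack(self.backbone(frames), dim=1)   # (B*T, S, dim)
        per_frame = self.stage_fusion(stage_feats).mean(dim=1)    # (B*T, dim)
        per_frame = per_frame.view(b, t, -1)                      # (B, T, dim)
        clip = self.temporal_fusion(per_frame).mean(dim=1)        # (B, dim)
        return self.head(clip).squeeze(-1), clip                  # score, clip embedding


def bitrate_contrastive_loss(clip_emb, bitrate_level, tau=0.1):
    """InfoNCE-style stand-in for the coarse-to-fine contrastive term:
    clips sharing a bitrate level attract, clips at other levels repel."""
    emb = F.normalize(clip_emb, dim=-1)
    sim = emb @ emb.t() / tau
    sim = sim - torch.eye(len(emb), device=emb.device) * 1e9      # mask self-pairs
    positive = bitrate_level[:, None].eq(bitrate_level[None, :]).float()
    positive.fill_diagonal_(0)
    log_prob = sim.log_softmax(dim=1)
    return -(log_prob * positive).sum(1).div(positive.sum(1).clamp(min=1)).mean()


# Example: two 8-frame clips at different bitrate levels.
scores, emb = NRVQASketch()(torch.randn(2, 8, 3, 64, 64))
loss = bitrate_contrastive_loss(emb, torch.tensor([0, 1]))
```

In the paper the spatial module is pre-trained on multiple image quality assessment datasets; the stand-in backbone above is randomly initialized purely for illustration.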
Related papers
- Modular Blind Video Quality Assessment [33.657933680973194]
Blind video quality assessment (BVQA) plays a pivotal role in evaluating and improving the viewing experience of end-users across a wide range of video-based platforms and services.
In this paper, we propose a modular BVQA model and a method of training it to improve its modularity.
arXiv Detail & Related papers (2024-02-29T15:44:00Z) - Corner-to-Center Long-range Context Model for Efficient Learned Image
Compression [70.0411436929495]
In the framework of learned image compression, the context model plays a pivotal role in capturing the dependencies among latent representations.
We propose the Corner-to-Center transformer-based Context Model (C$^3$M) designed to enhance context and latent predictions.
In addition, to enlarge the receptive field in the analysis and synthesis transformation, we use the Long-range Crossing Attention Module (LCAM) in the encoder/decoder.
arXiv Detail & Related papers (2023-11-29T21:40:28Z) - Progressive Learning with Visual Prompt Tuning for Variable-Rate Image
Compression [60.689646881479064]
We propose a progressive learning paradigm for transformer-based variable-rate image compression.
Inspired by visual prompt tuning, we use LPM to extract prompts for input images and hidden features at the encoder side and decoder side, respectively.
Our model outperforms all current variable-rate image compression methods in terms of rate-distortion performance and approaches the state-of-the-art fixed-rate image compression methods trained from scratch.
arXiv Detail & Related papers (2023-11-23T08:29:32Z) - Neighbourhood Representative Sampling for Efficient End-to-end Video
Quality Assessment [60.57703721744873]
The increased resolution of real-world videos presents a dilemma between efficiency and accuracy for deep Video Quality Assessment (VQA).
In this work, we propose a unified scheme, spatial-temporal grid mini-cube sampling (St-GMS) to get a novel type of sample, named fragments.
With fragments and FANet, the proposed efficient end-to-end FAST-VQA and FasterVQA achieve significantly better performance than existing approaches on all VQA benchmarks.
arXiv Detail & Related papers (2022-10-11T11:38:07Z) - DCVQE: A Hierarchical Transformer for Video Quality Assessment [3.700565386929641]
We propose a Divide and Conquer Video Quality Estimator (DCVQE) for NR-VQA.
We call this hierarchical combination of Transformers a Divide and Conquer Transformer (DCTr) layer.
Taking the order relationship among the annotated data into account, we also propose a novel correlation loss term for model training (a generic correlation-style loss is sketched after this list).
arXiv Detail & Related papers (2022-10-10T00:22:16Z) - Time-Space Transformers for Video Panoptic Segmentation [3.2489082010225494]
We propose a solution that simultaneously predicts pixel-level semantic and clip-level instance segmentation.
Our network, named VPS-Transformer, combines a convolutional architecture for single-frame panoptic segmentation and a video module based on an instantiation of a pure Transformer block.
arXiv Detail & Related papers (2022-10-07T13:30:11Z) - DisCoVQA: Temporal Distortion-Content Transformers for Video Quality
Assessment [56.42140467085586]
Some temporal variations cause temporal distortions and lead to extra quality degradation.
The human visual system often pays different attention to frames with different contents.
We propose a novel and effective transformer-based VQA method to tackle these two issues.
arXiv Detail & Related papers (2022-06-20T15:31:27Z) - Towards End-to-End Image Compression and Analysis with Transformers [99.50111380056043]
We propose an end-to-end image compression and analysis model with Transformers, targeting cloud-based image classification applications.
We aim to redesign the Vision Transformer (ViT) model to perform image classification from the compressed features and facilitate image compression with the long-term information from the Transformer.
Experimental results demonstrate the effectiveness of the proposed model in both the image compression and the classification tasks.
arXiv Detail & Related papers (2021-12-17T03:28:14Z) - Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video frame interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with the self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation.
In addition, we develop a multi-scale frame scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z) - Learning Generalized Spatial-Temporal Deep Feature Representation for
No-Reference Video Quality Assessment [16.974008463660688]
We propose a no-reference video quality assessment method, aiming to achieve high generalization capability in cross-content, cross-resolution, and cross-frame-rate quality prediction.
In particular, we evaluate the quality of a video by learning effective feature representations in spatial-temporal domain.
Experiments show that our method outperforms the state-of-the-art methods on cross-dataset settings.
arXiv Detail & Related papers (2020-12-27T13:11:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.