A strong baseline for image and video quality assessment
- URL: http://arxiv.org/abs/2111.07104v1
- Date: Sat, 13 Nov 2021 12:24:08 GMT
- Title: A strong baseline for image and video quality assessment
- Authors: Shaoguo Wen, Junle Wang
- Abstract summary: We present a simple yet effective unified model for perceptual quality assessment of image and video.
Our model achieves a comparable performance by applying only one global feature derived from a backbone network.
Based on the architecture proposed, we release the models well trained for three common real-world scenarios.
- Score: 4.73466728067544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we present a simple yet effective unified model for perceptual
quality assessment of image and video. In contrast to existing models which
usually consist of complex network architecture, or rely on the concatenation
of multiple branches of features, our model achieves a comparable performance
by applying only one global feature derived from a backbone network (i.e.
resnet18 in the presented work). Combined with some training tricks, the
proposed model surpasses the current baselines of SOTA models on public and
private datasets. Based on the architecture proposed, we release the models
well trained for three common real-world scenarios: UGC videos in the wild, PGC
videos with compression, Game videos with compression. These three pre-trained
models can be directly applied for quality assessment, or be further fine-tuned
for more customized usages. All the code, SDK, and the pre-trained weights of
the proposed models are publicly available at
https://github.com/Tencent/CenseoQoE.
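As a rough illustration of the single-global-feature design the abstract describes (one global feature from a ResNet-18 backbone fed to a quality regressor), here is a minimal PyTorch sketch. This is not the released CenseoQoE code: the layer split, the 512-dimensional feature size, and the one-unit regression head are assumptions for illustration only.

```python
# Hypothetical sketch of a single-global-feature quality model
# (assumed structure, not the authors' released implementation).
import torch
import torch.nn as nn
from torchvision.models import resnet18

class GlobalFeatureQualityModel(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep the convolutional stages and global average pooling,
        # drop the 1000-way ImageNet classification head.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        # A single linear head maps the 512-d global feature to a quality score.
        self.head = nn.Linear(512, 1)

    def forward(self, x):                       # x: (N, 3, H, W) frames
        feat = self.features(x).flatten(1)      # (N, 512) global feature
        return self.head(feat)                  # (N, 1) predicted quality

model = GlobalFeatureQualityModel()
scores = model(torch.randn(2, 3, 224, 224))     # e.g. two sampled video frames
```

For video input, frame-level predictions of this kind would typically be averaged over sampled frames to produce a clip-level score; the released pre-trained models can instead be used directly or fine-tuned as noted above.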
Related papers
- VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model [34.35449902855767]
Two fundamental questions are what data we use for training and how to ensure multi-view consistency.
We propose a dense consistent multi-view generation model that is fine-tuned from off-the-shelf video generative models.
Our approach can generate 24 dense views and converges much faster in training than state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-18T17:48:15Z) - Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions [94.03133100056372]
Moonshot is a new video generation model that conditions simultaneously on multimodal inputs of image and text.
Model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation and video editing.
arXiv Detail & Related papers (2024-01-03T16:43:47Z) - Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large
Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z) - Composing Ensembles of Pre-trained Models via Iterative Consensus [95.10641301155232]
We propose a unified framework for composing ensembles of different pre-trained models.
We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization.
We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer.
arXiv Detail & Related papers (2022-10-20T18:46:31Z) - Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z) - CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z) - VideoGPT: Video Generation using VQ-VAE and Transformers [75.20543171520565]
VideoGPT is a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos.
VideoGPT uses VQ-VAE to learn downsampled discrete latent representations by employing 3D convolutions and axial self-attention.
Our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset.
arXiv Detail & Related papers (2021-04-20T17:58:03Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z) - Learning Generative Models of Textured 3D Meshes from Real-World Images [26.353307246909417]
We propose a GAN framework for generating textured triangle meshes without relying on such annotations.
We show that the performance of our approach is on par with prior work that relies on ground-truth keypoints.
arXiv Detail & Related papers (2021-03-29T14:07:37Z) - Unified Image and Video Saliency Modeling [21.701431656717112]
We ask: Can image and video saliency modeling be approached via a unified model?
We propose four novel domain adaptation techniques and an improved formulation of learned Gaussian priors.
We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data.
We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and MIT300.
arXiv Detail & Related papers (2020-03-11T18:28:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.