VFHQ: A High-Quality Dataset and Benchmark for Video Face
Super-Resolution
- URL: http://arxiv.org/abs/2205.03409v1
- Date: Fri, 6 May 2022 16:31:57 GMT
- Title: VFHQ: A High-Quality Dataset and Benchmark for Video Face
Super-Resolution
- Authors: Liangbin Xie, Xintao Wang, Honglun Zhang, Chao Dong, Ying Shan
- Abstract summary: We develop an automatic and scalable pipeline to collect a high-quality video face dataset (VFHQ)
VFHQ contains over 16,000 high-fidelity clips of diverse interview scenarios.
We show that the temporal information plays a pivotal role in eliminating video consistency issues.
- Score: 22.236432686296233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most of the existing video face super-resolution (VFSR) methods are trained
and evaluated on VoxCeleb1, which is designed specifically for speaker
identification, and the frames in this dataset are of low quality. As a
consequence, the VFSR models trained on this dataset cannot produce visually
pleasing results. In this paper, we develop an automatic and scalable
pipeline to collect a high-quality video face dataset (VFHQ), which contains
over 16,000 high-fidelity clips of diverse interview scenarios. To verify the
necessity of VFHQ, we further conduct experiments and demonstrate that VFSR
models trained on our VFHQ dataset can generate results with sharper edges and
finer textures than those trained on VoxCeleb1. In addition, we show that the
temporal information plays a pivotal role in eliminating video consistency
issues as well as further improving visual performance. Based on VFHQ, we
further conduct a benchmarking study of several state-of-the-art algorithms
under bicubic and blind settings (the bicubic setting is sketched after the
abstract). See our project page:
https://liangbinxie.github.io/projects/vfhq
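A minimal sketch of the bicubic degradation setting mentioned in the abstract: low-resolution inputs are synthesized by bicubically downsampling the high-quality frames. The x4 scale factor and directory names below are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the "bicubic" VFSR setting: downsample HQ frames with bicubic
# interpolation to obtain LR inputs. Scale factor and paths are assumptions.
import glob
import os

import cv2

SCALE = 4  # assumed x4 super-resolution setting
hq_dir = "clips/clip_0001/hq"    # hypothetical path to high-quality frames
lr_dir = "clips/clip_0001/lr_x4"
os.makedirs(lr_dir, exist_ok=True)

for path in sorted(glob.glob(os.path.join(hq_dir, "*.png"))):
    frame = cv2.imread(path)  # BGR, uint8
    h, w = frame.shape[:2]
    lr = cv2.resize(frame, (w // SCALE, h // SCALE), interpolation=cv2.INTER_CUBIC)
    cv2.imwrite(os.path.join(lr_dir, os.path.basename(path)), lr)
```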
Related papers
- CLIPVQA: Video Quality Assessment via CLIP [56.94085651315878]
We propose an efficient CLIP-based Transformer method for the VQA problem (CLIPVQA).
The proposed CLIPVQA achieves new state-of-the-art VQA performance and up to 37% better generalizability than existing benchmark VQA methods.
arXiv Detail & Related papers (2024-07-06T02:32:28Z)
- Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA).
arXiv Detail & Related papers (2024-03-18T16:52:49Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings (sketched below).
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
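The VaQuitA entry above mentions replacing uniform frame sampling with CLIP-score-guided selection. A minimal sketch of that idea follows, assuming a Hugging Face CLIP checkpoint, a free-form text query, and top-k selection; these are illustrative choices, not details taken from the paper.

```python
# Assumed sketch of CLIP-score-guided frame selection: score every frame against
# a text query with CLIP and keep the top-k frames instead of a uniform stride.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_frames(frames: list[Image.Image], query: str, k: int = 8) -> list[int]:
    """Return indices of the k frames most similar to `query`, in temporal order."""
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[:, 0]  # (num_frames,)
    top = torch.topk(logits, k=min(k, len(frames))).indices
    return sorted(top.tolist())
```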
- Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries [18.224608377111533]
Finding the right sound effects (SFX) to match moments in a video is a difficult and time-consuming task.
We propose a framework for recommending HQ SFX given a video frame (see the sketch below).
We show that our system, trained using our automatic data curation pipeline, significantly outperforms baselines trained on in-the-wild data.
arXiv Detail & Related papers (2023-08-17T16:38:30Z)
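One assumed reading of the language bridge for SFX retrieval (a sketch, not the paper's actual pipeline; the model choice, library descriptions, and scoring are illustrative): embed the query frame and short text descriptions of each sound effect in a shared CLIP space and rank effects by cosine similarity.

```python
# Assumed sketch of text-bridged SFX retrieval: embed the video frame and text
# descriptions of each sound effect with CLIP, then rank SFX by cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical sound-effect library: id -> short text description.
sfx_library = {
    "rain_loop_01": "heavy rain falling on a tin roof",
    "door_creak_02": "old wooden door creaking open",
    "crowd_cheer_03": "large crowd cheering in a stadium",
}

def rank_sfx(frame: Image.Image) -> list[tuple[str, float]]:
    ids, descs = zip(*sfx_library.items())
    with torch.no_grad():
        img = model.get_image_features(**processor(images=frame, return_tensors="pt"))
        txt = model.get_text_features(
            **processor(text=list(descs), return_tensors="pt", padding=True))
    sims = torch.nn.functional.cosine_similarity(img, txt)  # (num_sfx,)
    order = sims.argsort(descending=True)
    return [(ids[i], float(sims[i])) for i in order.tolist()]
```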
- Unmasking Deepfakes: Masked Autoencoding Spatiotemporal Transformers for Enhanced Video Forgery Detection [19.432851794777754]
We present a novel approach for the detection of deepfake videos using a pair of vision transformers pre-trained by a self-supervised masked autoencoding setup.
Our method consists of two components: one learns spatial information from individual RGB frames of the video, while the other learns temporal consistency from optical flow fields computed between consecutive frames (a minimal two-stream sketch follows this entry).
arXiv Detail & Related papers (2023-06-12T05:49:23Z)
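A minimal sketch of the two-stream idea described above, with tiny CNN encoders standing in for the paper's masked-autoencoder-pretrained vision transformers; the encoders, shapes, and fusion head are illustrative assumptions.

```python
# Assumed two-stream sketch: one branch sees RGB frames, the other sees Farneback
# optical flow between consecutive frames; per-frame features are averaged and fused.
import cv2
import numpy as np
import torch
import torch.nn as nn

def farneback_flow(frames: np.ndarray) -> np.ndarray:
    """frames: (T, H, W, 3) uint8 BGR -> flow: (T-1, H, W, 2) float32."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    flows = [
        cv2.calcOpticalFlowFarneback(gray[i], gray[i + 1], None,
                                     0.5, 3, 15, 3, 5, 1.2, 0)
        for i in range(len(gray) - 1)
    ]
    return np.stack(flows).astype(np.float32)

class TwoStreamDetector(nn.Module):
    def __init__(self):
        super().__init__()
        # Spatial branch over 3-channel RGB, temporal branch over 2-channel flow.
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.flow_enc = nn.Sequential(nn.Conv2d(2, 16, 3, 2, 1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(32, 2)  # real vs. fake logits

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # rgb: (T, 3, H, W); flow: (T-1, 2, H, W). Average per-frame features.
        f_rgb = self.rgb_enc(rgb).mean(dim=0)
        f_flow = self.flow_enc(flow).mean(dim=0)
        return self.head(torch.cat([f_rgb, f_flow], dim=-1))
```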
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811]
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods.
Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
arXiv Detail & Related papers (2023-03-28T15:39:28Z)
- Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.
Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction.
We compare VMT with the most recent state-of-the-art methods on the HQ-YTVIS benchmark, as well as on YouTube-VIS, OVIS and BDD100K MOTS.
arXiv Detail & Related papers (2022-07-28T11:13:37Z)
- FAST-VQA: Efficient End-to-end Video Quality Assessment with Fragment Sampling [54.31355080688127]
Current deep video quality assessment (VQA) methods usually incur high computational costs when evaluating high-resolution videos.
We propose Grid Mini-patch Sampling (GMS), which allows consideration of local quality by sampling patches at their raw resolution (sketched below).
We build the Fragment Attention Network (FANet) specially designed to accommodate fragments as inputs.
FAST-VQA improves state-of-the-art accuracy by around 10% while reducing FLOPs by 99.5% on 1080P high-resolution videos.
arXiv Detail & Related papers (2022-07-06T11:11:43Z)
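A toy sketch of Grid Mini-patch Sampling as described above: each frame is split into a uniform grid, one raw-resolution patch is cropped per cell, and the patches are spliced into a compact fragment. Grid size, patch size, and the random cropping below are illustrative assumptions rather than FAST-VQA's exact configuration.

```python
# Toy sketch of grid mini-patch sampling: crop one raw-resolution patch per grid
# cell and splice the patches into a single small "fragment" image.
import numpy as np

def grid_minipatch_sample(frame: np.ndarray, grid: int = 7, patch: int = 32,
                          rng=None) -> np.ndarray:
    """frame: (H, W, C) uint8 -> fragment: (grid*patch, grid*patch, C)."""
    rng = rng or np.random.default_rng()
    h, w, c = frame.shape
    cell_h, cell_w = h // grid, w // grid
    assert cell_h >= patch and cell_w >= patch, "frame too small for this grid/patch"
    frag = np.zeros((grid * patch, grid * patch, c), dtype=frame.dtype)
    for i in range(grid):
        for j in range(grid):
            # Random raw-resolution patch inside cell (i, j) -- no resizing,
            # so local texture/quality cues are preserved.
            y = i * cell_h + rng.integers(0, max(cell_h - patch, 1))
            x = j * cell_w + rng.integers(0, max(cell_w - patch, 1))
            frag[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = \
                frame[y:y + patch, x:x + patch]
    return frag
```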