CelebV-HQ: A Large-Scale Video Facial Attributes Dataset
- URL: http://arxiv.org/abs/2207.12393v1
- Date: Mon, 25 Jul 2022 17:57:07 GMT
- Title: CelebV-HQ: A Large-Scale Video Facial Attributes Dataset
- Authors: Hao Zhu, Wayne Wu, Wentao Zhu, Liming Jiang, Siwei Tang, Li Zhang,
Ziwei Liu, Chen Change Loy
- Abstract summary: CelebV-HQ contains 35,666 video clips with a resolution of at least 512x512, involving 15,653 identities.
We conduct a comprehensive analysis in terms of age, ethnicity, brightness stability, motion smoothness, head pose diversity, and data quality.
Its versatility and potential are validated on two representative tasks, i.e., unconditional video generation and video facial attribute editing.
- Score: 94.31308012569062
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large-scale datasets have played indispensable roles in the recent success of
face generation/editing and significantly facilitated the advances of emerging
research fields. However, the academic community still lacks a video dataset
with diverse facial attribute annotations, which is crucial for the research on
face-related videos. In this work, we propose a large-scale, high-quality, and
diverse video dataset with rich facial attribute annotations, named the
High-Quality Celebrity Video Dataset (CelebV-HQ). CelebV-HQ contains 35,666
video clips with a resolution of at least 512x512, involving 15,653
identities. All clips are manually labeled with 83 facial attributes, covering
appearance, action, and emotion. We conduct a comprehensive analysis in terms
of age, ethnicity, brightness stability, motion smoothness, head pose
diversity, and data quality to demonstrate the diversity and temporal coherence
of CelebV-HQ. Moreover, its versatility and potential are validated on two
representative tasks, i.e., unconditional video generation and video facial
attribute editing. Furthermore, we envision the future potential of CelebV-HQ,
as well as the new opportunities and challenges it would bring to related
research directions. Data, code, and models are publicly available. Project
page: https://celebv-hq.github.io.
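For orientation, here is a minimal sketch of how the per-clip labels over the 83 facial attributes might be consumed. The annotation file name (celebvhq_info.json) and the JSON schema in the comments are assumptions for illustration; the actual release format may differ, so consult the project page.

```python
import json

# Hypothetical annotation file; the schema below is an assumption, e.g.:
# {"clips": {"clip_0001": {"attributes": {"smile": 1, "talking": 0, ...}}}}
with open("celebvhq_info.json") as f:
    meta = json.load(f)

clips = meta["clips"]

def clips_with_attribute(name):
    """Return the ids of clips labeled positive for one of the 83 attributes."""
    return [cid for cid, info in clips.items()
            if info["attributes"].get(name, 0) == 1]

smiling = clips_with_attribute("smile")
print(f"{len(smiling)} of {len(clips)} clips labeled with 'smile'")
```

A filter like this is the typical entry point for the two validation tasks the paper names: selecting attribute-conditioned subsets for video facial attribute editing, or sampling the full set for unconditional video generation.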
Related papers
- FaVChat: Unlocking Fine-Grained Facial Video Understanding with Multimodal Large Language Models [12.029771909598647]
FaVChat is the first VMLLM specifically designed for fine-grained facial video understanding.
We construct a large-scale facial video dataset comprising over 60k videos, with the majority annotated with 83 fine-grained facial attributes.
We employ a progressive training paradigm, transitioning from video summarization to a high-quality subset of video QA, gradually increasing task complexity to enhance the model's fine-grained visual perception.
arXiv Detail & Related papers (2025-03-12T08:33:46Z) - MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation [62.85764872989189]
There is no publicly available dataset tailored for the analysis, evaluation, and training of long video generation models.
We present MovieBench: A Hierarchical Movie-Level dataset for Long Video Generation.
The dataset will be public and continuously maintained, aiming to advance the field of long video generation.
arXiv Detail & Related papers (2024-11-22T10:25:08Z) - FaceVid-1K: A Large-Scale High-Quality Multiracial Human Face Video Dataset [15.917564646478628]
We create a high-quality multiracial face collection named FaceVid-1K.
We conduct experiments with several well-established video generation models, including text-to-video, image-to-video, and unconditional video generation.
We obtain the corresponding performance benchmarks and compare them with those of models trained on public datasets to demonstrate the superiority of our dataset.
arXiv Detail & Related papers (2024-09-23T07:27:02Z) - CelebV-Text: A Large-Scale Facial Text-Video Dataset [91.22496444328151]
CelebV-Text is a large-scale, diverse, and high-quality dataset of facial text-video pairs.
CelebV-Text comprises 70,000 in-the-wild face video clips with diverse visual content, each paired with 20 texts generated using the proposed semi-automatic text generation strategy.
The superiority of CelebV-Text over other datasets is demonstrated via comprehensive statistical analysis of the videos, texts, and text-video relevance.
arXiv Detail & Related papers (2023-03-26T13:06:35Z) - NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory [92.98552727430483]
Narrations-as-Queries (NaQ) is a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model.
NaQ improves multiple top models by substantial margins, even doubling their accuracy.
We also demonstrate unique properties of our approach such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories.
arXiv Detail & Related papers (2023-01-02T16:40:15Z) - MEVID: Multi-view Extended Videos with Identities for Video Person Re-Identification [17.72434646703505]
We present the Multi-view Extended Videos with Identities (MEVID) dataset for large-scale, video person re-identification (ReID) in the wild.
We label the identities of 158 unique people wearing 598 outfits, taken from 8,092 tracklets with an average length of about 590 frames.
Being based on the MEVA video dataset, we also inherit data that is intentionally demographically balanced with respect to the continental United States.
arXiv Detail & Related papers (2022-11-09T03:07:31Z) - Multiface: A Dataset for Neural Face Rendering [108.44505415073579]
In this work, we present Multiface, a new multi-view, high-resolution human face dataset.
We introduce Mugsy, a large scale multi-camera apparatus to capture high-resolution synchronized videos of a facial performance.
The goal of Multiface is to close the gap in accessibility to high quality data in the academic community and to enable research in VR telepresence.
arXiv Detail & Related papers (2022-07-22T17:55:39Z) - Video Person Re-identification using Attribute-enhanced Features [49.68392018281875]
We propose a novel network architecture named Attribute Salience Assisted Network (ASA-Net) for attribute-assisted video person Re-ID.
To learn a better separation of the target from the background, we propose to learn visual attention from middle-level attributes instead of high-level identities.
arXiv Detail & Related papers (2021-08-16T07:41:27Z) - Robust Character Labeling in Movie Videos: Data Resources and Self-supervised Feature Adaptation [39.373699774220775]
We present a dataset of over 169,000 face tracks curated from 240 Hollywood movies with weak labels.
We propose an offline algorithm based on nearest-neighbor search in the embedding space to mine hard-examples from these tracks.
Overall, we find that multiview correlation-based adaptation yields more discriminative and robust face embeddings.
arXiv Detail & Related papers (2020-08-25T22:07:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.