VideoClusterNet: Self-Supervised and Adaptive Face Clustering For Videos
- URL: http://arxiv.org/abs/2407.12214v2
- Date: Wed, 18 Sep 2024 16:18:29 GMT
- Title: VideoClusterNet: Self-Supervised and Adaptive Face Clustering For Videos
- Authors: Devesh Walawalkar, Pablo Garrido
- Abstract summary: Video Face Clustering aims to group together detected video face tracks with common facial identities.
This problem is very challenging due to the large range of pose, expression, appearance, and lighting variations of a given face across video frames.
We present a novel video face clustering approach that learns to adapt a generic face ID model to new video face tracks in a fully self-supervised fashion.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: With the rise of digital media content production, the need for analyzing movies and TV series episodes to locate the main cast of characters precisely is gaining importance. Specifically, Video Face Clustering aims to group together detected video face tracks with common facial identities. This problem is very challenging due to the large range of pose, expression, appearance, and lighting variations of a given face across video frames. Generic pre-trained Face Identification (ID) models fail to adapt well to the video production domain, given its high dynamic range content and also unique cinematic style. Furthermore, traditional clustering algorithms depend on hyperparameters requiring individual tuning across datasets. In this paper, we present a novel video face clustering approach that learns to adapt a generic face ID model to new video face tracks in a fully self-supervised fashion. We also propose a parameter-free clustering algorithm that is capable of automatically adapting to the finetuned model's embedding space for any input video. Due to the lack of comprehensive movie face clustering benchmarks, we also present a first-of-its-kind movie dataset: MovieFaceCluster. Our dataset is handpicked by film industry professionals and contains extremely challenging face ID scenarios. Experiments show our method's effectiveness in handling difficult mainstream movie scenes on our benchmark dataset and state-of-the-art performance on traditional TV series datasets.
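The abstract's parameter-free clustering idea, deriving the merge criterion from the video's own embedding statistics rather than tuning a hyperparameter per dataset, can be illustrated with a minimal sketch. The threshold rule used here (mean plus one standard deviation of pairwise similarities) and all function names are assumptions for illustration, not the paper's actual algorithm.

```python
# Hedged sketch: clustering face-track embeddings with a threshold derived
# automatically from the similarity distribution (illustrative only; the
# paper's parameter-free algorithm is not specified in this summary).
import numpy as np

def cluster_tracks(embeddings: np.ndarray) -> list:
    """Group track embeddings into identity clusters.

    The merge threshold is computed from the off-diagonal pairwise
    cosine-similarity distribution itself (mean + one std), so nothing
    is hand-tuned per video -- an assumed stand-in for the paper's rule.
    """
    n = len(embeddings)
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    off_diag = sims[~np.eye(n, dtype=bool)]
    threshold = off_diag.mean() + off_diag.std()

    # Union-find: merge any two tracks whose similarity exceeds the threshold.
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] > threshold:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

# Two tight groups of synthetic "track embeddings" around distinct directions.
rng = np.random.default_rng(0)
a = rng.normal([5.0, 0.0, 0.0], 0.1, size=(4, 3))
b = rng.normal([0.0, 5.0, 0.0], 0.1, size=(4, 3))
labels = cluster_tracks(np.vstack([a, b]))
print(labels)
```

With two well-separated groups, the within-group similarities sit far above the automatically derived threshold, so the sketch recovers two clusters without any user-set parameter.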
Related papers
- FaceVid-1K: A Large-Scale High-Quality Multiracial Human Face Video Dataset [15.917564646478628]
We create a high-quality multiracial face collection named FaceVid-1K.
We conduct experiments with several well-established video generation models, including text-to-video, image-to-video, and unconditional video generation.
We obtain the corresponding performance benchmarks and compare them with those of models trained on public datasets to demonstrate the superiority of our dataset.
arXiv Detail & Related papers (2024-09-23T07:27:02Z) - Kalman-Inspired Feature Propagation for Video Face Super-Resolution [78.84881180336744]
We introduce a novel framework to maintain a stable face prior over time.
The Kalman filtering principles offer our method a recurrent ability to use the information from previously restored frames to guide and regulate the restoration process of the current frame.
Experiments demonstrate the effectiveness of our method in capturing facial details consistently across video frames.
arXiv Detail & Related papers (2024-08-09T17:57:12Z) - CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects [61.323597069037056]
Current approaches for personalizing text-to-video generation struggle to handle multiple subjects.
We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects.
arXiv Detail & Related papers (2024-01-18T13:23:51Z) - Perceptual Quality Assessment of Face Video Compression: A Benchmark and An Effective Method [69.868145936998]
Generative coding approaches have been identified as promising alternatives with reasonable perceptual rate-distortion trade-offs.
The great diversity of distortion types in the spatial and temporal domains, ranging from traditional hybrid coding frameworks to generative models, presents grand challenges in compressed face video quality assessment (VQA).
We introduce the large-scale Compressed Face Video Quality Assessment (CFVQA) database, which is the first attempt to systematically understand the perceptual quality and diversified compression distortions in face videos.
arXiv Detail & Related papers (2023-04-14T11:26:09Z) - Self-supervised Video-centralised Transformer for Video Face Clustering [58.12996668434134]
This paper presents a novel method for face clustering in videos using a video-centralised transformer.
We release the first large-scale egocentric video face clustering dataset named EasyCom-Clustering.
arXiv Detail & Related papers (2022-03-24T16:38:54Z) - Image-to-Video Generation via 3D Facial Dynamics [78.01476554323179]
We present a versatile model, FaceAnime, for various video generation tasks from still images.
Our model is versatile for various AR/VR and entertainment applications, such as face video prediction.
arXiv Detail & Related papers (2021-05-31T02:30:11Z) - Face, Body, Voice: Video Person-Clustering with Multiple Modalities [85.0282742801264]
Previous methods focus on the narrower task of face-clustering.
Most current datasets evaluate only the task of face-clustering, rather than person-clustering.
We introduce a Video Person-Clustering dataset, for evaluating multi-modal person-clustering.
arXiv Detail & Related papers (2021-05-20T17:59:40Z) - Self-attention aggregation network for video face representation and recognition [0.0]
We propose a new model architecture for video face representation and recognition based on a self-attention mechanism.
Our approach can be applied to videos containing either a single identity or multiple identities.
arXiv Detail & Related papers (2020-10-11T20:57:46Z) - Robust Character Labeling in Movie Videos: Data Resources and Self-supervised Feature Adaptation [39.373699774220775]
We present a dataset of over 169,000 face tracks curated from 240 Hollywood movies with weak labels.
We propose an offline algorithm based on nearest-neighbor search in the embedding space to mine hard-examples from these tracks.
Overall, we find that multiview correlation-based adaptation yields more discriminative and robust face embeddings.
arXiv Detail & Related papers (2020-08-25T22:07:41Z)
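The hard-example mining step mentioned in the entry above, nearest-neighbor search in the embedding space over weakly labeled tracks, can be sketched as follows. This is an illustrative toy, not the paper's offline algorithm; the cosine-distance metric and all names are assumptions.

```python
# Hedged sketch: mining hard negatives via nearest-neighbor search in an
# embedding space (illustrative; the paper's offline miner is more involved).
import numpy as np

def mine_hard_negatives(embeddings: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """For each track, return the index of the closest track carrying a
    different weak identity label -- its 'hard negative' for training."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dists = 1.0 - x @ x.T               # pairwise cosine distance
    hard = np.empty(len(x), dtype=int)
    for i, lab in enumerate(labels):
        d = dists[i].copy()
        d[labels == lab] = np.inf       # mask same-identity tracks (and self)
        hard[i] = int(np.argmin(d))
    return hard

# Four toy track embeddings: two near [1, 0] (identity 0), two near [0, 1]
# (identity 1). Each track's hard negative is the nearest cross-identity track.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
lab = np.array([0, 0, 1, 1])
print(mine_hard_negatives(emb, lab))
```

Such mined pairs are the kind of supervision signal that self-supervised adaptation schemes typically feed to a contrastive or triplet loss.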
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.