VideoClusterNet: Self-Supervised and Adaptive Clustering For Videos
- URL: http://arxiv.org/abs/2407.12214v1
- Date: Tue, 16 Jul 2024 23:34:55 GMT
- Title: VideoClusterNet: Self-Supervised and Adaptive Clustering For Videos
- Authors: Devesh Walawalkar, Pablo Garrido
- Abstract summary: Video Face Clustering aims to group together detected video face tracks with common facial identities.
This problem is very challenging due to the large range of pose, expression, appearance, and lighting variations of a given face across video frames.
We present a novel video face clustering approach that learns to adapt a generic face ID model to new video face tracks in a fully self-supervised fashion.
- Score: 2.0719478063181027
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: With the rise of digital media content production, the need for analyzing movies and TV series episodes to locate the main cast of characters precisely is gaining importance. Specifically, Video Face Clustering aims to group together detected video face tracks with common facial identities. This problem is very challenging due to the large range of pose, expression, appearance, and lighting variations of a given face across video frames. Generic pre-trained Face Identification (ID) models fail to adapt well to the video production domain, given its high dynamic range content and also unique cinematic style. Furthermore, traditional clustering algorithms depend on hyperparameters requiring individual tuning across datasets. In this paper, we present a novel video face clustering approach that learns to adapt a generic face ID model to new video face tracks in a fully self-supervised fashion. We also propose a parameter-free clustering algorithm that is capable of automatically adapting to the finetuned model's embedding space for any input video. Due to the lack of comprehensive movie face clustering benchmarks, we also present a first-of-its-kind movie dataset: MovieFaceCluster. Our dataset is handpicked by film industry professionals and contains extremely challenging face ID scenarios. Experiments show our method's effectiveness in handling difficult mainstream movie scenes on our benchmark dataset and state-of-the-art performance on traditional TV series datasets.
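The abstract describes clustering face-track embeddings without dataset-specific hyperparameters. The paper's actual algorithm is not reproduced here; the following is a minimal illustrative sketch in which a similarity threshold is derived from the data itself (here, simply the mean pairwise cosine similarity) rather than hand-tuned. The function name, the toy embeddings, and the thresholding rule are all assumptions for illustration only.

```python
import numpy as np

def cluster_tracks(embeddings):
    """Group face-track embeddings by cosine similarity.

    Illustrative sketch only: uses the mean off-diagonal pairwise
    similarity as a data-derived threshold (an assumption), standing
    in for the paper's adaptive, parameter-free criterion.
    """
    # L2-normalize rows so dot products equal cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    n = len(embeddings)

    # Data-derived threshold: mean off-diagonal similarity.
    thresh = sim[~np.eye(n, dtype=bool)].mean()

    # Union-find: merge tracks whose similarity exceeds the threshold.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > thresh:
                parent[find(i)] = find(j)

    return [find(i) for i in range(n)]

# Toy data: two tight groups of 2-D "embeddings".
emb = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99]])
labels = cluster_tracks(emb)
# Tracks 0/1 end up in one cluster, tracks 2/3 in another.
```

Because the threshold is computed from the similarity distribution of the input itself, the same code adapts to embedding spaces with different similarity scales, which is the spirit of the parameter-free design the abstract claims.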
Related papers
- CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects [61.323597069037056]
Current approaches for personalizing text-to-video generation suffer from tackling multiple subjects.
We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects.
arXiv Detail & Related papers (2024-01-18T13:23:51Z) - Perceptual Quality Assessment of Face Video Compression: A Benchmark and An Effective Method [69.868145936998]
Generative coding approaches have been identified as promising alternatives with reasonable perceptual rate-distortion trade-offs.
The great diversity of distortion types in spatial and temporal domains, ranging from the traditional hybrid coding frameworks to generative models, presents grand challenges in compressed face video quality assessment (VQA).
We introduce the large-scale Compressed Face Video Quality Assessment (CFVQA) database, which is the first attempt to systematically understand the perceptual quality and diversified compression distortions in face videos.
arXiv Detail & Related papers (2023-04-14T11:26:09Z) - Self-supervised Video-centralised Transformer for Video Face Clustering [58.12996668434134]
This paper presents a novel method for face clustering in videos using a video-centralised transformer.
We release the first large-scale egocentric video face clustering dataset named EasyCom-Clustering.
arXiv Detail & Related papers (2022-03-24T16:38:54Z) - Seq-Masks: Bridging the gap between appearance and gait modeling for video-based person re-identification [10.490428828061292]
Video-based person re-identification (Re-ID) aims to match person images in video sequences captured by disjoint surveillance cameras.
Traditional video-based person Re-ID methods focus on exploring appearance information, thus, vulnerable against illumination changes, scene noises, camera parameters, and especially clothes/carrying variations.
We propose a framework that utilizes the sequence masks (SeqMasks) in the video to integrate appearance information and gait modeling in a close fashion.
arXiv Detail & Related papers (2021-12-10T16:00:20Z) - CLIP-It! Language-Guided Video Summarization [96.69415453447166]
This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization.
We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another.
Our model can be extended to the unsupervised setting by training without ground-truth supervision.
arXiv Detail & Related papers (2021-07-01T17:59:27Z) - Face, Body, Voice: Video Person-Clustering with Multiple Modalities [85.0282742801264]
Previous methods focus on the narrower task of face-clustering.
Most current datasets evaluate only the task of face-clustering, rather than person-clustering.
We introduce a Video Person-Clustering dataset, for evaluating multi-modal person-clustering.
arXiv Detail & Related papers (2021-05-20T17:59:40Z) - Coherent Loss: A Generic Framework for Stable Video Segmentation [103.78087255807482]
We investigate how a jittering artifact degrades the visual quality of video segmentation results.
We propose a Coherent Loss with a generic framework to enhance the performance of a neural network against jittering artifacts.
arXiv Detail & Related papers (2020-10-25T10:48:28Z) - Self-attention aggregation network for video face representation and recognition [0.0]
We propose a new model architecture for video face representation and recognition based on a self-attention mechanism.
Our approach could be used for video with single and multiple identities.
arXiv Detail & Related papers (2020-10-11T20:57:46Z) - Robust Character Labeling in Movie Videos: Data Resources and Self-supervised Feature Adaptation [39.373699774220775]
We present a dataset of over 169,000 face tracks curated from 240 Hollywood movies with weak labels.
We propose an offline algorithm based on nearest-neighbor search in the embedding space to mine hard-examples from these tracks.
Overall, we find that multiview correlation-based adaptation yields more discriminative and robust face embeddings.
arXiv Detail & Related papers (2020-08-25T22:07:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.