VideoClusterNet: Self-Supervised and Adaptive Clustering For Videos
- URL: http://arxiv.org/abs/2407.12214v1
- Date: Tue, 16 Jul 2024 23:34:55 GMT
- Title: VideoClusterNet: Self-Supervised and Adaptive Clustering For Videos
- Authors: Devesh Walawalkar, Pablo Garrido
- Abstract summary: Video Face Clustering aims to group together detected video face tracks with common facial identities.
This problem is very challenging due to the large range of pose, expression, appearance, and lighting variations of a given face across video frames.
We present a novel video face clustering approach that learns to adapt a generic face ID model to new video face tracks in a fully self-supervised fashion.
- Score: 2.0719478063181027
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: With the rise of digital media content production, the need for analyzing movies and TV series episodes to locate the main cast of characters precisely is gaining importance. Specifically, Video Face Clustering aims to group together detected video face tracks with common facial identities. This problem is very challenging due to the large range of pose, expression, appearance, and lighting variations of a given face across video frames. Generic pre-trained Face Identification (ID) models fail to adapt well to the video production domain, given its high dynamic range content and also unique cinematic style. Furthermore, traditional clustering algorithms depend on hyperparameters requiring individual tuning across datasets. In this paper, we present a novel video face clustering approach that learns to adapt a generic face ID model to new video face tracks in a fully self-supervised fashion. We also propose a parameter-free clustering algorithm that is capable of automatically adapting to the finetuned model's embedding space for any input video. Due to the lack of comprehensive movie face clustering benchmarks, we also present a first-of-its-kind movie dataset: MovieFaceCluster. Our dataset is handpicked by film industry professionals and contains extremely challenging face ID scenarios. Experiments show our method's effectiveness in handling difficult mainstream movie scenes on our benchmark dataset and state-of-the-art performance on traditional TV series datasets.
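The abstract describes clustering face-track embeddings without dataset-specific hyperparameters. The paper's actual algorithm is not reproduced here; the following is a minimal illustrative sketch in which a similarity threshold is derived from the data itself (here, simply the mean pairwise cosine similarity) rather than hand-tuned. The function name, the toy embeddings, and the thresholding rule are all assumptions for illustration only.

```python
import numpy as np

def cluster_tracks(embeddings):
    """Group face-track embeddings by cosine similarity.

    Illustrative sketch only: uses the mean off-diagonal pairwise
    similarity as a data-derived threshold (an assumption), standing
    in for the paper's adaptive, parameter-free criterion.
    """
    # L2-normalize rows so dot products equal cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    n = len(embeddings)

    # Data-derived threshold: mean off-diagonal similarity.
    thresh = sim[~np.eye(n, dtype=bool)].mean()

    # Union-find: merge tracks whose similarity exceeds the threshold.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > thresh:
                parent[find(i)] = find(j)

    return [find(i) for i in range(n)]

# Toy data: two tight groups of 2-D "embeddings".
emb = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99]])
labels = cluster_tracks(emb)
# Tracks 0/1 end up in one cluster, tracks 2/3 in another.
```

Because the threshold is computed from the similarity distribution of the input itself, the same code adapts to embedding spaces with different similarity scales, which is the spirit of the parameter-free design the abstract claims.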
Related papers
- CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects [61.323597069037056]
Current approaches for personalizing text-to-video generation suffer from tackling multiple subjects.
We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects.
arXiv Detail & Related papers (2024-01-18T13:23:51Z) - Perceptual Quality Assessment of Face Video Compression: A Benchmark and An Effective Method [69.868145936998]
Generative coding approaches have been identified as promising alternatives with reasonable perceptual rate-distortion trade-offs.
The great diversity of distortion types in spatial and temporal domains, ranging from the traditional hybrid coding frameworks to generative models, presents grand challenges in compressed face video quality assessment (VQA).
We introduce the large-scale Compressed Face Video Quality Assessment (CFVQA) database, which is the first attempt to systematically understand the perceptual quality and diversified compression distortions in face videos.
arXiv Detail & Related papers (2023-04-14T11:26:09Z) - Self-supervised Video-centralised Transformer for Video Face Clustering [58.12996668434134]
This paper presents a novel method for face clustering in videos using a video-centralised transformer.
We release the first large-scale egocentric video face clustering dataset named EasyCom-Clustering.
arXiv Detail & Related papers (2022-03-24T16:38:54Z) - Seq-Masks: Bridging the gap between appearance and gait modeling for video-based person re-identification [10.490428828061292]
Video-based person re-identification (Re-ID) aims to match person images in video sequences captured by disjoint surveillance cameras.
Traditional video-based person Re-ID methods focus on exploring appearance information, thus, vulnerable against illumination changes, scene noises, camera parameters, and especially clothes/carrying variations.
We propose a framework that utilizes the sequence masks (SeqMasks) in the video to integrate appearance information and gait modeling in a close fashion.
arXiv Detail & Related papers (2021-12-10T16:00:20Z) - CLIP-It! Language-Guided Video Summarization [96.69415453447166]
This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization.
We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another.
Our model can be extended to the unsupervised setting by training without ground-truth supervision.
arXiv Detail & Related papers (2021-07-01T17:59:27Z) - Face, Body, Voice: Video Person-Clustering with Multiple Modalities [85.0282742801264]
Previous methods focus on the narrower task of face-clustering.
Most current datasets evaluate only the task of face-clustering, rather than person-clustering.
We introduce a Video Person-Clustering dataset, for evaluating multi-modal person-clustering.
arXiv Detail & Related papers (2021-05-20T17:59:40Z) - Coherent Loss: A Generic Framework for Stable Video Segmentation [103.78087255807482]
We investigate how a jittering artifact degrades the visual quality of video segmentation results.
We propose a Coherent Loss with a generic framework to enhance the performance of a neural network against jittering artifacts.
arXiv Detail & Related papers (2020-10-25T10:48:28Z) - Self-attention aggregation network for video face representation and recognition [0.0]
We propose a new model architecture for video face representation and recognition based on a self-attention mechanism.
Our approach could be used for video with single and multiple identities.
arXiv Detail & Related papers (2020-10-11T20:57:46Z) - Robust Character Labeling in Movie Videos: Data Resources and Self-supervised Feature Adaptation [39.373699774220775]
We present a dataset of over 169,000 face tracks curated from 240 Hollywood movies with weak labels.
We propose an offline algorithm based on nearest-neighbor search in the embedding space to mine hard-examples from these tracks.
Overall, we find that multiview correlation-based adaptation yields more discriminative and robust face embeddings.
arXiv Detail & Related papers (2020-08-25T22:07:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.