Robust Character Labeling in Movie Videos: Data Resources and
Self-supervised Feature Adaptation
- URL: http://arxiv.org/abs/2008.11289v2
- Date: Fri, 25 Feb 2022 23:18:30 GMT
- Authors: Krishna Somandepalli, Rajat Hebbar, Shrikanth Narayanan
- Abstract summary: We present a dataset of over 169,000 face tracks curated from 240 Hollywood movies with weak labels.
We propose an offline algorithm based on nearest-neighbor search in the embedding space to mine hard-examples from these tracks.
Overall, we find that multiview correlation-based adaptation yields more discriminative and robust face embeddings.
- Score: 39.373699774220775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Robust face clustering is a vital step in enabling computational
understanding of visual character portrayal in media. Face clustering for
long-form content is challenging because of variations in appearance and lack
of supporting large-scale labeled data. Our work in this paper focuses on two
key aspects of this problem: the lack of domain-specific training or benchmark
datasets, and adapting face embeddings learned on web images to long-form
content, specifically movies. First, we present a dataset of over 169,000 face
tracks curated from 240 Hollywood movies with weak labels on whether a pair of
face tracks belong to the same or a different character. We propose an offline
algorithm based on nearest-neighbor search in the embedding space to mine
hard-examples from these tracks. We then investigate triplet-loss and multiview
correlation-based methods for adapting face embeddings to hard-examples. Our
experimental results highlight the usefulness of weakly labeled data for
domain-specific feature adaptation. Overall, we find that multiview
correlation-based adaptation yields more discriminative and robust face
embeddings. Its performance on downstream face verification and clustering
tasks is comparable to state-of-the-art results in this domain. We
also present the SAIL-Movie Character Benchmark corpus developed to augment
existing benchmarks. It consists of racially diverse actors and provides
face-quality labels for subsequent error analysis. We hope that the large-scale
datasets developed in this work can further advance automatic character
labeling in videos. All resources are available freely at
https://sail.usc.edu/~ccmi/multiface.
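The abstract describes an offline mining step (nearest-neighbor search in the embedding space to surface hard examples across face tracks) followed by triplet-loss adaptation. A minimal sketch of that general recipe is below; the function names, the neighbor count `k`, and the margin are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch: mine hard negatives via nearest-neighbor search in a
# face-embedding space, then score triplets with a standard margin loss.
import numpy as np

def mine_hard_examples(embeddings, track_ids, k=5):
    """For each face embedding, return the indices of its k most similar
    embeddings that belong to a *different* track (hard negatives)."""
    # L2-normalize so dot products equal cosine similarity.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb.T                   # pairwise cosine similarity
    np.fill_diagonal(sims, -np.inf)      # exclude self-matches
    hard_negatives = {}
    for i in range(len(emb)):
        order = np.argsort(-sims[i])     # most similar first
        negs = [j for j in order if track_ids[j] != track_ids[i]][:k]
        hard_negatives[i] = negs
    return hard_negatives

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard margin-based triplet loss on a single triplet."""
    d_ap = np.sum((anchor - positive) ** 2)   # anchor-positive distance
    d_an = np.sum((anchor - negative) ** 2)   # anchor-negative distance
    return max(0.0, d_ap - d_an + margin)
```

In practice the mined hard negatives would feed triplet construction during fine-tuning; the multiview correlation-based adaptation the paper favors is a different objective and is not sketched here.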
Related papers
- VideoClusterNet: Self-Supervised and Adaptive Face Clustering For Videos [2.0719478063181027]
Video Face Clustering aims to group together detected video face tracks with common facial identities.
This problem is very challenging due to the large range of pose, expression, appearance, and lighting variations of a given face across video frames.
We present a novel video face clustering approach that learns to adapt a generic face ID model to new video face tracks in a fully self-supervised fashion.
arXiv Detail & Related papers (2024-07-16T23:34:55Z)
- Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models [31.69213233651326]
We introduce the novel task of Visual Data-Type Identification.
An extensive zero-shot evaluation of 39 vision-language models (VLMs) shows a nuanced performance landscape.
arXiv Detail & Related papers (2023-10-12T17:59:30Z)
- Is this Harmful? Learning to Predict Harmfulness Ratings from Video [15.059547998989537]
We create a dataset of approximately 4000 video clips, annotated by professionals in the field.
We conduct an in-depth study on our modeling choices and find that we greatly benefit from combining the visual and audio modality.
Our dataset will be made available upon publication.
arXiv Detail & Related papers (2021-06-15T17:57:12Z)
- Face, Body, Voice: Video Person-Clustering with Multiple Modalities [85.0282742801264]
Previous methods focus on the narrower task of face-clustering.
Most current datasets evaluate only the task of face-clustering, rather than person-clustering.
We introduce a Video Person-Clustering dataset, for evaluating multi-modal person-clustering.
arXiv Detail & Related papers (2021-05-20T17:59:40Z)
- Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification [110.52328716130022]
Video-based person re-identification (re-ID) is an important research topic in computer vision.
We propose a novel graph-based framework, namely Multi-Granular Hypergraph (MGH), to achieve better representational capabilities.
MGH achieves 90.0% top-1 accuracy on MARS, outperforming state-of-the-art schemes.
arXiv Detail & Related papers (2021-04-30T11:20:02Z)
- Connecting Images through Time and Sources: Introducing Low-data, Heterogeneous Instance Retrieval [3.6526118822907594]
We show that it is not trivial to pick features responding well to a panel of variations and semantic content.
Introducing a new enhanced version of the Alegoria benchmark, we compare descriptors using the detailed annotations.
arXiv Detail & Related papers (2021-03-19T10:54:51Z)
- CharacterGAN: Few-Shot Keypoint Character Animation and Reposing [64.19520387536741]
We introduce CharacterGAN, a generative model that can be trained on only a few samples of a given character.
Our model generates novel poses based on keypoint locations, which can be modified in real time while providing interactive feedback.
We show that our approach outperforms recent baselines and creates realistic animations for diverse characters.
arXiv Detail & Related papers (2021-02-05T12:38:15Z)
- Red Carpet to Fight Club: Partially-supervised Domain Transfer for Face Recognition in Violent Videos [12.534785814117065]
We introduce the WildestFaces dataset to study cross-domain recognition under a variety of adverse conditions.
We establish a rigorous evaluation protocol for this clean-to-violent recognition task, and present a detailed analysis of the proposed dataset and the methods.
arXiv Detail & Related papers (2020-09-16T09:45:33Z)
- Labelling unlabelled videos from scratch with multi-modal self-supervision [82.60652426371936]
Unsupervised labelling of a video dataset does not come for free from strong feature encoders.
We propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations.
An extensive analysis shows that the resulting clusters have high semantic overlap to ground truth human labels.
arXiv Detail & Related papers (2020-06-24T12:28:17Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level) and multi-head self-cross-integration across input sources (video and dense captions), with gates that pass the more relevant information onward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.