Face, Body, Voice: Video Person-Clustering with Multiple Modalities
- URL: http://arxiv.org/abs/2105.09939v1
- Date: Thu, 20 May 2021 17:59:40 GMT
- Title: Face, Body, Voice: Video Person-Clustering with Multiple Modalities
- Authors: Andrew Brown, Vicky Kalogeiton, Andrew Zisserman
- Abstract summary: Previous methods focus on the narrower task of face-clustering.
Most current datasets evaluate only the task of face-clustering, rather than person-clustering.
We introduce a Video Person-Clustering dataset, for evaluating multi-modal person-clustering.
- Score: 85.0282742801264
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The objective of this work is person-clustering in videos -- grouping
characters according to their identity. Previous methods focus on the narrower
task of face-clustering, and for the most part ignore other cues such as the
person's voice, their overall appearance (hair, clothes, posture), and the
editing structure of the videos. Similarly, most current datasets evaluate only
the task of face-clustering, rather than person-clustering. This limits their
applicability to downstream applications such as story understanding which
require person-level, rather than only face-level, reasoning. In this paper we
make contributions to address both these deficiencies: first, we introduce a
Multi-Modal High-Precision Clustering algorithm for person-clustering in videos
using cues from several modalities (face, body, and voice). Second, we
introduce a Video Person-Clustering dataset, for evaluating multi-modal
person-clustering. It contains body-tracks for each annotated character,
face-tracks when visible, and voice-tracks when speaking, with their associated
features. The dataset is by far the largest of its kind, and covers films and
TV-shows representing a wide range of demographics. Finally, we show the
effectiveness of using multiple modalities for person-clustering, explore the
use of this new broad task for story understanding through character
co-occurrences, and achieve a new state of the art on all available datasets
for face and person-clustering.
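As a concrete illustration of the multi-modal idea, the sketch below fuses face, body, and voice similarities into a single track-to-track distance and clusters tracks agglomeratively. It is a minimal baseline, not the paper's Multi-Modal High-Precision Clustering algorithm: the fusion weights and distance threshold are hypothetical, and the per-modality embeddings are assumed to be precomputed and L2-normalised.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_tracks(face, body, voice, weights=(0.6, 0.25, 0.15), thresh=0.35):
    """Group person-tracks by identity from fused multi-modal similarity.

    face, body, voice: (N, D) L2-normalised per-track embeddings.
    weights, thresh:   illustrative values, not tuned or taken from the paper.
    """
    sims = [f @ f.T for f in (face, body, voice)]       # cosine similarity per modality
    fused = sum(w * s for w, s in zip(weights, sims))   # weighted fusion of the cues
    dist = 1.0 - fused                                  # similarity -> distance
    np.fill_diagonal(dist, 0.0)
    dist = np.clip(dist, 0.0, None)                     # guard against tiny negatives
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=thresh, criterion="distance")  # one cluster label per track
```

The paper's algorithm goes further, for instance by exploiting voice and editing-structure cues under precision-oriented constraints; the sketch only shows the fused-similarity backbone that such methods share.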
Related papers
- VideoClusterNet: Self-Supervised and Adaptive Face Clustering For Videos [2.0719478063181027]
Video Face Clustering aims to group together detected video face tracks with common facial identities.
This problem is very challenging due to the large range of pose, expression, appearance, and lighting variations of a given face across video frames.
We present a novel video face clustering approach that learns to adapt a generic face ID model to new video face tracks in a fully self-supervised fashion.
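The track-level self-supervision signal can be illustrated with a generic contrastive objective, assuming two augmented crops sampled from the same face track form a positive pair; this is a common recipe, not VideoClusterNet's actual adaptation procedure.

```python
import torch
import torch.nn.functional as F

def track_nce_loss(model, frames_a, frames_b, temperature=0.07):
    """Two crops from the SAME face track are positives; crops from the other
    tracks in the batch act as negatives (generic track-level contrastive
    objective, not the paper's exact formulation).

    frames_a, frames_b: (B, 3, H, W) paired crops, one pair per track.
    """
    za = F.normalize(model(frames_a), dim=1)   # (B, D) track-frame embeddings
    zb = F.normalize(model(frames_b), dim=1)
    logits = za @ zb.T / temperature           # (B, B) pairwise similarities
    labels = torch.arange(za.size(0), device=za.device)
    return F.cross_entropy(logits, labels)     # each track should match itself
```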
arXiv Detail & Related papers (2024-07-16T23:34:55Z)
- Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering [8.447067012487866]
Multi-MaP is a novel method employing a multi-modal proxy learning process.
It not only captures a user's interest via a keyword but also facilitates identifying the relevant clusterings.
Our experiments show that Multi-MaP consistently outperforms state-of-the-art methods in all benchmark multi-clustering vision tasks.
arXiv Detail & Related papers (2024-04-24T05:20:42Z)
- Unified and Dynamic Graph for Temporal Character Grouping in Long Videos [31.192044026127032]
Video temporal character grouping locates the moments at which each major character appears in a video, according to identity.
Recent works have evolved from unsupervised clustering to graph-based supervised clustering.
We present a unified and dynamic graph (UniDG) framework for temporal character grouping.
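To make the graph-based family concrete, a hedged sketch of its simplest variant follows: link tracks into a k-nearest-neighbour similarity graph and read identity groups off its connected components. UniDG's unified and dynamic graph is considerably more involved; `k` and the similarity threshold here are hypothetical.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components

def graph_group_tracks(track_feats, k=5, sim_thresh=0.7):
    """Generic graph-based grouping (an illustration of the family, not UniDG):
    connect each track to its nearest neighbours above a similarity threshold,
    then take connected components as identity groups.

    track_feats: (N, D) L2-normalised track embeddings.
    """
    sim = track_feats @ track_feats.T
    np.fill_diagonal(sim, -np.inf)                  # no self-edges
    rows, cols = [], []
    for i in range(len(sim)):
        for j in np.argsort(sim[i])[-k:]:           # k most similar tracks
            if sim[i, j] >= sim_thresh:
                rows.append(i)
                cols.append(j)
    adj = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=sim.shape)
    _, labels = connected_components(adj, directed=False)
    return labels                                   # identity label per track
```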
arXiv Detail & Related papers (2023-08-27T13:22:55Z)
- Relation-Aware Distribution Representation Network for Person Clustering with Multiple Modalities [17.569843539515734]
Person clustering with multi-modal clues, including faces, bodies, and voices, is critical for various tasks.
We propose a Relation-Aware Distribution representation Network (RAD-Net) to generate a distribution representation for multi-modal clues.
Our method achieves substantial improvements of +6% and +8.2% in F-score on the Video Person-Clustering dataset.
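One hedged reading of a "distribution representation" is to describe each track not by its raw embedding but by its similarity profile to every other track in each modality, as sketched below; this illustrates the general idea only and is not RAD-Net itself.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def distribution_representation(face, body, voice):
    """Represent each track by its per-modality similarity profile to all
    other tracks (an illustration of the idea, not RAD-Net's actual model).

    face, body, voice: (N, D_m) L2-normalised embeddings per modality.
    """
    profiles = [f @ f.T for f in (face, body, voice)]   # (N, N) similarity rows
    return np.concatenate(profiles, axis=1)             # (N, 3N) profile per track

# Clustering on the profiles instead of the raw features (threshold hypothetical):
# reps = distribution_representation(face, body, voice)
# labels = AgglomerativeClustering(n_clusters=None,
#                                  distance_threshold=0.5).fit_predict(reps)
```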
arXiv Detail & Related papers (2023-08-01T15:04:56Z)
- GOCA: Guided Online Cluster Assignment for Self-Supervised Video Representation Learning [49.69279760597111]
Clustering is a ubiquitous tool in unsupervised learning.
Most existing self-supervised representation learning methods cluster samples based on visually dominant features.
We propose a principled way to combine two views: the initial cluster assignment of each view is used as a prior to guide the final cluster assignment of the other view.
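A minimal sketch of the guided-assignment idea, assuming two normalised embeddings of the same clips (e.g. an appearance view and a motion view) and a shared set of prototypes; the swapped-prediction loss below is a simplification, since GOCA forms its assignments with online optimal transport rather than a plain softmax.

```python
import torch
import torch.nn.functional as F

def guided_assignment_loss(z_a, z_b, prototypes, temperature=0.1):
    """Each view's initial soft cluster assignment serves as the target for
    the other view's prediction (simplified sketch of cross-view guidance).

    z_a, z_b:   (B, D) L2-normalised embeddings of the same clips, two views.
    prototypes: (K, D) L2-normalised cluster centres.
    """
    p_a = F.softmax(z_a @ prototypes.T / temperature, dim=1)  # initial assignments
    p_b = F.softmax(z_b @ prototypes.T / temperature, dim=1)
    # Swapped prediction with stop-gradient on the guiding view.
    loss_a = -(p_b.detach() * (p_a + 1e-8).log()).sum(dim=1).mean()
    loss_b = -(p_a.detach() * (p_b + 1e-8).log()).sum(dim=1).mean()
    return loss_a + loss_b
```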
arXiv Detail & Related papers (2022-07-20T19:26:55Z)
- Human Instance Segmentation and Tracking via Data Association and Single-stage Detector [17.46922710432633]
Human video instance segmentation plays an important role in computer understanding of human activities.
Most current video instance segmentation (VIS) methods are based on the Mask R-CNN framework.
We develop a new method for human video instance segmentation based on a single-stage detector.
arXiv Detail & Related papers (2022-03-31T11:36:09Z)
- Self-supervised Video-centralised Transformer for Video Face Clustering [58.12996668434134]
This paper presents a novel method for face clustering in videos using a video-centralised transformer.
We release the first large-scale egocentric video face clustering dataset named EasyCom-Clustering.
arXiv Detail & Related papers (2022-03-24T16:38:54Z)
- Clustering by Maximizing Mutual Information Across Views [62.21716612888669]
We propose a novel framework for image clustering that incorporates joint representation learning and clustering.
Our method significantly outperforms state-of-the-art single-stage clustering methods across a variety of image datasets.
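The objective can be illustrated with an IIC-style loss that maximises the mutual information between the soft cluster assignments of two augmented views of the same images; this stands in for, and simplifies, the paper's exact formulation.

```python
import torch

def cluster_mi_loss(p_a, p_b, eps=1e-8):
    """Maximise mutual information between the cluster assignments of two
    augmented views (IIC-style objective; a stand-in for the paper's loss).

    p_a, p_b: (B, K) softmax cluster probabilities for the two views.
    """
    joint = (p_a.unsqueeze(2) * p_b.unsqueeze(1)).mean(0)  # (K, K) joint distribution
    joint = ((joint + joint.T) / 2).clamp_min(eps)         # symmetrise, avoid log(0)
    marg_a = joint.sum(dim=1, keepdim=True)                # row marginal
    marg_b = joint.sum(dim=0, keepdim=True)                # column marginal
    mi = (joint * (joint.log() - marg_a.log() - marg_b.log())).sum()
    return -mi                                             # minimise negative MI
```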
arXiv Detail & Related papers (2021-07-24T15:36:49Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos.
We present a fully automatic and unsupervised approach, requiring no training, for segmenting the actions in a video.
Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
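A plain illustration of the idea, not the paper's algorithm: down-weight the semantic similarity of frames that are far apart in time, then group frames agglomeratively. The Gaussian temporal weighting, bandwidth, and threshold below are all hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def temporal_weighted_segments(frame_feats, sigma=50.0, thresh=0.6):
    """Group frames into action segments via temporally-weighted hierarchical
    clustering (an illustration of the idea, not the paper's exact weighting).

    frame_feats: (T, D) L2-normalised per-frame features.
    sigma:       temporal bandwidth in frames (hypothetical value).
    """
    T = len(frame_feats)
    sim = frame_feats @ frame_feats.T                      # semantic similarity
    idx = np.arange(T)
    temporal = np.exp(-((idx[:, None] - idx[None, :]) ** 2) / (2 * sigma ** 2))
    dist = 1.0 - sim * temporal                            # far-in-time pairs look dissimilar
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(np.clip(dist, 0, None), checks=False), method="average")
    return fcluster(Z, t=thresh, criterion="distance")     # segment label per frame
```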
arXiv Detail & Related papers (2021-03-20T23:30:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and accepts no responsibility for any consequences of its use.