Self-supervised Video-centralised Transformer for Video Face Clustering
- URL: http://arxiv.org/abs/2203.13166v1
- Date: Thu, 24 Mar 2022 16:38:54 GMT
- Title: Self-supervised Video-centralised Transformer for Video Face Clustering
- Authors: Yujiang Wang, Mingzhi Dong, Jie Shen, Yiming Luo, Yiming Lin,
Pingchuan Ma, Stavros Petridis, Maja Pantic
- Abstract summary: This paper presents a novel method for face clustering in videos using a video-centralised transformer.
We release the first large-scale egocentric video face clustering dataset named EasyCom-Clustering.
- Score: 58.12996668434134
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a novel method for face clustering in videos using a
video-centralised transformer. Previous works often employed contrastive
learning to learn frame-level representations and used average pooling to
aggregate the features along the temporal dimension. This approach may not
fully capture the complex video dynamics. In addition, despite the recent
progress in video-based contrastive learning, few works have attempted to learn a
self-supervised clustering-friendly face representation that benefits the video
face clustering task. To overcome these limitations, our method employs a
transformer to directly learn video-level representations that can better
reflect the temporally-varying property of faces in videos, while we also
propose a video-centralised self-supervised framework to train the transformer
model. We also investigate face clustering in egocentric videos, a
fast-emerging field that prior face clustering work has yet to study. To
this end, we present and release the first large-scale
egocentric video face clustering dataset named EasyCom-Clustering. We evaluate
our proposed method on both the widely used Big Bang Theory (BBT) dataset and
the new EasyCom-Clustering dataset. Results show that our video-centralised
transformer surpasses all previous state-of-the-art methods on both
benchmarks, exhibiting a self-attentive understanding of face videos.
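To make the abstract's contrast concrete, below is a minimal PyTorch-style sketch (an illustration only, not the authors' released code; the module names, dimensions, and [CLS]-token readout are assumptions) of the two temporal aggregation strategies it compares: average pooling of frame-level features versus a transformer that learns a video-level representation.

    # Minimal sketch of temporal average pooling vs. a temporal transformer.
    # Hyperparameters and module names are illustrative assumptions; the face
    # backbone and positional encodings are omitted for brevity.
    import torch
    import torch.nn as nn

    class AvgPoolAggregator(nn.Module):
        """Baseline: mean-pool per-frame embeddings over time."""
        def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
            # frame_feats: (batch, num_frames, dim)
            return frame_feats.mean(dim=1)

    class TemporalTransformerAggregator(nn.Module):
        """Video-level embedding via self-attention over frame tokens,
        read out from a prepended [CLS]-style token."""
        def __init__(self, dim: int = 512, num_layers: int = 4, num_heads: int = 8):
            super().__init__()
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

        def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
            # frame_feats: (batch, num_frames, dim)
            cls = self.cls_token.expand(frame_feats.size(0), -1, -1)
            tokens = torch.cat([cls, frame_feats], dim=1)
            return self.encoder(tokens)[:, 0]  # video-level embedding

    frame_feats = torch.randn(2, 16, 512)  # 2 face tracks, 16 frames each
    video_emb = TemporalTransformerAggregator()(frame_feats)  # shape (2, 512)

The point of the sketch is only that self-attention can weight frames unequally and model their interactions, whereas average pooling treats every frame identically.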
Related papers
- VideoClusterNet: Self-Supervised and Adaptive Face Clustering For Videos [2.0719478063181027]
Video Face Clustering aims to group together detected video face tracks with common facial identities.
This problem is very challenging due to the large range of pose, expression, appearance, and lighting variations of a given face across video frames.
We present a novel video face clustering approach that learns to adapt a generic face ID model to new video face tracks in a fully self-supervised fashion.
arXiv Detail & Related papers (2024-07-16T23:34:55Z) - Multi-entity Video Transformers for Fine-Grained Video Representation
Learning [36.31020249963468]
We re-examine the design of transformer architectures for video representation learning.
A salient aspect of our self-supervised method is the improved integration of spatial information in the temporal pipeline.
Our Multi-entity Video Transformer (MV-Former) architecture achieves state-of-the-art results on multiple fine-grained video benchmarks.
arXiv Detail & Related papers (2023-11-17T21:23:12Z) - Unmasking Deepfakes: Masked Autoencoding Spatiotemporal Transformers for
Enhanced Video Forgery Detection [19.432851794777754]
We present a novel approach for the detection of deepfake videos using a pair of vision transformers pre-trained by a self-supervised masked autoencoding setup.
Our method consists of two distinct components, one of which focuses on learning spatial information from individual RGB frames of the video, while the other learns temporal consistency information from optical flow fields generated from consecutive frames.
arXiv Detail & Related papers (2023-06-12T05:49:23Z) - Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene
Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z) - Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent (a minimal sketch of this style of objective appears after this list).
arXiv Detail & Related papers (2021-06-17T02:30:26Z) - Multiview Pseudo-Labeling for Semi-supervised Learning from Video [102.36355560553402]
We present a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video.
Our method capitalizes on multiple views, but it nonetheless trains a model that is shared across appearance and motion input.
On multiple video recognition datasets, our method substantially outperforms its supervised counterpart, and compares favorably to previous work on standard benchmarks in self-supervised video representation learning.
arXiv Detail & Related papers (2021-04-01T17:59:48Z) - Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z) - Unsupervised Learning of Video Representations via Dense Trajectory
Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top-performing objectives in this class: instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)