Self-attention aggregation network for video face representation and
recognition
- URL: http://arxiv.org/abs/2010.05340v1
- Date: Sun, 11 Oct 2020 20:57:46 GMT
- Authors: Ihor Protsenko, Taras Lehinevych, Dmytro Voitekh, Ihor Kroosh, Nick
Hasty, Anthony Johnson
- Abstract summary: We propose a new model architecture for video face representation and recognition based on a self-attention mechanism.
Our approach can be applied to videos containing a single identity or multiple identities.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Models based on self-attention mechanisms have been successful in analyzing
temporal data and have been widely used in the natural language domain. We
propose a new model architecture for video face representation and recognition
based on a self-attention mechanism. Our approach can be applied to videos
containing a single identity or multiple identities. To the best of our
knowledge, no prior work has explored aggregation approaches that handle
videos with multiple identities. The proposed approach uses existing models,
e.g., ArcFace and MobileFaceNet, to obtain a face representation for each
video frame; the aggregation module then produces a single aggregated face
representation vector for the video, taking into account the order of frames
and their quality scores. We
demonstrate empirical results on a public dataset for video face recognition
called IJB-C to indicate that the self-attention aggregation network (SAAN)
outperforms naive average pooling. Moreover, we introduce a new multi-identity
video dataset based on the publicly available UMDFaces dataset and collected
GIFs from Giphy. We show that SAAN is capable of producing a compact face
representation for both single and multiple identities in a video. The dataset
and source code will be publicly available.
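The aggregation step described in the abstract can be illustrated as attention pooling over per-frame embeddings. The sketch below is a simplified, single-head stand-in, not the paper's exact formulation: the projection and query weights are random placeholders for learned parameters, and the sinusoidal positional encoding and additive quality bias are assumptions about how frame order and quality scores could enter the computation.

```python
import numpy as np

def attention_aggregate(frame_embeddings, quality_scores, rng=None):
    """Aggregate per-frame face embeddings into one vector.

    Minimal single-head attention-pooling sketch: positional encodings
    inject frame order, quality scores bias the attention logits, and a
    softmax-weighted sum produces the aggregated representation.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = frame_embeddings.shape

    # Sinusoidal positional encoding so the pooling sees frame order.
    pos = np.arange(n)[:, None]
    div = np.exp(-np.log(10000.0) * (np.arange(d) // 2 * 2) / d)
    pe = np.where(np.arange(d) % 2 == 0,
                  np.sin(pos * div), np.cos(pos * div))
    x = frame_embeddings + 0.1 * pe

    # Learned key projection and query vector (random here, illustration only).
    w_k = rng.standard_normal((d, d)) / np.sqrt(d)
    query = rng.standard_normal(d)

    # Attention logits, biased by per-frame quality scores, then softmax.
    logits = x @ w_k @ query / np.sqrt(d) + quality_scores
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    return weights @ frame_embeddings  # (d,) aggregated face embedding

# Toy usage: 5 frames of 128-D embeddings; frame 2 has the highest quality.
emb = np.random.default_rng(1).standard_normal((5, 128))
quality = np.array([0.1, 0.2, 3.0, 0.2, 0.1])
agg = attention_aggregate(emb, quality)
naive = emb.mean(axis=0)  # the average-pooling baseline the paper compares to
```

The high quality score on frame 2 pulls the attention weights toward that frame, so the aggregate differs from the uniform mean that naive average pooling would produce.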
Related papers
- VideoClusterNet: Self-Supervised and Adaptive Face Clustering For Videos [2.0719478063181027]
Video Face Clustering aims to group together detected video face tracks with common facial identities.
This problem is very challenging due to the large range of pose, expression, appearance, and lighting variations of a given face across video frames.
We present a novel video face clustering approach that learns to adapt a generic face ID model to new video face tracks in a fully self-supervised fashion.
arXiv Detail & Related papers (2024-07-16T23:34:55Z)
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
- MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection [17.74528571088335]
We introduce MINTIME, a video deepfake detection approach that captures spatial and temporal anomalies and handles instances of multiple people in the same video and variations in face sizes.
It achieves state-of-the-art results on the ForgeryNet dataset with an improvement of up to 14% AUC in videos containing multiple people.
arXiv Detail & Related papers (2022-11-20T15:17:24Z)
- Temporal Saliency Query Network for Efficient Video Recognition [82.52760040577864]
Video recognition is a hot-spot research topic with the explosive growth of multimedia data on the Internet and mobile devices.
Most existing methods select the salient frames without awareness of the class-specific saliency scores.
We propose a novel Temporal Saliency Query (TSQ) mechanism, which introduces class-specific information to provide fine-grained cues for saliency measurement.
arXiv Detail & Related papers (2022-07-21T09:23:34Z)
- Self-supervised Video-centralised Transformer for Video Face Clustering [58.12996668434134]
This paper presents a novel method for face clustering in videos using a video-centralised transformer.
We release the first large-scale egocentric video face clustering dataset named EasyCom-Clustering.
arXiv Detail & Related papers (2022-03-24T16:38:54Z)
- Boosting Video Representation Learning with Multi-Faceted Integration [112.66127428372089]
Video content is multifaceted, consisting of objects, scenes, interactions or actions.
Existing datasets mostly label only one of the facets for model training, resulting in the video representation that biases to only one facet depending on the training dataset.
We propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content.
arXiv Detail & Related papers (2022-01-11T16:14:23Z)
- Seq-Masks: Bridging the gap between appearance and gait modeling for video-based person re-identification [10.490428828061292]
Video-based person re-identification (Re-ID) aims to match person images in video sequences captured by disjoint surveillance cameras.
Traditional video-based person Re-ID methods focus on exploring appearance information, thus, vulnerable against illumination changes, scene noises, camera parameters, and especially clothes/carrying variations.
We propose a framework that utilizes the sequence masks (SeqMasks) in the video to integrate appearance information and gait modeling in a close fashion.
arXiv Detail & Related papers (2021-12-10T16:00:20Z)
- Automated Video Labelling: Identifying Faces by Corroborative Evidence [79.44208317138784]
We present a method for automatically labelling all faces in video archives, such as TV broadcasts, by combining multiple evidence sources and multiple modalities.
We provide a novel, simple, method for determining if a person is famous or not using image-search engines.
We show that even for less-famous people, image-search engines can be used for corroborative evidence to accurately label faces that are named in the scene or the speech.
arXiv Detail & Related papers (2021-02-10T18:57:52Z)
- GroupFace: Learning Latent Groups and Constructing Group-based Representations for Face Recognition [20.407167858663453]
We propose a novel face-recognition-specialized architecture called GroupFace to improve the quality of the embedding feature.
The proposed method provides self-distributed labels that balance the number of samples belonging to each group without additional human annotations.
All the components of the proposed method can be trained in an end-to-end manner with a marginal increase of computational complexity.
arXiv Detail & Related papers (2020-05-21T07:30:34Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model comprises dual-level attention (word/object and frame level) and multi-head self/cross-integration across sources (video and dense captions), with gates that pass the more relevant information onward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.