Automated Video Labelling: Identifying Faces by Corroborative Evidence
- URL: http://arxiv.org/abs/2102.05645v1
- Date: Wed, 10 Feb 2021 18:57:52 GMT
- Title: Automated Video Labelling: Identifying Faces by Corroborative Evidence
- Authors: Andrew Brown, Ernesto Coto, Andrew Zisserman
- Score: 79.44208317138784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a method for automatically labelling all faces in video archives,
such as TV broadcasts, by combining multiple evidence sources and multiple
modalities (visual and audio). We target the problem of ever-growing online
video archives, where an effective, scalable indexing solution cannot require a
user to provide manual annotation or supervision. To this end, we make three
key contributions: (1) We provide a novel, simple method for determining
whether a person is famous using image-search engines. In turn, this enables a
face-identity model to be built reliably and robustly, and used for high
precision automatic labelling; (2) We show that even for less-famous people,
image-search engines can then be used for corroborative evidence to accurately
label faces that are named in the scene or the speech; (3) Finally, we
quantitatively demonstrate the benefits of our approach on different video
domains and test settings, such as TV shows and news broadcasts. Our method
works across three disparate datasets without any explicit domain adaptation,
and sets new state-of-the-art results on all the public benchmarks.
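The abstract's first contribution, testing whether a name belongs to a "famous" person via image search, can be sketched as follows. The idea, as described, is that searching a famous name returns images dominated by a single identity, which can be checked by embedding the detected faces and measuring how tightly they cluster. The function names, the cosine-similarity criterion, and the threshold value below are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all distinct pairs of face embeddings
    (rows of `embeddings`), excluding each embedding's self-similarity."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    return (sims.sum() - n) / (n * (n - 1))

def looks_famous(embeddings: np.ndarray, threshold: float = 0.7) -> bool:
    """If the faces found in the top image-search results mostly show one
    identity, the queried name is treated as 'famous', and those images can
    seed a face-identity model for high-precision labelling."""
    return mean_pairwise_cosine(embeddings) >= threshold
```

In practice the embeddings would come from a face detector and a pretrained face-recognition network run over the downloaded search results; here they are just row vectors.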
Related papers
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z) - VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis [40.869862603815875]
VLOGGER is a method for audio-driven human video generation from a single input image.
We use a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls.
We show applications in video editing and personalization.
arXiv Detail & Related papers (2024-03-13T17:59:02Z) - Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling [62.25533750469467]
We propose an audio-visual method that generates a full transcript of the dialogue, with precise speech timestamps, and the character speaking identified.
We evaluate the method over a variety of TV sitcoms, including Seinfeld, Frasier and Scrubs.
We envision this system being useful for the automatic generation of subtitles to improve the accessibility of videos available on modern streaming services.
arXiv Detail & Related papers (2024-01-22T15:26:01Z) - Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant
Features [0.0]
Group-level emotion recognition can be useful in many fields including social robotics, conversational agents, e-coaching and learning analytics.
This paper explores privacy-compliant group-level emotion recognition "in-the-wild" within the EmotiW Challenge 2023.
arXiv Detail & Related papers (2023-12-06T08:58:11Z) - Active Learning for Video Classification with Frame Level Queries [13.135234328352885]
We propose a novel active learning framework for video classification.
Our framework identifies a batch of exemplar videos, together with a set of informative frames for each video.
This involves much less manual work than watching the complete video to come up with a label.
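The frame-level querying described above could work as sketched below: score each frame by the uncertainty of the classifier's prediction and ask the annotator to label only the top-scoring frames. The abstract does not specify the acquisition function; predictive entropy is a common choice and is used here purely as an assumed stand-in.

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy of a class-probability vector."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum()

def pick_informative_frames(frame_probs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k frames whose predicted class distributions
    have the highest entropy; only these frames go to the annotator."""
    scores = np.array([entropy(p) for p in frame_probs])
    return np.argsort(scores)[::-1][:k]
```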
arXiv Detail & Related papers (2023-07-10T15:47:13Z) - Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
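The person-of-interest idea above can be sketched as an identity-matching check: compare a test clip's identity embedding against embeddings from genuine reference clips of the same person, and flag the clip when it drifts too far from all of them. The distance measure, function names, and threshold are illustrative assumptions, not the paper's learned contrastive embeddings.

```python
import numpy as np

def poi_score(test_emb: np.ndarray, reference_embs: list) -> float:
    """Minimum cosine distance between the test clip's identity embedding
    and embeddings extracted from genuine reference clips of the person."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - max(cos(test_emb, r) for r in reference_embs)

def is_fake(test_emb: np.ndarray, reference_embs: list,
            threshold: float = 0.3) -> bool:
    """Flag the clip as manipulated when even the closest genuine
    reference is farther away than the threshold."""
    return poi_score(test_emb, reference_embs) > threshold
```

The same check applies whether the embedding comes from the face track, the audio segment, or both, which is how a single detector can cover audio-only, video-only, and audio-video attacks.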
arXiv Detail & Related papers (2022-04-06T20:51:40Z) - Automatic Generation of Descriptive Titles for Video Clips Using Deep
Learning [2.724141845301679]
We are proposing an architecture that utilizes image/video captioning methods and Natural Language Processing systems to generate a title and a concise abstract for a video.
Such a system can potentially be utilized in many application domains, including the cinema industry, video search engines, security surveillance, video databases/warehouses, and data centers.
arXiv Detail & Related papers (2021-04-07T18:14:18Z) - Face Forensics in the Wild [121.23154918448618]
We construct a novel large-scale dataset, called FFIW-10K, which comprises 10,000 high-quality forgery videos.
The manipulation procedure is fully automatic, controlled by a domain-adversarial quality assessment network.
In addition, we propose a novel algorithm to tackle the task of multi-person face forgery detection.
arXiv Detail & Related papers (2021-03-30T05:06:19Z) - Self-attention aggregation network for video face representation and
recognition [0.0]
We propose a new model architecture for video face representation and recognition based on a self-attention mechanism.
Our approach could be used for video with single and multiple identities.
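A minimal sketch of attention-based aggregation for video face representation: per-frame face embeddings are combined into one video-level descriptor using attention weights, so informative frames contribute more than blurred or occluded ones. The projection vector `w` would be learned in the actual model; here it is just an input, and the single-head pooling form is an assumption about the architecture.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(frame_embs: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Aggregate per-frame face embeddings (frames x dim) into a single
    video-level embedding via attention weights scored by projection w."""
    scores = softmax(frame_embs @ w)   # one normalized weight per frame
    return scores @ frame_embs         # attention-weighted sum of frames
```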
arXiv Detail & Related papers (2020-10-11T20:57:46Z) - Generalized Few-Shot Video Classification with Video Retrieval and
Feature Generation [132.82884193921535]
We argue that previous methods underestimate the importance of video feature learning and propose a two-stage approach.
We show that this simple baseline approach outperforms prior few-shot video classification methods by over 20 points on existing benchmarks.
We present two novel approaches that yield further improvement.
arXiv Detail & Related papers (2020-07-09T13:05:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.