APES: Audiovisual Person Search in Untrimmed Video
- URL: http://arxiv.org/abs/2106.01667v1
- Date: Thu, 3 Jun 2021 08:16:42 GMT
- Title: APES: Audiovisual Person Search in Untrimmed Video
- Authors: Juan Leon Alcazar, Long Mai, Federico Perazzi, Joon-Young Lee, Pablo
Arbelaez, Bernard Ghanem, and Fabian Caba Heilbron
- Abstract summary: We present the Audiovisual Person Search dataset (APES)
APES contains over 1.9K identities labeled across 36 hours of video.
A key property of APES is that it includes dense temporal annotations that link faces to speech segments of the same identity.
- Score: 87.4124877066541
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans are arguably among the most important subjects in video streams;
many real-world applications, such as video summarization or video editing workflows,
often require the automatic search and retrieval of a person of interest.
Despite tremendous efforts in the person re-identification and retrieval
domains, few works have developed audiovisual search strategies. In this paper,
we present the Audiovisual Person Search dataset (APES), a new dataset composed
of untrimmed videos whose audio (voices) and visual (faces) streams are densely
annotated. APES contains over 1.9K identities labeled across 36 hours of video,
making it the largest dataset available for untrimmed audiovisual person
search. A key property of APES is that it includes dense temporal annotations
that link faces to speech segments of the same identity. To showcase the
potential of our new dataset, we propose an audiovisual baseline and benchmark
for person retrieval. Our study shows that modeling audiovisual cues benefits
the recognition of people's identities. To enable reproducibility and promote
future research, the dataset annotations and baseline code are available at:
https://github.com/fuankarion/audiovisual-person-search
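The abstract describes the audiovisual baseline only at a high level. As an illustration of the general idea that fusing face and voice cues helps identity retrieval, the hypothetical Python sketch below embeds each modality separately, concatenates the normalized embeddings, and ranks gallery tracks by cosine similarity to a query. The function names, embedding sizes, and data layout are assumptions for this sketch; they are not the APES annotation schema or the baseline code from the repository above.
```python
# Hypothetical late-fusion sketch for audiovisual person retrieval.
# All names and dimensions are illustrative placeholders, not the APES baseline.
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x) + 1e-8)

def fuse(face_emb: np.ndarray, voice_emb: np.ndarray) -> np.ndarray:
    """Late fusion: concatenate per-modality embeddings into a single descriptor."""
    return l2_normalize(np.concatenate([l2_normalize(face_emb), l2_normalize(voice_emb)]))

def rank_gallery(query: np.ndarray, gallery: dict[str, np.ndarray]) -> list[tuple[str, float]]:
    """Rank gallery tracks by cosine similarity to the query descriptor."""
    scores = {track_id: float(query @ emb) for track_id, emb in gallery.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage: random vectors stand in for real face/voice embeddings.
rng = np.random.default_rng(0)
query = fuse(rng.normal(size=512), rng.normal(size=192))
gallery = {f"track_{i:03d}": fuse(rng.normal(size=512), rng.normal(size=192)) for i in range(5)}
print(rank_gallery(query, gallery)[:3])  # top-3 most similar gallery tracks
```
In practice, the per-modality embeddings would come from pretrained face and speaker recognition encoders applied to the face tracks and speech segments that the APES annotations link by identity.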
Related papers
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Audiovisual Moments in Time: A Large-Scale Annotated Dataset of Audiovisual Actions [1.1510009152620668]
We present Audiovisual Moments in Time (AVMIT), a large-scale dataset of audiovisual action events.
The dataset includes the annotation of 57,177 audiovisual videos, each independently evaluated by 3 of 11 trained participants.
arXiv Detail & Related papers (2023-08-18T17:13:45Z)
- A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key! [25.436683033432086]
Video saliency detection (VSD) aims to quickly locate the most attention-grabbing objects/things/patterns in a given video clip.
This paper provides an extensive review to bridge the gap between audio-visual fusion and saliency detection.
arXiv Detail & Related papers (2022-06-20T07:25:13Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model [96.24038430433885]
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works.
arXiv Detail & Related papers (2021-03-29T09:09:39Z)
- QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
- AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs.
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
arXiv Detail & Related papers (2020-06-16T14:38:03Z)