Look Who's Talking: Active Speaker Detection in the Wild
- URL: http://arxiv.org/abs/2108.07640v1
- Date: Tue, 17 Aug 2021 14:16:56 GMT
- Title: Look Who's Talking: Active Speaker Detection in the Wild
- Authors: You Jin Kim, Hee-Soo Heo, Soyeon Choe, Soo-Whan Chung, Yoohwan Kwon,
Bong-Jin Lee, Youngki Kwon, Joon Son Chung
- Abstract summary: We present a novel audio-visual dataset for active speaker detection in the wild.
Active Speakers in the Wild (ASW) dataset contains videos and co-occurring speech segments with dense speech activity labels.
Face tracks are extracted from the videos and active segments are annotated based on the timestamps of VoxConverse in a semi-automatic way.
- Score: 30.22352874520012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we present a novel audio-visual dataset for active speaker
detection in the wild. A speaker is considered active when his or her face is
visible and the voice is audible simultaneously. Although active speaker
detection is a crucial pre-processing step for many audio-visual tasks, there
is no existing dataset of natural human speech to evaluate the performance of
active speaker detection. We therefore curate the Active Speakers in the Wild
(ASW) dataset which contains videos and co-occurring speech segments with dense
speech activity labels. Videos and timestamps of audible segments are parsed
and adopted from VoxConverse, an existing speaker diarisation dataset that
consists of videos in the wild. Face tracks are extracted from the videos and
active segments are annotated based on the timestamps of VoxConverse in a
semi-automatic way. Two reference systems, a self-supervised system and a fully
supervised one, are evaluated on the dataset to provide the baseline
performances of ASW. Cross-domain evaluation is conducted in order to show the
negative effect of dubbed videos in the training data.
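The labelling criterion above (a speaker counts as active only when the face is visible and the voice is audible at the same time) can be illustrated with a short sketch. The snippet below is not the authors' tooling; the FaceTrack and label_face_track names are hypothetical, and it simply assumes VoxConverse-style speaker-attributed speech timestamps, intersecting a face track's visible interval with the matching speaker's speech segments to produce frame-level 0/1 activity labels.
```python
# Illustrative sketch (not the authors' code): derive frame-level active-speaker
# labels from a face-track interval and VoxConverse-style speech timestamps.
# All names below (FaceTrack, label_face_track) are hypothetical.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FaceTrack:
    speaker_id: str   # identity matched to a diarisation speaker label
    start: float      # seconds, first frame of the track
    end: float        # seconds, last frame of the track

def label_face_track(track: FaceTrack,
                     speech_segments: List[Tuple[str, float, float]],
                     fps: float = 25.0) -> List[int]:
    """Return one 0/1 label per video frame of the track: 1 when the tracked
    face's owner is speaking (voice audible) while the face is visible."""
    num_frames = int(round((track.end - track.start) * fps)) + 1
    labels = [0] * num_frames
    for spk, seg_start, seg_end in speech_segments:
        if spk != track.speaker_id:
            continue  # speech from another speaker does not activate this face
        for i in range(num_frames):
            t = track.start + i / fps
            if seg_start <= t <= seg_end:
                labels[i] = 1  # face visible and voice audible -> active
    return labels

# Example: a 2-second face track overlapped by one speech segment of its speaker.
track = FaceTrack(speaker_id="spk01", start=10.0, end=12.0)
segments = [("spk01", 10.8, 11.4), ("spk02", 9.0, 10.5)]
print(sum(label_face_track(track, segments)))  # number of active frames
```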
Related papers
- Audio-Visual Talker Localization in Video for Spatial Sound Reproduction [3.2472293599354596]
In this research, we detect and locate the active speaker in the video.
We found that the two modalities complement each other.
Future investigations will assess the robustness of the model in noisy and highly reverberant environments.
arXiv Detail & Related papers (2024-06-01T16:47:07Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection [37.28070242751129]
Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality.
We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection.
arXiv Detail & Related papers (2022-12-01T14:46:00Z)
- Unsupervised active speaker detection in media content using cross-modal information [37.28070242751129]
We present a cross-modal unsupervised framework for active speaker detection in media content such as TV shows and movies.
We leverage speaker identity information from speech and faces, and formulate active speaker detection as a speech-face assignment task (a generic sketch of this formulation appears after the list below).
We show competitive performance to state-of-the-art fully supervised methods.
arXiv Detail & Related papers (2022-09-24T00:51:38Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
- Spot the conversation: speaker diarisation in the wild [108.61222789195209]
We propose an automatic audio-visual diarisation method for YouTube videos.
Second, we integrate our method into a semi-automatic dataset creation pipeline.
Third, we use this pipeline to create a large-scale diarisation dataset called VoxConverse.
arXiv Detail & Related papers (2020-07-02T15:55:54Z)
- Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already improves active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z)
- Cross modal video representations for weakly supervised active speaker localization [39.67239953795999]
A cross-modal neural network for learning visual representations is presented.
We present a weakly supervised system for the task of localizing active speakers in movie content.
We also demonstrate state-of-the-art performance for the task of voice activity detection in an audio-visual framework.
arXiv Detail & Related papers (2020-03-09T18:50:50Z)
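The related entry "Unsupervised active speaker detection in media content using cross-modal information" frames active speaker detection as a speech-face assignment problem. The sketch below is a generic illustration of that idea, not the cited paper's implementation: each speech segment's speaker embedding is matched to the most similar face-identity embedding, with a hypothetical assign_speech_to_faces helper and an assumed similarity threshold.
```python
# Generic illustration (assumption, not the cited paper's method): speech-face
# assignment via cosine similarity between speaker and face-identity embeddings.
import numpy as np

def assign_speech_to_faces(speech_embs: np.ndarray,
                           face_embs: np.ndarray,
                           threshold: float = 0.5) -> list:
    """speech_embs: (S, D) speaker embeddings, one per speech segment.
    face_embs: (F, D) identity embeddings, one per visible face track.
    Returns, for each speech segment, the index of the assigned face track,
    or -1 if no face is similar enough (e.g. an off-screen speaker)."""
    # L2-normalise so the dot product equals cosine similarity
    s = speech_embs / np.linalg.norm(speech_embs, axis=1, keepdims=True)
    f = face_embs / np.linalg.norm(face_embs, axis=1, keepdims=True)
    sims = s @ f.T                 # (S, F) cosine similarity matrix
    best = sims.argmax(axis=1)     # most similar face per speech segment
    return [int(b) if sims[i, b] >= threshold else -1
            for i, b in enumerate(best)]

# Toy example with random vectors standing in for real speech/face encoders;
# random embeddings are usually dissimilar, so this will likely print -1s.
rng = np.random.default_rng(0)
speech = rng.normal(size=(3, 128))
faces = rng.normal(size=(2, 128))
print(assign_speech_to_faces(speech, faces))
```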