Spot the conversation: speaker diarisation in the wild
- URL: http://arxiv.org/abs/2007.01216v3
- Date: Sun, 15 Aug 2021 04:01:36 GMT
- Title: Spot the conversation: speaker diarisation in the wild
- Authors: Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras,
Andrew Zisserman
- Abstract summary: First, we propose an automatic audio-visual diarisation method for YouTube videos.
Second, we integrate our method into a semi-automatic dataset creation pipeline.
Third, we use this pipeline to create a large-scale diarisation dataset called VoxConverse.
- Score: 108.61222789195209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of this paper is speaker diarisation of videos collected 'in the
wild'. We make three key contributions. First, we propose an automatic
audio-visual diarisation method for YouTube videos. Our method consists of
active speaker detection using audio-visual methods and speaker verification
using self-enrolled speaker models. Second, we integrate our method into a
semi-automatic dataset creation pipeline which significantly reduces the number
of hours required to annotate videos with diarisation labels. Finally, we use
this pipeline to create a large-scale diarisation dataset called VoxConverse,
collected from 'in the wild' videos, which we will release publicly to the
research community. Our dataset consists of overlapping speech, a large and
diverse speaker pool, and challenging background conditions.
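As a rough illustration of the pipeline described in the abstract (audio-visual active speaker detection followed by speaker verification against self-enrolled speaker models), the sketch below clusters per-segment speaker embeddings against running-mean enrolled models using a cosine-similarity threshold. The embedding extractor, the active-speaker flags, and the threshold value are all assumptions for illustration; this is not the authors' implementation.

```python
import numpy as np

def diarise(segment_embeddings, active_speaker_flags, threshold=0.6):
    """Minimal sketch of diarisation with self-enrolled speaker models.

    segment_embeddings: (N, D) speaker embeddings, one per speech segment
                        (the extractor is an assumption, e.g. any d-vector model).
    active_speaker_flags: length-N booleans from audio-visual active speaker
                          detection; True means an on-screen face is speaking.
    Returns one speaker label per segment.
    """
    enrolled = []          # running sums of embeddings, one per enrolled speaker
    labels = []
    for emb, is_active in zip(segment_embeddings, active_speaker_flags):
        emb = emb / np.linalg.norm(emb)
        # Cosine similarity against every self-enrolled speaker model.
        sims = [float(emb @ (m / np.linalg.norm(m))) for m in enrolled]
        if sims and max(sims) >= threshold:
            spk = int(np.argmax(sims))
            enrolled[spk] = enrolled[spk] + emb      # update the matched model
        elif is_active:
            # Visually confirmed speaker with no matching model: enrol a new one.
            enrolled.append(emb.copy())
            spk = len(enrolled) - 1
        else:
            # No confident match and no visual evidence: fall back to the closest
            # model if any exist, otherwise start a new cluster.
            spk = int(np.argmax(sims)) if sims else len(enrolled)
            if spk == len(enrolled):
                enrolled.append(emb.copy())
        labels.append(spk)
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embs = rng.normal(size=(10, 128))                # stand-in for real embeddings
    flags = rng.random(10) > 0.5
    print(diarise(embs, flags))
```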
Related papers
- Audio-Visual Talker Localization in Video for Spatial Sound Reproduction [3.2472293599354596]
In this work, we detect and localise the active speaker in the video.
We find that the two modalities complement each other.
Future investigations will assess the robustness of the model in noisy and highly reverberant environments.
arXiv Detail & Related papers (2024-06-01T16:47:07Z)
- REWIND Dataset: Privacy-preserving Speaking Status Segmentation from Multimodal Body Movement Signals in the Wild [14.5263556841263]
We present the first publicly available multimodal dataset with high-quality individual speech recordings of 33 subjects at a professional networking event.
In all cases we predict a 20 Hz binary speaking-status signal extracted from the audio, a time resolution not available in previous datasets (see the sketch after this list).
arXiv Detail & Related papers (2024-03-02T15:14:58Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis [123.11530365315677]
Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production.
In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production.
arXiv Detail & Related papers (2023-08-31T15:41:40Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper, we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24 kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos [54.08224321456871]
The system combines multiple component models to produce a video of the original speaker speaking in the target language.
The pipeline starts with automatic speech recognition including emphasis detection, followed by a translation model.
The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model.
arXiv Detail & Related papers (2022-06-09T14:15:37Z)
- Look Who's Talking: Active Speaker Detection in the Wild [30.22352874520012]
We present a novel audio-visual dataset for active speaker detection in the wild.
The Active Speakers in the Wild (ASW) dataset contains videos and co-occurring speech segments with dense speech activity labels.
Face tracks are extracted from the videos and active segments are annotated based on the timestamps of VoxConverse in a semi-automatic way.
arXiv Detail & Related papers (2021-08-17T14:16:56Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)
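For the 20 Hz speaking-status signal mentioned in the REWIND entry above, the sketch below shows one minimal way such a signal could be derived from an individual audio track, assuming simple frame-energy voice activity detection with a hand-picked threshold; it is only an illustrative baseline, not the dataset's actual annotation procedure.

```python
import numpy as np

def speaking_status_20hz(waveform, sample_rate=16000, energy_threshold=1e-3):
    """Illustrative 20 Hz binary speaking-status signal via frame energy.

    waveform: 1-D float array holding an individual's audio track.
    Returns one 0/1 value per 50 ms frame (i.e. 20 values per second).
    The energy threshold is an assumption; a real pipeline would use a
    trained voice activity detector instead.
    """
    frame_len = sample_rate // 20                  # 50 ms frames -> 20 Hz output
    n_frames = len(waveform) // frame_len
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)          # mean squared amplitude
    return (energy > energy_threshold).astype(np.int8)

if __name__ == "__main__":
    sr = 16000
    t = np.arange(2 * sr) / sr                     # two seconds of samples
    speech = 0.1 * np.sin(2 * np.pi * 220 * t)     # stand-in "speech" tone
    silence = np.zeros(2 * sr)
    status = speaking_status_20hz(np.concatenate([speech, silence]), sr)
    print(status)                                  # 40 ones followed by 40 zeros
```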
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.