AVA-AVD: Audio-visual Speaker Diarization in the Wild
- URL: http://arxiv.org/abs/2111.14448v2
- Date: Wed, 1 Dec 2021 11:17:30 GMT
- Title: AVA-AVD: Audio-visual Speaker Diarization in the Wild
- Authors: Eric Zhongcong Xu, Zeyang Song, Chao Feng, Mang Ye, Mike Zheng Shou
- Abstract summary: Existing audio-visual diarization datasets are mainly focused on indoor environments like meeting rooms or news studios.
We propose a novel Audio-Visual Relation Network (AVR-Net) which introduces an effective modality mask to capture discriminative information based on visibility.
- Score: 26.97787596025907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-visual speaker diarization aims at detecting ``who spoke when'' using
both auditory and visual signals. Existing audio-visual diarization datasets
are mainly focused on indoor environments like meeting rooms or news studios,
which are quite different from in-the-wild videos in many scenarios such as
movies, documentaries, and audience sitcoms. To create a testbed that can
effectively compare diarization methods on videos in the wild, we annotate the
speaker diarization labels on the AVA movie dataset and create a new benchmark
called AVA-AVD. This benchmark is challenging due to the diverse scenes,
complicated acoustic conditions, and completely off-screen speakers. Yet, how
to deal with off-screen and on-screen speakers together still remains a
critical challenge. To overcome it, we propose a novel Audio-Visual Relation
Network (AVR-Net) which introduces an effective modality mask to capture
discriminative information based on visibility. Experiments show that our
method not only outperforms state-of-the-art methods but is also more robust
as the ratio of off-screen speakers varies. Ablation studies demonstrate the
advantages of the proposed AVR-Net and especially the modality mask on
diarization. Our data and code will be made publicly available at
https://github.com/zcxu-eric/AVA-AVD.
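
The following is a minimal sketch, in PyTorch, of how a visibility-based modality mask could gate the visual branch inside a pairwise relation scorer. It only illustrates the idea described in the abstract and is not the authors' AVR-Net; the module name, dimensions, and architecture (MaskedAVFusion, AUDIO_DIM, VISUAL_DIM, HIDDEN) are assumptions.

```python
# Hypothetical sketch of a visibility-based modality mask; NOT the authors'
# AVR-Net implementation. All names and dimensions are assumptions.
import torch
import torch.nn as nn

AUDIO_DIM, VISUAL_DIM, HIDDEN = 192, 512, 256

class MaskedAVFusion(nn.Module):
    """Fuses audio and visual speaker embeddings. When the face is off-screen,
    the visual branch is replaced by a learned 'missing-modality' token, and a
    binary mask tells the relation head which modalities were actually seen."""
    def __init__(self):
        super().__init__()
        self.audio_proj = nn.Linear(AUDIO_DIM, HIDDEN)
        self.visual_proj = nn.Linear(VISUAL_DIM, HIDDEN)
        self.missing_visual = nn.Parameter(torch.zeros(HIDDEN))  # learned placeholder
        # Relation head scores whether two encoded tracks belong to the same
        # speaker; each encoded track carries a 2-dim visibility mask.
        self.relation = nn.Sequential(
            nn.Linear(2 * (HIDDEN * 2 + 2), HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, 1)
        )

    def encode(self, audio_emb, visual_emb, face_visible):
        a = self.audio_proj(audio_emb)                        # (B, HIDDEN)
        v = self.visual_proj(visual_emb)                      # (B, HIDDEN)
        vis = face_visible.float().unsqueeze(-1)              # (B, 1)
        v = vis * v + (1 - vis) * self.missing_visual         # gate by visibility
        mask = torch.cat([torch.ones_like(vis), vis], dim=-1) # audio always present
        return torch.cat([a, v, mask], dim=-1)                # (B, 2*HIDDEN + 2)

    def forward(self, audio1, visual1, vis1, audio2, visual2, vis2):
        pair = torch.cat([self.encode(audio1, visual1, vis1),
                          self.encode(audio2, visual2, vis2)], dim=-1)
        return self.relation(pair).squeeze(-1)  # similarity logit per pair

# Toy usage: score whether an off-screen utterance matches an on-screen track.
model = MaskedAVFusion()
B = 4
score = model(torch.randn(B, AUDIO_DIM), torch.randn(B, VISUAL_DIM), torch.ones(B),
              torch.randn(B, AUDIO_DIM), torch.randn(B, VISUAL_DIM), torch.zeros(B))
print(score.shape)  # torch.Size([4])
```

In this sketch the mask simply tells the relation head which modalities were observed, so pairs involving an off-screen speaker are scored from audio alone rather than from an uninformative visual embedding.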
Related papers
- SAV-SE: Scene-aware Audio-Visual Speech Enhancement with Selective State Space Model [35.60147467774199]
To the best of our knowledge, SAV-SE is the first proposal to use rich contextual information from synchronized video as auxiliary cues to indicate the type of noise, which ultimately improves speech enhancement performance.
arXiv Detail & Related papers (2024-11-12T12:23:41Z)
- Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language [77.33458847943528]
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos.
We show that DenseAV can discover the ``meaning'' of words and the ``location'' of sounds without explicit localization supervision.
arXiv Detail & Related papers (2024-06-09T03:38:21Z)
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos uses only the visual modality or only the audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Late Audio-Visual Fusion for In-The-Wild Speaker Diarization [33.0046568984949]
We propose an audio-visual diarization model which combines audio-only and visual-centric sub-systems via late fusion (a generic late-fusion sketch appears after this list).
For audio, we show that an attractor-based end-to-end system (EEND-EDA) performs remarkably well when trained with our proposed recipe of a simulated proxy dataset.
We also propose an improved version, EEND-EDA++, that uses attention in decoding and a speaker recognition loss during training to better handle the larger number of speakers.
arXiv Detail & Related papers (2022-11-02T17:20:42Z)
- AVATAR: Unconstrained Audiovisual Speech Recognition [75.17253531162608]
We propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) trained end-to-end from spectrograms and full-frame RGB.
We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise.
We also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions.
arXiv Detail & Related papers (2022-06-15T17:33:19Z)
- Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT [37.343431783936126]
This paper investigates self-supervised pre-training for audio-visual speaker representation learning.
A visual stream showing the speaker's mouth area is used alongside speech as inputs.
We conducted extensive experiments probing the effectiveness of pre-training and visual modality.
arXiv Detail & Related papers (2022-05-15T04:48:41Z)
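
As referenced in the late-fusion entry above, here is a minimal late-fusion sketch. It assumes frame-level speaker-activity scores from an audio-only system and a visual-centric system whose speaker indices have already been aligned (that alignment step is omitted); it illustrates the general late-fusion idea, not the EEND-EDA++ pipeline from that paper, and the function name, weights, and threshold are assumptions.

```python
# Generic late-fusion sketch for audio-visual diarization; NOT the paper's system.
# Assumes pre-aligned speaker indices between the audio and visual sub-systems.
import numpy as np

def late_fuse(audio_probs, visual_probs, visual_valid, w_visual=0.5, threshold=0.5):
    """audio_probs, visual_probs: (frames, speakers) activity probabilities.
    visual_valid: boolean (frames, speakers), True where a face track exists.
    Returns a boolean diarization matrix (frames, speakers)."""
    fused = np.where(
        visual_valid,
        (1.0 - w_visual) * audio_probs + w_visual * visual_probs,  # both cues available
        audio_probs,                                               # off-screen: audio only
    )
    return fused > threshold

# Toy usage with random scores.
T, S = 100, 3
audio = np.random.rand(T, S)
visual = np.random.rand(T, S)
valid = np.random.rand(T, S) > 0.4   # some speakers are off-screen in some frames
labels = late_fuse(audio, visual, valid)
print(labels.shape, labels.dtype)    # (100, 3) bool
```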
This list is automatically generated from the titles and abstracts of the papers on this site.