WASD: A Wilder Active Speaker Detection Dataset
- URL: http://arxiv.org/abs/2303.05321v1
- Date: Thu, 9 Mar 2023 15:13:22 GMT
- Title: WASD: A Wilder Active Speaker Detection Dataset
- Authors: Tiago Roxo, Joana C. Costa, Pedro R. M. Inácio, Hugo Proença
- Abstract summary: Current Active Speaker Detection (ASD) models achieve great results on AVA-ActiveSpeaker (AVA) using only sound and facial features.
We propose a Wilder Active Speaker Detection (WASD) dataset, with increased difficulty by targeting the two key components of current ASD: audio and face.
We select state-of-the-art models and assess their performance in two groups of WASD: Easy (cooperative settings) and Hard (audio and/or face are specifically degraded)
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Current Active Speaker Detection (ASD) models achieve great results on
AVA-ActiveSpeaker (AVA), using only sound and facial features. Although this
approach is applicable in movie setups (AVA), it is not suited for less
constrained conditions. To demonstrate this limitation, we propose a Wilder
Active Speaker Detection (WASD) dataset, with increased difficulty by targeting
the two key components of current ASD: audio and face. Grouped into 5
categories, ranging from optimal conditions to surveillance settings, WASD
contains incremental challenges for ASD with tactical impairment of audio and
face data. We select state-of-the-art models and assess their performance in
two groups of WASD: Easy (cooperative settings) and Hard (audio and/or face are
specifically degraded). The results show that: 1) AVA-trained models maintain
state-of-the-art performance in the WASD Easy group while underperforming in
the Hard one, showing 2) the similarity between AVA and Easy data; and 3)
training on WASD does not raise model performance to AVA levels, particularly
under audio impairment and surveillance settings. This shows that AVA does not
prepare models for wild ASD and that current approaches are subpar for dealing
with such conditions. The proposed dataset also contains body data annotations
to provide a new source of information for ASD, and is available at
https://github.com/Tiago-Roxo/WASD.
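To make the Easy/Hard evaluation protocol concrete, the following is a minimal Python sketch of averaging per-category mean Average Precision into the two groups; the category names beyond the "optimal conditions" and "surveillance settings" mentioned above, and the exact Easy/Hard split, are assumptions rather than the official WASD tooling.
```python
# Minimal sketch of a per-group evaluation (assumed category names and split;
# the official WASD release at https://github.com/Tiago-Roxo/WASD may differ).
from statistics import mean

GROUPS = {
    "Easy": ["Optimal Conditions", "Speech Impairment"],                        # cooperative settings
    "Hard": ["Face Occlusion", "Human Voice Noise", "Surveillance Settings"],   # degraded audio/face
}

def group_scores(ap_by_category):
    """Average per-category average precision into the Easy/Hard groups."""
    return {group: mean(ap_by_category[c] for c in cats) for group, cats in GROUPS.items()}

# Illustrative numbers only, not results from the paper.
ap = {
    "Optimal Conditions": 0.94,
    "Speech Impairment": 0.91,
    "Face Occlusion": 0.80,
    "Human Voice Noise": 0.77,
    "Surveillance Settings": 0.65,
}
print(group_scores(ap))  # e.g. {'Easy': 0.925, 'Hard': 0.74}
```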
Related papers
- Noise-Robust Target-Speaker Voice Activity Detection Through Self-Supervised Pretraining [21.26555178371168]
Target-Speaker Voice Activity Detection (TS-VAD) is the task of detecting the presence of speech from a known target-speaker in an audio frame.
Deep neural network-based models have shown good performance in this task.
We propose a causal, Self-Supervised Learning (SSL) pretraining framework to enhance TS-VAD performance in noisy conditions.
arXiv Detail & Related papers (2025-01-06T18:00:14Z) - ASDnB: Merging Face with Body Cues For Robust Active Speaker Detection [13.154512864498912]
We propose ASDnB, a model that singularly integrates face with body information.
Our approach splits 3D convolution into 2D and 1D to reduce computation cost without loss of performance.
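As an illustration of that factorization, here is a minimal PyTorch sketch of a generic (2+1)D block, i.e. a 2D spatial convolution per frame followed by a 1D temporal convolution across frames; this shows only the general idea, not the authors' exact ASDnB layers.
```python
# Minimal sketch of splitting a 3D convolution into 2D (spatial) + 1D (temporal);
# dimensions and layer choices are illustrative, not the ASDnB implementation.
import torch
import torch.nn as nn

class Factorized3DConv(nn.Module):
    def __init__(self, in_ch, out_ch, k_spatial=3, k_temporal=3):
        super().__init__()
        # Kernel (1, k, k): convolve over height/width only.
        self.spatial = nn.Conv3d(in_ch, out_ch, (1, k_spatial, k_spatial),
                                 padding=(0, k_spatial // 2, k_spatial // 2))
        # Kernel (k, 1, 1): convolve over time only.
        self.temporal = nn.Conv3d(out_ch, out_ch, (k_temporal, 1, 1),
                                  padding=(k_temporal // 2, 0, 0))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        return self.act(self.temporal(self.act(self.spatial(x))))

clip = torch.randn(2, 3, 16, 112, 112)        # 16 RGB frames of 112x112
print(Factorized3DConv(3, 64)(clip).shape)    # torch.Size([2, 64, 16, 112, 112])
```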
arXiv Detail & Related papers (2024-12-11T18:12:06Z) - BIAS: A Body-based Interpretable Active Speaker Approach [13.154512864498912]
BIAS is a model that combines audio, face, and body information to accurately predict speakers in varying/challenging conditions.
Results show that BIAS is state-of-the-art in challenging conditions where body-based features are of utmost importance.
BIAS interpretability also shows which features/aspects are most relevant to ASD prediction in varying settings.
arXiv Detail & Related papers (2024-12-06T16:08:09Z) - Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
Speaker-level variance-regularized spectral basis embedding (VR-SBE) features exploit a special regularization term to enforce homogeneity of speaker features during adaptation.
Feature-based learning hidden unit contributions (f-LHUC), conditioned on VR-SBE features, are shown to be insensitive to speaker-level data quantity in test-time adaptation.
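As a rough sketch of how LHUC-style scaling can be conditioned on speaker-level features (my reading of the summary, with assumed dimensions and naming, not the paper's architecture):
```python
# Minimal sketch: hidden-unit scaling factors predicted from a speaker-level
# feature vector rather than learned separately per speaker (assumed interface).
import torch
import torch.nn as nn

class FeatureConditionedLHUC(nn.Module):
    def __init__(self, hidden_dim, speaker_feat_dim):
        super().__init__()
        # Maps speaker features to one scaling parameter per hidden unit.
        self.scale_predictor = nn.Linear(speaker_feat_dim, hidden_dim)

    def forward(self, hidden, speaker_feat):
        # hidden: (batch, time, hidden_dim); speaker_feat: (batch, speaker_feat_dim)
        # Classic LHUC re-scaling keeps amplitudes in (0, 2) via 2 * sigmoid(r).
        scales = 2.0 * torch.sigmoid(self.scale_predictor(speaker_feat))
        return hidden * scales.unsqueeze(1)

h = torch.randn(4, 50, 256)   # hidden activations of an acoustic model layer
s = torch.randn(4, 32)        # speaker-level feature vector per utterance
print(FeatureConditionedLHUC(256, 32)(h, s).shape)  # torch.Size([4, 50, 256])
```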
arXiv Detail & Related papers (2024-07-08T18:20:24Z) - Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z) - AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z) - Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
We investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size.
We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions.
The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
arXiv Detail & Related papers (2023-03-25T00:37:34Z) - Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection [88.74863771919445]
We reveal the vulnerability of AVASD models under audio-only, visual-only, and audio-visual adversarial attacks.
We also propose a novel audio-visual interaction loss (AVIL) to make it more difficult for attackers to find feasible adversarial examples.
arXiv Detail & Related papers (2022-10-03T08:10:12Z) - AVA-AVD: Audio-visual Speaker Diarization in the Wild [26.97787596025907]
Existing audio-visual diarization datasets are mainly focused on indoor environments like meeting rooms or news studios.
We propose a novel Audio-Visual Relation Network (AVR-Net) which introduces an effective modality mask to capture discriminative information based on visibility.
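One plausible reading of that visibility-based modality mask, as a minimal sketch with an assumed interface (not the AVR-Net code):
```python
# Minimal sketch: zero out the visual embedding when the face is not visible and
# pass the mask along, so a downstream relation network knows which modality
# was available (assumed interface, not the AVR-Net implementation).
import torch

def fuse_with_modality_mask(audio_emb, visual_emb, face_visible):
    # audio_emb, visual_emb: (batch, dim); face_visible: (batch,) booleans
    mask = face_visible.float().unsqueeze(-1)              # (batch, 1)
    return torch.cat([audio_emb, visual_emb * mask, mask], dim=-1)

fused = fuse_with_modality_mask(torch.randn(3, 128), torch.randn(3, 128),
                                torch.tensor([True, False, True]))
print(fused.shape)  # torch.Size([3, 257])
```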
arXiv Detail & Related papers (2021-11-29T11:02:41Z) - Learning Visual Voice Activity Detection with an Automatically Annotated Dataset [20.725871972294236]
Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not.
We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow.
We introduce a novel methodology to automatically create and annotate very large datasets in-the-wild -- WildVVAD.
arXiv Detail & Related papers (2020-09-23T15:12:24Z) - Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker at each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
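A minimal sketch of that per-speaker, per-frame formulation, with assumed feature and embedding dimensions (not the CHiME-6 system itself):
```python
# Minimal sketch: pair every acoustic frame with each known target-speaker
# embedding and predict one activity probability per speaker per frame
# (shapes and layers are assumptions, not the paper's network).
import torch
import torch.nn as nn

class TinyTSVAD(nn.Module):
    def __init__(self, acoustic_dim=40, spk_emb_dim=100, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(acoustic_dim + spk_emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats, spk_embs):
        # feats: (batch, time, acoustic_dim); spk_embs: (batch, num_speakers, spk_emb_dim)
        b, t, _ = feats.shape
        n = spk_embs.shape[1]
        x = torch.cat(
            [feats.unsqueeze(1).expand(b, n, t, feats.shape[-1]),
             spk_embs.unsqueeze(2).expand(b, n, t, spk_embs.shape[-1])],
            dim=-1,
        ).reshape(b * n, t, -1)
        h, _ = self.encoder(x)
        # (batch, num_speakers, time) speech-activity probabilities.
        return torch.sigmoid(self.head(h)).reshape(b, n, t)

probs = TinyTSVAD()(torch.randn(2, 200, 40), torch.randn(2, 4, 100))
print(probs.shape)  # torch.Size([2, 4, 200])
```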
arXiv Detail & Related papers (2020-05-14T21:24:56Z)