BIAS: A Body-based Interpretable Active Speaker Approach
- URL: http://arxiv.org/abs/2412.05150v1
- Date: Fri, 06 Dec 2024 16:08:09 GMT
- Title: BIAS: A Body-based Interpretable Active Speaker Approach
- Authors: Tiago Roxo, Joana C. Costa, Pedro R. M. Inácio, Hugo Proença
- Abstract summary: BIAS is a model that combines audio, face, and body information to accurately predict speakers in varying/challenging conditions.
Results show that BIAS is state-of-the-art in challenging conditions where body-based features are of utmost importance.
BIAS interpretability also shows the features/aspects more relevant towards ASD prediction in varying settings.
- Score: 13.154512864498912
- Abstract: State-of-the-art Active Speaker Detection (ASD) approaches heavily rely on audio and facial features to perform, which is not a sustainable approach in wild scenarios. Although these methods achieve good results in the standard AVA-ActiveSpeaker set, a recent wilder ASD dataset (WASD) showed the limitations of such models and raised the need for new approaches. As such, we propose BIAS, a model that, for the first time, combines audio, face, and body information, to accurately predict active speakers in varying/challenging conditions. Additionally, we design BIAS to provide interpretability by proposing a novel use for Squeeze-and-Excitation blocks, namely in attention heatmaps creation and feature importance assessment. For a full interpretability setup, we annotate an ASD-related actions dataset (ASD-Text) to finetune a ViT-GPT2 for text scene description to complement BIAS interpretability. The results show that BIAS is state-of-the-art in challenging conditions where body-based features are of utmost importance (Columbia, open-settings, and WASD), and yields competitive results in AVA-ActiveSpeaker, where face is more influential than body for ASD. BIAS interpretability also shows the features/aspects more relevant towards ASD prediction in varying settings, making it a strong baseline for further developments in interpretable ASD models, and is available at https://github.com/Tiago-Roxo/BIAS.
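The abstract mentions using Squeeze-and-Excitation (SE) blocks for feature importance assessment. As a rough illustration of why SE blocks lend themselves to that, the sketch below implements a generic SE block in plain Python and reads its per-channel gates as importance scores. This is not the BIAS implementation: the function name `squeeze_excite`, the tensor sizes, the reduction ratio `r`, and the random weights are all assumptions made for the example, and interpreting the gates as importance is only one plausible reading of the idea.

```python
import math
import random

def squeeze_excite(features, w1, w2):
    """Generic Squeeze-and-Excitation over C channels of H x W values.

    Returns the rescaled features and the per-channel gates in (0, 1).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in features]
    # Excitation: bottleneck layer with ReLU, then a sigmoid-gated layer.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: every value in a channel is multiplied by that channel's gate.
    scaled = [[[v * g for v in row] for row in ch]
              for ch, g in zip(features, gates)]
    return scaled, gates

# Hypothetical sizes and random weights, chosen only for the example.
random.seed(0)
C, H, W, r = 8, 4, 4, 2
features = [[[random.gauss(0, 1) for _ in range(W)] for _ in range(H)]
            for _ in range(C)]
w1 = [[random.gauss(0, 0.1) for _ in range(C)] for _ in range(C // r)]
w2 = [[random.gauss(0, 0.1) for _ in range(C // r)] for _ in range(C)]

scaled, gates = squeeze_excite(features, w1, w2)

# Channels sorted from highest to lowest gate: a simple importance ranking.
ranking = sorted(range(C), key=lambda c: gates[c], reverse=True)
```

In a trained model the gates depend on the input, so the ranking can change from one scene to another; with the random weights used here it carries no meaning beyond showing the mechanics.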
Related papers
- ASDnB: Merging Face with Body Cues For Robust Active Speaker Detection [13.154512864498912]
We propose ASDnB, a model that singularly integrates face with body information.
Our approach splits 3D convolution into 2D and 1D to reduce computation cost without loss of performance.
arXiv Detail & Related papers (2024-12-11T18:12:06Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- Advancing Test-Time Adaptation in Wild Acoustic Test Settings [26.05732574338255]
Speech signals follow short-term consistency, requiring specialized adaptation strategies.
We propose a novel wild acoustic TTA method tailored for ASR fine-tuned acoustic foundation models.
Our approach outperforms existing baselines under various wild acoustic test settings.
arXiv Detail & Related papers (2023-10-14T06:22:08Z)
- WASD: A Wilder Active Speaker Detection Dataset [0.0]
Current Active Speaker Detection (ASD) models achieve great results on AVA-ActiveSpeaker (AVA) using only sound and facial features.
We propose a Wilder Active Speaker Detection (WASD) dataset, with increased difficulty by targeting the two key components of current ASD: audio and face.
We select state-of-the-art models and assess their performance in two groups of WASD: Easy (cooperative settings) and Hard (audio and/or face are specifically degraded).
arXiv Detail & Related papers (2023-03-09T15:13:22Z)
- Anticipating the Unseen Discrepancy for Vision and Language Navigation [63.399180481818405]
Vision-Language Navigation requires the agent to follow natural language instructions to reach a specific target.
The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well.
We propose Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS) that learns to generalize to unseen environments via encouraging test-time visual consistency.
arXiv Detail & Related papers (2022-09-10T19:04:40Z)
- Incorporating Dynamic Semantics into Pre-Trained Language Model for Aspect-based Sentiment Analysis [67.41078214475341]
We propose Dynamic Re-weighting BERT (DR-BERT) to learn dynamic aspect-oriented semantics for ABSA.
Specifically, we first take the Stack-BERT layers as a primary encoder to grasp the overall semantics of the sentence.
We then fine-tune it by incorporating a lightweight Dynamic Re-weighting Adapter (DRA).
arXiv Detail & Related papers (2022-03-30T14:48:46Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that detect samples causing oversensitivity and overstability with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)