End-to-End Active Speaker Detection
- URL: http://arxiv.org/abs/2203.14250v1
- Date: Sun, 27 Mar 2022 08:55:28 GMT
- Title: End-to-End Active Speaker Detection
- Authors: Juan Leon Alcazar, Moritz Cordes, Chen Zhao, and Bernard Ghanem
- Abstract summary: We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
- Score: 58.7097258722291
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in the Active Speaker Detection (ASD) problem build upon a
two-stage process: feature extraction and spatio-temporal context aggregation.
In this paper, we propose an end-to-end ASD workflow where feature learning and
contextual predictions are jointly learned. Our end-to-end trainable network
simultaneously learns multi-modal embeddings and aggregates spatio-temporal
context. This results in more suitable feature representations and improved
performance in the ASD task. We also introduce interleaved graph neural network
(iGNN) blocks, which split the message passing according to the main sources of
context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more
suitable for ASD, resulting in state-of-the-art performance.
Finally, we design a weakly-supervised strategy, which demonstrates that the
ASD problem can also be approached by utilizing audiovisual data but relying
exclusively on audio annotations. We achieve this by modelling the direct
relationship between the audio signal and the possible sound sources
(speakers), as well as introducing a contrastive loss.
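The idea of splitting message passing by source of context can be illustrated with a minimal sketch. This is not the authors' implementation: the node layout `(frame, speaker)`, the helper names `build_edges`, `mean_aggregate` and `interleaved_block`, the spatial-then-temporal order, and the use of plain averaging in place of learned layers are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (frame index, speaker index); hypothetical layout
Features = Dict[Node, List[float]]
Edges = Dict[Node, List[Node]]


def build_edges(num_frames: int, num_speakers: int) -> Tuple[Edges, Edges]:
    """Split edges by context source: same frame (spatial) vs. same speaker (temporal)."""
    spatial: Edges = {}
    temporal: Edges = {}
    for t in range(num_frames):
        for s in range(num_speakers):
            # Spatial context: the other speakers visible in the same frame.
            spatial[(t, s)] = [(t, o) for o in range(num_speakers) if o != s]
            # Temporal context: the same speaker in the adjacent frames.
            temporal[(t, s)] = [(u, s) for u in (t - 1, t + 1) if 0 <= u < num_frames]
    return spatial, temporal


def mean_aggregate(feats: Features, neighbours: Edges) -> Features:
    """One message-passing step: average each node with its neighbours.

    Plain averaging stands in for the learned layers a real model would use.
    """
    out: Features = {}
    for node, vec in feats.items():
        group = [vec] + [feats[n] for n in neighbours.get(node, [])]
        out[node] = [sum(col) / len(group) for col in zip(*group)]
    return out


def interleaved_block(feats: Features, spatial: Edges, temporal: Edges) -> Features:
    """Alternate the two steps instead of mixing all edges in a single pass."""
    feats = mean_aggregate(feats, spatial)   # assumed order: spatial first
    return mean_aggregate(feats, temporal)   # then temporal


if __name__ == "__main__":
    spatial, temporal = build_edges(num_frames=2, num_speakers=2)
    feats = {(0, 0): [1.0], (0, 1): [3.0], (1, 0): [5.0], (1, 1): [7.0]}
    print(interleaved_block(feats, spatial, temporal))
```

Keeping the two edge sets separate means each step aggregates only one kind of context, so several such blocks can be stacked to widen the spatial and temporal reach gradually.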
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- Progressive Confident Masking Attention Network for Audio-Visual Segmentation [8.591836399688052]
A challenging problem known as Audio-Visual Segmentation (AVS) has emerged, intending to produce segmentation maps for sounding objects within a scene.
We introduce a novel Progressive Confident Masking Attention Network (PMCANet).
It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames.
arXiv Detail & Related papers (2024-06-04T14:21:41Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- End-to-End Integration of Speech Separation and Voice Activity Detection for Low-Latency Diarization of Telephone Conversations [13.020158123538138]
Speech separation guided diarization (SSGD) performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream.
We consider three state-of-the-art speech separation (SSep) algorithms and study their performance in online and offline scenarios.
We show that our best model achieves 8.8% DER on CALLHOME, which outperforms the current state-of-the-art end-to-end neural diarization model.
arXiv Detail & Related papers (2023-03-21T16:33:56Z)
- Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection [21.512786675773675]
Active speaker detection in videos with multiple speakers is a challenging task.
We present SPELL, a novel spatial-temporal graph learning framework.
SPELL is able to reason over long temporal contexts for all nodes without relying on computationally expensive fully connected graph neural networks.
arXiv Detail & Related papers (2022-07-15T23:43:17Z)
- STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model.
We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex SLU dataset to be publicly available.
In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z)
- Rethinking Audio-visual Synchronization for Active Speaker Detection [62.95962896690992]
Existing research on active speaker detection (ASD) does not agree on the definition of active speakers.
We propose a cross-modal contrastive learning strategy and apply positional encoding in attention modules for supervised ASD models to leverage the synchronization cue.
Experimental results suggest that our model can successfully detect unsynchronized speaking as not speaking, addressing the limitation of current models.
arXiv Detail & Related papers (2022-06-21T14:19:06Z)
- Towards Intelligibility-Oriented Audio-Visual Speech Enhancement [8.19144665585397]
We present a fully convolutional AV SE model that uses a modified short-time objective intelligibility (STOI) metric as a training cost function.
Our proposed I-O AV SE framework outperforms audio-only (AO) and AV models trained with conventional distance-based loss functions.
arXiv Detail & Related papers (2021-11-18T11:47:37Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Temporarily-Aware Context Modelling using Generative Adversarial Networks for Speech Activity Detection [43.662221486962274]
We propose a novel joint learning framework for Speech Activity Detection (SAD).
We utilise generative adversarial networks to automatically learn a loss function for joint prediction of the frame-wise speech/non-speech classifications together with the next audio segment.
We evaluate the proposed framework on multiple public benchmarks, including NIST OpenSAT'17, AMI Meeting and HAVIC.
arXiv Detail & Related papers (2020-04-02T02:33:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.