LoCoNet: Long-Short Context Network for Active Speaker Detection
- URL: http://arxiv.org/abs/2301.08237v2
- Date: Fri, 29 Mar 2024 22:29:03 GMT
- Title: LoCoNet: Long-Short Context Network for Active Speaker Detection
- Authors: Xizi Wang, Feng Cheng, Gedas Bertasius, David Crandall
- Abstract summary: Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a video.
We propose LoCoNet, a simple yet effective Long-Short Context Network.
LoCoNet achieves state-of-the-art performance on multiple datasets.
- Score: 18.06037779826666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a video. ASD reasons over audio and visual information from two contexts: long-term intra-speaker context and short-term inter-speaker context. Long-term intra-speaker context models the temporal dependencies of the same speaker, while short-term inter-speaker context models the interactions of speakers in the same scene. These two contexts are complementary and can help infer the active speaker. Motivated by these observations, we propose LoCoNet, a simple yet effective Long-Short Context Network that models both the long-term intra-speaker context and the short-term inter-speaker context. We use self-attention to model long-term intra-speaker context due to its effectiveness in modeling long-range dependencies, and convolutional blocks that capture local patterns to model short-term inter-speaker context. Extensive experiments show that LoCoNet achieves state-of-the-art performance on multiple datasets, achieving an mAP of 95.2% (+1.1%) on AVA-ActiveSpeaker, 68.1% (+22%) on the Columbia dataset, 97.2% (+2.8%) on the Talkies dataset, and 59.7% (+8.0%) on the Ego4D dataset. Moreover, in challenging cases where multiple speakers are present, or the face of the active speaker is much smaller than the other faces in the same scene, LoCoNet outperforms previous state-of-the-art methods by 3.4% on the AVA-ActiveSpeaker dataset. The code will be released at https://github.com/SJTUwxz/LoCoNet_ASD.
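As a rough illustration of the long-short design described in the abstract, here is a minimal PyTorch sketch: self-attention runs along the time axis within each speaker track (long-term intra-speaker context), while a convolution runs along the speaker axis at each time step (short-term inter-speaker context). The module name, tensor layout, and dimensions are illustrative assumptions, not the authors' released implementation (see the GitHub link above for that).

```python
import torch
import torch.nn as nn

class LongShortBlock(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        # Long-term intra-speaker context: self-attention over the time axis.
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Short-term inter-speaker context: a convolution over the speaker
        # axis captures local interactions between faces in the same scene.
        self.speaker_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, speakers, time, dim) fused audio-visual features,
        # one feature sequence per candidate face track.
        b, s, t, d = x.shape
        # 1) Attend over time independently within each speaker track.
        h = x.reshape(b * s, t, d)
        attn_out, _ = self.temporal_attn(h, h, h)
        h = self.norm(h + attn_out).reshape(b, s, t, d)
        # 2) Convolve over the speaker axis independently at each time step.
        c = h.permute(0, 2, 3, 1).reshape(b * t, d, s)   # (b*t, dim, speakers)
        c = self.speaker_conv(c).reshape(b, t, d, s).permute(0, 3, 1, 2)
        return h + c                                     # (batch, speakers, time, dim)

# Per-frame active/inactive logits for every candidate face could then be
# produced by a small linear head, e.g. nn.Linear(dim, 1).
```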
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- A Light Weight Model for Active Speaker Detection [7.253335671577093]
We construct a lightweight active speaker detection architecture by reducing input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, and applying gated recurrent unit (GRU) with low computational complexity for cross-modal modeling.
Experimental results on the AVA-ActiveSpeaker dataset show that our framework achieves competitive mAP performance (94.1% vs. 94.2%).
Our framework also performs well on the Columbia dataset showing good robustness.
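A minimal sketch of the lightweight recipe this summary describes: factorized spatial-then-temporal convolutions for the visual stream and a single GRU for cross-modal modeling. All layer sizes and feature dimensions here are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LightweightASD(nn.Module):
    def __init__(self, dim=64, mel_bins=40):
        super().__init__()
        # "Split" visual encoder: a spatial-only conv per frame followed by a
        # temporal-only conv, instead of a full (and costly) 3D backbone.
        self.spatial = nn.Conv3d(1, dim, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(dim, dim, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.audio_fc = nn.Linear(mel_bins, dim)
        # A single GRU keeps cross-modal modeling cheap.
        self.gru = nn.GRU(2 * dim, dim, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, faces, audio):
        # faces: (batch, 1, time, H, W) grayscale face crops
        # audio: (batch, time, mel_bins) frame-aligned log-mel features
        v = self.temporal(self.spatial(faces))      # (batch, dim, time, H, W)
        v = v.mean(dim=(3, 4)).transpose(1, 2)      # (batch, time, dim)
        a = self.audio_fc(audio)                    # (batch, time, dim)
        h, _ = self.gru(torch.cat([v, a], dim=-1))  # cross-modal fusion
        return self.head(h).squeeze(-1)             # per-frame speaking logits
```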
arXiv Detail & Related papers (2023-03-08T08:40:56Z)
- Learning Spatial-Temporal Graphs for Active Speaker Detection [26.45877018368872]
SPELL is a framework that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data.
We first construct a graph from a video so that each node corresponds to one person.
We demonstrate that learning graph-based representation, owing to its explicit spatial and temporal structure, significantly improves the overall performance.
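To make the graph construction concrete, here is a hedged sketch: one node per detected face, with inter-speaker edges between faces in the same frame and temporal edges along a face track. The detection fields and the time window are assumed for illustration; SPELL's actual construction differs in detail.

```python
# detections: list of {"t": timestamp, "track_id": int, "feat": tensor}
def build_graph(detections, time_window=0.9):
    edges = []
    for i, a in enumerate(detections):
        for j, b in enumerate(detections):
            if i == j:
                continue
            # Inter-speaker edge: two faces visible at the same timestamp.
            same_frame = a["t"] == b["t"]
            # Temporal edge: the same face track at nearby timestamps.
            same_track = (a["track_id"] == b["track_id"]
                          and abs(a["t"] - b["t"]) <= time_window)
            if same_frame or same_track:
                edges.append((i, j))
    return edges  # node features + edges go to a GNN for per-node ASD labels
```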
arXiv Detail & Related papers (2021-12-02T18:29:07Z)
- Real-time Speaker counting in a cocktail party scenario using Attention-guided Convolutional Neural Network [60.99112031408449]
We propose a real-time, single-channel attention-guided Convolutional Neural Network (CNN) to estimate the number of active speakers in overlapping speech.
The proposed system extracts higher-level information from the speech spectral content using a CNN model.
Experiments on simulated overlapping speech from the WSJ corpus show that attention improves performance by almost 3% absolute over conventional temporal average pooling.
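A minimal sketch of the attention-guided pooling being compared against plain temporal averaging: a learned per-frame weight lets informative frames dominate the pooled representation. The layer and dimensions are illustrative, not the paper's exact head.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    # Learns a per-frame weight so that informative frames dominate the
    # pooled embedding instead of every frame contributing equally.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):
        # h: (batch, time, dim) frame-level CNN features from the spectrogram
        w = torch.softmax(self.score(h), dim=1)   # (batch, time, 1) weights
        return (w * h).sum(dim=1)                 # (batch, dim) clip embedding

# Counting as classification over 0..max_speakers concurrent speakers:
# pooled = AttentionPooling(128)(frame_feats)
# logits = nn.Linear(128, max_speakers + 1)(pooled)
```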
arXiv Detail & Related papers (2021-10-30T19:24:57Z) - UniCon: Unified Context Network for Robust Active Speaker Detection [111.90529347692723]
We introduce a new efficient framework, the Unified Context Network (UniCon), for robust active speaker detection (ASD).
Our solution is a novel, unified framework that focuses on jointly modeling multiple types of contextual information.
A thorough ablation study is performed on several challenging ASD benchmarks under different settings.
arXiv Detail & Related papers (2021-08-05T13:25:44Z)
- How to Design a Three-Stage Architecture for Audio-Visual Active Speaker Detection in the Wild [8.368543987898732]
This work presents several practical guidelines for audio-visual active speaker detection.
We present a new architecture called ASDNet, which achieves a new state-of-the-art on the AVA-ActiveSpeaker dataset with a mAP of 93.5%.
arXiv Detail & Related papers (2021-06-07T19:44:56Z)
- MAAS: Multi-modal Assignation for Active Speaker Detection [59.08836580733918]
We present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem.
Our experiments show that a small graph data structure built from a single frame allows approximating an instantaneous audio-visual assignment problem.
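A hedged sketch of what such a single-frame assignment graph could look like: one audio node linked to every detected face node, so a small graph network can score which face matches the audio. The star topology, field names, and tensor layout are assumptions for illustration, not MAAS's exact construction.

```python
import torch

def single_frame_graph(audio_feat, face_feats):
    # audio_feat: (dim,) audio embedding for this frame
    # face_feats: (n_faces, dim) one embedding per detected face
    nodes = torch.cat([audio_feat.unsqueeze(0), face_feats], dim=0)
    n = face_feats.shape[0]
    # Star topology: audio node 0 <-> each face node 1..n, both directions.
    src = [0] * n + list(range(1, n + 1))
    dst = list(range(1, n + 1)) + [0] * n
    edge_index = torch.tensor([src, dst])  # (2, 2n) GNN-style edge list
    return nodes, edge_index  # score each face node as the active speaker
```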
arXiv Detail & Related papers (2021-01-11T02:57:25Z)
- Self-attention encoding and pooling for speaker recognition [16.96341561111918]
We propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding from variable-length speech utterances.
SAEP encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification.
We evaluate this approach on both the VoxCeleb1 and VoxCeleb2 datasets.
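A minimal sketch of the self-attention-plus-pooling idea: a small transformer encoder over frame-level spectral features, followed by attentive pooling into a fixed-size embedding regardless of utterance length. Dimensions and depth are illustrative assumptions, not SAEP's published configuration.

```python
import torch
import torch.nn as nn

class SAEPSketch(nn.Module):
    def __init__(self, feat_dim=40, dim=256, heads=4, layers=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
        self.pool = nn.Linear(dim, 1)  # attentive pooling scores

    def forward(self, x):
        # x: (batch, frames, feat_dim) spectral features of any length
        h = self.encoder(self.proj(x))          # self-attention encoding
        w = torch.softmax(self.pool(h), dim=1)  # per-frame weights
        return (w * h).sum(dim=1)               # fixed-size speaker embedding
```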
arXiv Detail & Related papers (2020-08-03T09:31:27Z)
- Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already improves active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker at each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
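A hedged sketch of the TS-VAD decision rule: condition shared acoustic features on each target speaker's embedding (e.g., an i-vector) and predict that speaker's activity at every frame. Layer choices and dimensions here are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TSVADSketch(nn.Module):
    def __init__(self, acoustic_dim=40, spk_dim=100, dim=128):
        super().__init__()
        self.rnn = nn.LSTM(acoustic_dim + spk_dim, dim,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * dim, 1)

    def forward(self, feats, spk_embs):
        # feats:    (batch, time, acoustic_dim) shared acoustic features
        # spk_embs: (batch, n_speakers, spk_dim) one embedding per target
        b, t, _ = feats.shape
        n = spk_embs.shape[1]
        # Tile each speaker embedding along time, then decode each
        # (features, speaker) pair independently.
        f = feats.unsqueeze(1).expand(b, n, t, -1)
        e = spk_embs.unsqueeze(2).expand(b, n, t, -1)
        x = torch.cat([f, e], dim=-1).reshape(b * n, t, -1)
        h, _ = self.rnn(x)
        return self.head(h).reshape(b, n, t)  # per-speaker, per-frame logits
```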
arXiv Detail & Related papers (2020-05-14T21:24:56Z)