A Light Weight Model for Active Speaker Detection
- URL: http://arxiv.org/abs/2303.04439v1
- Date: Wed, 8 Mar 2023 08:40:56 GMT
- Title: A Light Weight Model for Active Speaker Detection
- Authors: Junhua Liao, Haihan Duan, Kanghui Feng, Wanbing Zhao, Yanbing Yang and
Liangyin Chen
- Abstract summary: We construct a lightweight active speaker detection architecture by reducing input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, and applying a gated recurrent unit (GRU) with low computational complexity for cross-modal modeling.
Experimental results on the AVA-ActiveSpeaker dataset show that our framework achieves competitive mAP performance (94.1% vs. 94.2%).
Our framework also performs well on the Columbia dataset, showing good robustness.
- Score: 7.253335671577093
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Active speaker detection is a challenging task in audio-visual scenario understanding that aims to detect who is speaking in scenarios with one or more speakers. The task has received extensive attention because it is crucial to applications such as speaker diarization, speaker tracking, and automatic video editing. Existing studies try to improve performance by feeding in information about multiple candidates and designing complex models. Although these methods achieve outstanding performance, their high memory and computational costs make them difficult to apply in resource-limited scenarios. We therefore construct a lightweight active speaker detection architecture by reducing the number of input candidates, splitting the 2D and 3D convolutions used for audio-visual feature extraction, and applying a gated recurrent unit (GRU) with low computational complexity for cross-modal modeling. Experimental results on the AVA-ActiveSpeaker dataset show that our framework achieves competitive mAP (94.1% vs. 94.2%) while requiring significantly fewer resources than the state-of-the-art method, especially in model parameters (1.0M vs. 22.5M, about 23x fewer) and FLOPs (0.6G vs. 2.6G, about 4x fewer). In addition, our framework performs well on the Columbia dataset, demonstrating good robustness.
The code and model weights are available at
https://github.com/Junhua-Liao/Light-ASD.
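The abstract names three ingredients of the lightweight design: fewer input candidates, audio-visual feature extractors built from split 2D and 3D convolutions, and a GRU for cross-modal temporal modeling. The following is a minimal PyTorch sketch of that shape of architecture, not the authors' implementation (which is in the linked repository); all layer widths, kernel sizes, strides, and pooling choices are illustrative assumptions.

```python
# Minimal sketch of a lightweight audio-visual ASD model in PyTorch.
# Layer widths, kernel sizes, strides, and the fusion scheme are illustrative
# assumptions, not the configuration used in the Light-ASD repository.
import torch
import torch.nn as nn


class VisualEncoder(nn.Module):
    """Split spatio-temporal convolution: 2D over space, then 3D over time only."""

    def __init__(self, out_dim=128):
        super().__init__()
        self.spatial = nn.Conv3d(1, 32, kernel_size=(1, 3, 3),
                                 stride=(1, 2, 2), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(32, out_dim, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep the time axis

    def forward(self, faces):  # faces: (B, 1, T, H, W) grayscale face crops
        x = torch.relu(self.spatial(faces))
        x = torch.relu(self.temporal(x))
        return self.pool(x).flatten(2).transpose(1, 2)  # (B, T, out_dim)


class AudioEncoder(nn.Module):
    """2D convolutions over log-mel features, pooled to one vector per video frame."""

    def __init__(self, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_dim, 3, padding=1), nn.ReLU(),
        )

    def forward(self, mels, num_frames):  # mels: (B, 1, T_audio, n_mels)
        x = self.conv(mels).mean(dim=3)                      # (B, out_dim, T_audio)
        x = nn.functional.adaptive_avg_pool1d(x, num_frames)
        return x.transpose(1, 2)                             # (B, T, out_dim)


class LightweightASD(nn.Module):
    """Fuse the two streams and model temporal context with a GRU."""

    def __init__(self, dim=128):
        super().__init__()
        self.visual, self.audio = VisualEncoder(dim), AudioEncoder(dim)
        self.gru = nn.GRU(2 * dim, dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * dim, 1)  # per-frame speaking logit

    def forward(self, faces, mels):
        v = self.visual(faces)
        a = self.audio(mels, num_frames=v.shape[1])
        h, _ = self.gru(torch.cat([v, a], dim=-1))
        return self.head(h).squeeze(-1)  # (B, T)


if __name__ == "__main__":
    model = LightweightASD()
    scores = model(torch.randn(2, 1, 25, 112, 112), torch.randn(2, 1, 400, 40))
    print(scores.shape)  # torch.Size([2, 25])
```

The cost saving comes from the split kernels in VisualEncoder: a dense 3D kernel over (time, height, width) is replaced by a purely spatial kernel followed by a purely temporal one, which carries far fewer parameters, while the GRU keeps cross-modal temporal modeling cheap relative to attention-based context models.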
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models [14.538853403226751]
Building artificial intelligence systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research.
We propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM.
Our method requires only quick training of the V2A-Mapper to produce high-fidelity, visually aligned sound.
arXiv Detail & Related papers (2023-08-18T04:49:38Z)
- End-To-End Audiovisual Feature Fusion for Active Speaker Detection [7.631698269792165]
This work presents a novel two-stream end-to-end framework that fuses features extracted from images via VGG-M with Mel-frequency cepstral coefficient (MFCC) features extracted from the raw audio waveform (a minimal two-stream fusion sketch appears after this list).
Our best-performing model attained 88.929% accuracy, nearly the same detection result as state-of-the-art work.
arXiv Detail & Related papers (2022-07-27T10:25:59Z)
- CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition [91.33781557979819]
We introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR).
It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers.
We provide detailed statistics of both the clean and the augmented versions of our dataset.
arXiv Detail & Related papers (2022-01-11T06:32:12Z)
- Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native automated speech scoring (ASS), called speaker-conditioned hierarchical modeling.
Our technique takes advantage of the fact that oral proficiency tests rate multiple responses from each candidate: we extract context from these responses and feed it as additional speaker-specific context to our network when scoring a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z)
- UniCon: Unified Context Network for Robust Active Speaker Detection [111.90529347692723]
We introduce a new efficient framework, the Unified Context Network (UniCon), for robust active speaker detection (ASD).
Our solution is a novel, unified framework that focuses on jointly modeling multiple types of contextual information.
A thorough ablation study is performed on several challenging ASD benchmarks under different settings.
arXiv Detail & Related papers (2021-08-05T13:25:44Z)
- Streaming Multi-speaker ASR with RNN-T [8.701566919381223]
This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T).
We show that guiding separation with speaker order labels in the former case enhances the high-level speaker tracking capability of RNN-T.
Our best model achieves a WER of 10.2% on simulated 2-speaker Libri data, which is competitive with the previously reported state-of-the-art non-streaming model (10.3%).
arXiv Detail & Related papers (2020-11-23T19:10:40Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with much larger models on downstream tasks (a minimal sketch of ALBERT-style layer sharing appears after this list).
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than those of the last layer.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker on each time frame (a minimal sketch of this per-speaker, per-frame formulation appears after this list).
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
arXiv Detail & Related papers (2020-05-14T21:24:56Z)
- Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture.
We focus on the scene context provided by the visual information, to ground the ASR.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)
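For the End-To-End Audiovisual Feature Fusion entry above, this is a minimal sketch of a two-stream model that fuses CNN image features with MFCC audio features computed from the raw waveform. The small convolutional stack stands in for VGG-M (which is not shipped with torchvision), and all dimensions and the late-concatenation fusion are illustrative assumptions rather than the paper's configuration.

```python
# Sketch of a two-stream audio-visual fusion classifier (see the
# "End-To-End Audiovisual Feature Fusion" entry above). The small CNN is a
# stand-in for VGG-M and every dimension below is an illustrative assumption.
import torch
import torch.nn as nn
import torchaudio


class TwoStreamFusion(nn.Module):
    def __init__(self, sample_rate=16000, n_mfcc=13):
        super().__init__()
        # Audio stream: MFCCs computed from the raw waveform, summarised by a GRU.
        self.mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=n_mfcc)
        self.audio_rnn = nn.GRU(n_mfcc, 64, batch_first=True)
        # Visual stream: a small CNN over the face crop (VGG-M stand-in).
        self.visual = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Fusion: concatenate both embeddings and classify speaking / not speaking.
        self.classifier = nn.Linear(64 + 64, 2)

    def forward(self, waveform, face):  # waveform: (B, T_samples), face: (B, 3, H, W)
        mfcc = self.mfcc(waveform).transpose(1, 2)  # (B, T_frames, n_mfcc)
        _, audio_h = self.audio_rnn(mfcc)           # audio_h: (1, B, 64)
        visual_h = self.visual(face)                # (B, 64)
        fused = torch.cat([audio_h[-1], visual_h], dim=-1)
        return self.classifier(fused)               # (B, 2) logits


if __name__ == "__main__":
    model = TwoStreamFusion()
    logits = model(torch.randn(4, 16000), torch.randn(4, 3, 112, 112))
    print(logits.shape)  # torch.Size([4, 2])
```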
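The Audio ALBERT entry above follows the ALBERT recipe of making a transformer "lite" chiefly by sharing one set of layer weights across all encoder layers. The sketch below isolates that idea only; model sizes are illustrative assumptions and the self-supervised pretraining objective of Audio ALBERT is omitted.

```python
# Sketch of ALBERT-style cross-layer parameter sharing: one encoder layer is
# reused at every depth step, so the parameter count stays roughly constant as
# depth grows. Sizes are illustrative assumptions.
import torch
import torch.nn as nn


class SharedLayerEncoder(nn.Module):
    def __init__(self, dim=256, heads=4, depth=6):
        super().__init__()
        # A single layer instance; applying it `depth` times shares its weights
        # across all layers instead of allocating `depth` separate layers.
        self.layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True
        )
        self.depth = depth

    def forward(self, x):  # x: (B, T, dim) frame-level speech features
        for _ in range(self.depth):
            x = self.layer(x)
        return x


if __name__ == "__main__":
    shared = SharedLayerEncoder()
    unshared = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
        num_layers=6,
    )
    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(shared), count(unshared))  # shared encoder is roughly 6x smaller
    print(shared(torch.randn(2, 100, 256)).shape)  # torch.Size([2, 100, 256])
```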
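The Target-Speaker Voice Activity Detection entry above frames diarization as predicting, for every enrolled speaker, whether that speaker is active in each time frame. The sketch below shows the shape of that formulation: acoustic frame features are paired with one embedding per speaker (i-vectors in the original work; generic vectors here), and a sigmoid head emits a per-speaker, per-frame activity probability. The network sizes are illustrative assumptions, not the CHiME-6 system.

```python
# Sketch of target-speaker voice activity detection (see the TS-VAD entry
# above): one activity probability per enrolled speaker per frame.
# Feature sizes and the single BLSTM context layer are illustrative assumptions.
import torch
import torch.nn as nn


class TSVADSketch(nn.Module):
    def __init__(self, feat_dim=40, spk_dim=100, hidden=128):
        super().__init__()
        # Each frame is paired with each speaker embedding independently,
        # then a recurrent layer lets the per-speaker streams use temporal context.
        self.frame_net = nn.Sequential(
            nn.Linear(feat_dim + spk_dim, hidden), nn.ReLU(),
        )
        self.context = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, frames, speakers):
        # frames: (B, T, feat_dim) acoustic features
        # speakers: (B, S, spk_dim) one embedding per enrolled speaker
        B, T, _ = frames.shape
        S = speakers.shape[1]
        # Pair every frame with every speaker embedding: (B, S, T, feat+spk).
        pairs = torch.cat(
            [frames.unsqueeze(1).expand(B, S, T, -1),
             speakers.unsqueeze(2).expand(B, S, T, -1)],
            dim=-1,
        )
        h = self.frame_net(pairs).reshape(B * S, T, -1)
        h, _ = self.context(h)
        activity = torch.sigmoid(self.head(h)).reshape(B, S, T)
        return activity  # per-speaker, per-frame activity probability


if __name__ == "__main__":
    model = TSVADSketch()
    probs = model(torch.randn(2, 200, 40), torch.randn(2, 4, 100))
    print(probs.shape)  # torch.Size([2, 4, 200])
```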