Real-time Speaker counting in a cocktail party scenario using
Attention-guided Convolutional Neural Network
- URL: http://arxiv.org/abs/2111.00316v1
- Date: Sat, 30 Oct 2021 19:24:57 GMT
- Title: Real-time Speaker counting in a cocktail party scenario using
Attention-guided Convolutional Neural Network
- Authors: Midia Yousefi, John H.L. Hansen
- Abstract summary: We propose a real-time, single-channel attention-guided Convolutional Neural Network (CNN) to estimate the number of active speakers in overlapping speech.
The proposed system extracts higher-level information from the speech spectral content using a CNN model.
Experiments on simulated overlapping speech using the WSJ corpus show that the attention mechanism improves performance by almost 3% absolute over conventional temporal average pooling.
- Score: 60.99112031408449
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most current speech technology systems are designed to operate well even in
the presence of multiple active speakers. However, most solutions assume that
the number of concurrent speakers is known. Unfortunately, this information
might not always be available in real-world applications. In this study, we
propose a real-time, single-channel attention-guided Convolutional Neural
Network (CNN) to estimate the number of active speakers in overlapping speech.
The proposed system extracts higher-level information from the speech spectral
content using a CNN model. Next, the attention mechanism summarizes the
extracted information into a compact feature vector without losing critical
information. Finally, the active speakers are classified using a fully
connected network. Experiments on simulated overlapping speech using the WSJ
corpus show that the attention mechanism improves performance by almost
3% absolute over conventional temporal average pooling. The proposed
Attention-guided CNN achieves 76.15% for both Weighted Accuracy and average
Recall, and 75.80% Precision on speech segments as short as 20 frames (i.e.,
200 ms). All the classification metrics exceed 92% for the attention-guided
model in offline scenarios where the input signal is more than 100 frames long
(i.e., 1s).
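The abstract outlines a three-stage pipeline: a CNN that extracts frame-level features from the spectral content, an attention mechanism that pools them into a compact vector, and a fully connected classifier over the number of active speakers. A minimal PyTorch sketch of that pipeline is given below; the layer sizes, the soft-attention formulation, and the five-class output head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AttentionGuidedCNNCounter(nn.Module):
    """Sketch: CNN frame encoder -> attention pooling over time -> FC speaker-count classifier."""

    def __init__(self, n_mels=64, n_classes=5, emb_dim=128):
        super().__init__()
        # CNN extracts higher-level features from the log-mel spectrogram.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((None, 1)),          # collapse frequency, keep the time axis
        )
        self.proj = nn.Linear(64, emb_dim)
        self.att = nn.Linear(emb_dim, 1)              # one attention score per frame
        self.classifier = nn.Sequential(
            nn.Linear(emb_dim, 64), nn.ReLU(), nn.Linear(64, n_classes),
        )

    def forward(self, spec):                          # spec: (batch, frames, n_mels)
        x = self.encoder(spec.unsqueeze(1))           # (B, 64, frames, 1)
        h = self.proj(x.squeeze(-1).transpose(1, 2))  # (B, frames, emb_dim)
        w = torch.softmax(self.att(h), dim=1)         # attention weights over frames
        pooled = (w * h).sum(dim=1)                   # compact utterance-level vector
        return self.classifier(pooled)                # logits over speaker counts

model = AttentionGuidedCNNCounter()
logits = model(torch.randn(8, 20, 64))                # 20 frames, roughly 200 ms at a 10 ms hop
```

Replacing the attention weighting with `pooled = h.mean(dim=1)` gives the temporal-average-pooling baseline that the attention mechanism reportedly outperforms by about 3% absolute.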
Related papers
- BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with
Convolutional Cross Attention in Multi-talker Conditions [36.15815562576836]
Time-domain single-channel speech enhancement (SE) remains challenging, since extracting the target speaker without prior information is difficult in multi-talker conditions.
We propose a novel time-domain brain-assisted SE network (BASEN) incorporating electroencephalography (EEG) signals recorded from the listener for extracting the target speaker from monaural speech mixtures.
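The summary describes fusing listener EEG with the audio mixture through convolutional cross attention. The snippet below only illustrates the general cross-attention idea (audio frames attending to EEG frames) with a standard multi-head attention layer; it is not the BASEN architecture, and the feature dimensions are invented.

```python
import torch
import torch.nn as nn

class AudioEEGCrossAttention(nn.Module):
    """Illustrative cross attention: audio-frame queries attend to EEG-frame keys/values."""

    def __init__(self, d_model=128, n_heads=4, audio_dim=80, eeg_dim=64):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.eeg_proj = nn.Linear(eeg_dim, d_model)
        self.cross_att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, audio, eeg):
        # audio: (B, T_audio, audio_dim); eeg: (B, T_eeg, eeg_dim)
        q = self.audio_proj(audio)            # queries from the audio stream
        kv = self.eeg_proj(eeg)               # keys/values from the EEG stream
        fused, _ = self.cross_att(q, kv, kv)  # (B, T_audio, d_model), EEG-informed audio features
        return fused                          # would feed a separation/enhancement decoder
```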
arXiv Detail & Related papers (2023-05-17T06:40:31Z)
- Speech-enhanced and Noise-aware Networks for Robust Speech Recognition [25.279902171523233]
A noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition.
The two proposed systems achieve word error rates (WER) of 3.90% and 3.55%, respectively, on the Aurora-4 task.
Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves a relative WER reduction of 15.20% and 33.53%, respectively.
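The entry describes two cascaded networks, speech enhancement followed by recognition, optimized jointly. The sketch below shows one generic way to express such joint training, a weighted sum of an enhancement loss and a CTC-based ASR loss; the module interfaces, the MSE/CTC choice, and the weight alpha are placeholders rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def joint_se_asr_loss(se_model, asr_model, noisy, clean, targets, target_lens, alpha=0.3):
    """One illustrative training objective for a cascaded SE -> ASR pipeline."""
    enhanced = se_model(noisy)                        # enhancement front-end output
    se_loss = F.mse_loss(enhanced, clean)             # signal-level enhancement loss
    logits = asr_model(enhanced)                      # (T, B, vocab) acoustic-model output
    input_lens = torch.full((logits.size(1),), logits.size(0), dtype=torch.long)
    asr_loss = F.ctc_loss(logits.log_softmax(-1), targets, input_lens, target_lens)
    return asr_loss + alpha * se_loss                 # gradients flow through both networks
```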
arXiv Detail & Related papers (2022-03-25T15:04:51Z)
- AST: Audio Spectrogram Transformer [21.46018186487818]
We introduce the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification.
AST achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.
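AST's core idea is to treat the spectrogram as a sequence of patches and classify it with a pure transformer encoder, using no convolutions. A condensed sketch of that patch-embedding scheme follows; the patch size, embedding width, depth, and the omission of positional embeddings are simplifications and do not match the published configuration.

```python
import torch
import torch.nn as nn

class TinySpectrogramTransformer(nn.Module):
    """Convolution-free audio classifier: spectrogram patches -> transformer encoder -> [CLS] head."""

    def __init__(self, patch=16, d_model=192, depth=4, n_classes=50):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(patch * patch, d_model)        # linear patch projection
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))   # [CLS] token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(d_model, n_classes)
        # Positional embeddings are omitted here for brevity; AST adds them to every patch.

    def forward(self, spec):            # spec: (B, T, F) with T and F divisible by the patch size
        B, T, F = spec.shape
        p = self.patch
        patches = spec.reshape(B, T // p, p, F // p, p).permute(0, 1, 3, 2, 4)
        patches = patches.reshape(B, -1, p * p)               # (B, n_patches, p*p)
        tokens = torch.cat([self.cls.expand(B, -1, -1), self.embed(patches)], dim=1)
        return self.head(self.encoder(tokens)[:, 0])          # classify from the [CLS] token

logits = TinySpectrogramTransformer()(torch.randn(2, 128, 64))   # e.g. 128 frames x 64 mel bins
```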
arXiv Detail & Related papers (2021-04-05T05:26:29Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
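The fusion idea is that each spatial region gets its own dynamic weight that mixes the audio-based and video-based localization evidence. The snippet below sketches only that weighting step, with a small network predicting per-region weights from reliability features; the score shapes, feature inputs, and the sigmoid-gated mixing rule are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DynamicStreamWeightFusion(nn.Module):
    """Mix audio and video speaker-localization scores with learned, per-region weights."""

    def __init__(self, n_regions=36, feat_dim=16):
        super().__init__()
        # Predict one weight per region from reliability cues (e.g. local SNR, face-detector confidence).
        self.weight_net = nn.Sequential(
            nn.Linear(feat_dim, 32), nn.ReLU(),
            nn.Linear(32, n_regions), nn.Sigmoid(),
        )

    def forward(self, audio_scores, video_scores, reliability):
        # audio_scores, video_scores: (B, n_regions); reliability: (B, feat_dim)
        w = self.weight_net(reliability)                      # per-region weights in [0, 1]
        return w * audio_scores + (1.0 - w) * video_scores    # fused localization scores
```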
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Streaming Multi-speaker ASR with RNN-T [8.701566919381223]
This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T).
We show that guiding separation with speaker order labels in the former case enhances the high-level speaker tracking capability of RNN-T.
Our best model achieves a WER of 10.2% on simulated 2-speaker Libri data, which is competitive with the previously reported state-of-the-art nonstreaming model (10.3%).
arXiv Detail & Related papers (2020-11-23T19:10:40Z)
- A Two-Stage Approach to Device-Robust Acoustic Scene Classification [63.98724740606457]
A two-stage system based on fully convolutional neural networks (CNNs) is proposed to improve device robustness.
Our results show that the proposed ASC system attains a state-of-the-art accuracy on the development set.
Neural saliency analysis with class activation mapping gives new insights on the patterns learnt by our models.
arXiv Detail & Related papers (2020-11-03T03:27:18Z)
- Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions [19.691323658303435]
Conventional neural vocoders are adjusted to the training speaker and have poor generalization capabilities to unseen speakers.
We propose a variant of WaveRNN, referred to as speaker conditional WaveRNN (SC-WaveRNN).
In contrast to standard WaveRNN, SC-WaveRNN exploits additional information given in the form of speaker embeddings.
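The key point of the summary is that the vocoder is conditioned on a speaker embedding in addition to the acoustic features, which is what gives it unseen-speaker generalization. The toy sketch below only shows that conditioning pattern (embedding concatenated at every step of a small recurrent model); it is far simpler than WaveRNN itself and all sizes are invented.

```python
import torch
import torch.nn as nn

class SpeakerConditionedVocoderSketch(nn.Module):
    """Toy autoregressive vocoder conditioned on mel frames plus a speaker embedding."""

    def __init__(self, n_mels=80, spk_dim=256, hidden=512, n_bits=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels + spk_dim + 1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_bits)       # categorical distribution over quantized samples

    def forward(self, mels, spk_emb, prev_samples):
        # mels: (B, T, n_mels); spk_emb: (B, spk_dim); prev_samples: (B, T, 1)
        cond = spk_emb.unsqueeze(1).expand(-1, mels.size(1), -1)   # broadcast speaker identity over time
        h, _ = self.rnn(torch.cat([mels, cond, prev_samples], dim=-1))
        return self.out(h)                         # logits for the next sample at each step
```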
arXiv Detail & Related papers (2020-08-09T13:54:46Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts an activity of each speaker on each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
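TS-VAD's input/output contract is: acoustic features plus one embedding per enrolled speaker in, a per-frame activity value for every speaker out. The sketch below captures only that contract with a shared BiLSTM trunk and per-speaker sigmoid outputs; the real model's layer structure and i-vector handling differ.

```python
import torch
import torch.nn as nn

class TSVADSketch(nn.Module):
    """Per-frame, per-speaker activity prediction from features plus speaker embeddings."""

    def __init__(self, feat_dim=40, spk_dim=100, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim + spk_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, feats, spk_embs):
        # feats: (B, T, feat_dim); spk_embs: (B, n_speakers, spk_dim)
        B, T, _ = feats.shape
        activities = []
        for s in range(spk_embs.size(1)):                     # shared trunk, run once per speaker
            e = spk_embs[:, s].unsqueeze(1).expand(-1, T, -1)
            h, _ = self.rnn(torch.cat([feats, e], dim=-1))
            activities.append(torch.sigmoid(self.head(h)))    # (B, T, 1): activity of speaker s
        return torch.cat(activities, dim=-1)                  # (B, T, n_speakers)
```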
arXiv Detail & Related papers (2020-05-14T21:24:56Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 back-bones, while enjoying lower model complexity.
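The derivation step described here, stacking the searched cell several times to form the final CNN, can be sketched as follows. The cell below is a hand-written placeholder (in AutoSpeech the operations inside the cell are found by architecture search), and the channel count, depth, and output size are arbitrary.

```python
import torch
import torch.nn as nn

class PlaceholderCell(nn.Module):
    """Stand-in for a searched cell; NAS would choose this op combination instead."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))        # residual connection around the cell

def build_stacked_model(n_cells=8, channels=32, n_speakers=1000):
    """Derive a speaker-recognition CNN by repeating the (searched) cell n_cells times."""
    layers = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU()]
    layers += [PlaceholderCell(channels) for _ in range(n_cells)]
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, n_speakers)]
    return nn.Sequential(*layers)
```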
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from the multi-speaker mixture.
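One common way such direction-informed filters work is to compare the observed inter-channel phase difference against the phase difference expected from the target direction, and feed that directional feature to a mask estimator together with the spectrum. The snippet below is a loose sketch of that single ingredient, not the paper's actual temporal-spatial, waveform-level model; the shapes and the LSTM mask estimator are assumptions.

```python
import torch
import torch.nn as nn

class DirectionAwareMaskEstimator(nn.Module):
    """Estimate a target-speaker T-F mask from a reference spectrum plus a directional feature."""

    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(2 * n_freq, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, stft_ref, ipd, target_phase):
        # stft_ref: (B, T, F) complex STFT of a reference microphone
        # ipd: (B, T, F) observed inter-channel phase difference
        # target_phase: (B, 1, F) phase difference expected for the target direction (array geometry)
        direction_feat = torch.cos(ipd - target_phase)         # near 1 where the target direction dominates
        x = torch.cat([stft_ref.abs(), direction_feat], dim=-1)
        h, _ = self.rnn(x)
        return self.mask(h)                                    # mask applied to stft_ref before iSTFT
```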
arXiv Detail & Related papers (2020-01-02T11:12:50Z)