Real-time Speaker counting in a cocktail party scenario using
Attention-guided Convolutional Neural Network
- URL: http://arxiv.org/abs/2111.00316v1
- Date: Sat, 30 Oct 2021 19:24:57 GMT
- Title: Real-time Speaker counting in a cocktail party scenario using
Attention-guided Convolutional Neural Network
- Authors: Midia Yousefi, John H.L. Hansen
- Abstract summary: We propose a real-time, single-channel attention-guided Convolutional Neural Network (CNN) to estimate the number of active speakers in overlapping speech.
The proposed system extracts higher-level information from the speech spectral content using a CNN model.
Experiments on simulated overlapping speech using the WSJ corpus show that the attention mechanism improves performance by almost 3% absolute over conventional temporal average pooling.
- Score: 60.99112031408449
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most current speech technology systems are designed to operate well even in
the presence of multiple active speakers. However, most solutions assume that
the number of concurrent speakers is known. Unfortunately, this information
might not always be available in real-world applications. In this study, we
propose a real-time, single-channel attention-guided Convolutional Neural
Network (CNN) to estimate the number of active speakers in overlapping speech.
The proposed system extracts higher-level information from the speech spectral
content using a CNN model. Next, the attention mechanism summarizes the
extracted information into a compact feature vector without losing critical
information. Finally, the active speakers are classified using a fully
connected network. Experiments on simulated overlapping speech using the WSJ
corpus show that the attention mechanism improves performance by almost
3% absolute over conventional temporal average pooling. The proposed
Attention-guided CNN achieves 76.15% for both Weighted Accuracy and average
Recall, and 75.80% Precision on speech segments as short as 20 frames (i.e.,
200 ms). All the classification metrics exceed 92% for the attention-guided
model in offline scenarios where the input signal is more than 100 frames long
(i.e., 1s).
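The abstract outlines a three-stage pipeline: a CNN that extracts frame-level features from the spectral content, an attention mechanism that pools them into a compact vector, and a fully connected classifier over the number of active speakers. A minimal PyTorch sketch of that pipeline is given below; the layer sizes, the soft-attention formulation, and the five-class output head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AttentionGuidedCNNCounter(nn.Module):
    """Sketch: CNN frame encoder -> attention pooling over time -> FC speaker-count classifier."""

    def __init__(self, n_mels=64, n_classes=5, emb_dim=128):
        super().__init__()
        # CNN extracts higher-level features from the log-mel spectrogram.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((None, 1)),          # collapse frequency, keep the time axis
        )
        self.proj = nn.Linear(64, emb_dim)
        self.att = nn.Linear(emb_dim, 1)              # one attention score per frame
        self.classifier = nn.Sequential(
            nn.Linear(emb_dim, 64), nn.ReLU(), nn.Linear(64, n_classes),
        )

    def forward(self, spec):                          # spec: (batch, frames, n_mels)
        x = self.encoder(spec.unsqueeze(1))           # (B, 64, frames, 1)
        h = self.proj(x.squeeze(-1).transpose(1, 2))  # (B, frames, emb_dim)
        w = torch.softmax(self.att(h), dim=1)         # attention weights over frames
        pooled = (w * h).sum(dim=1)                   # compact utterance-level vector
        return self.classifier(pooled)                # logits over speaker counts

model = AttentionGuidedCNNCounter()
logits = model(torch.randn(8, 20, 64))                # 20 frames, roughly 200 ms at a 10 ms hop
```

Replacing the attention weighting with `pooled = h.mean(dim=1)` gives the temporal-average-pooling baseline that the attention mechanism reportedly outperforms by about 3% absolute.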
Related papers
- BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with
Convolutional Cross Attention in Multi-talker Conditions [36.15815562576836]
Time-domain single-channel speech enhancement (SE) remains challenging, since extracting the target speaker without prior information is difficult in multi-talker conditions.
We propose a novel time-domain brain-assisted SE network (BASEN) incorporating electroencephalography (EEG) signals recorded from the listener for extracting the target speaker from monaural speech mixtures.
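The summary describes fusing listener EEG with the audio mixture through convolutional cross attention. The snippet below only illustrates the general cross-attention idea (audio frames attending to EEG frames) with a standard multi-head attention layer; it is not the BASEN architecture, and the feature dimensions are invented.

```python
import torch
import torch.nn as nn

class AudioEEGCrossAttention(nn.Module):
    """Illustrative cross attention: audio-frame queries attend to EEG-frame keys/values."""

    def __init__(self, d_model=128, n_heads=4, audio_dim=80, eeg_dim=64):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.eeg_proj = nn.Linear(eeg_dim, d_model)
        self.cross_att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, audio, eeg):
        # audio: (B, T_audio, audio_dim); eeg: (B, T_eeg, eeg_dim)
        q = self.audio_proj(audio)            # queries from the audio stream
        kv = self.eeg_proj(eeg)               # keys/values from the EEG stream
        fused, _ = self.cross_att(q, kv, kv)  # (B, T_audio, d_model), EEG-informed audio features
        return fused                          # would feed a separation/enhancement decoder
```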
arXiv Detail & Related papers (2023-05-17T06:40:31Z)
- Speech-enhanced and Noise-aware Networks for Robust Speech Recognition [25.279902171523233]
A noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition.
The two proposed systems achieve word error rates (WER) of 3.90% and 3.55%, respectively, on the Aurora-4 task.
Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves a relative WER reduction of 15.20% and 33.53%, respectively.
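The entry describes two cascaded networks, speech enhancement followed by recognition, optimized jointly. The sketch below shows one generic way to express such joint training, a weighted sum of an enhancement loss and a CTC-based ASR loss; the module interfaces, the MSE/CTC choice, and the weight alpha are placeholders rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def joint_se_asr_loss(se_model, asr_model, noisy, clean, targets, target_lens, alpha=0.3):
    """One illustrative training objective for a cascaded SE -> ASR pipeline."""
    enhanced = se_model(noisy)                        # enhancement front-end output
    se_loss = F.mse_loss(enhanced, clean)             # signal-level enhancement loss
    logits = asr_model(enhanced)                      # (T, B, vocab) acoustic-model output
    input_lens = torch.full((logits.size(1),), logits.size(0), dtype=torch.long)
    asr_loss = F.ctc_loss(logits.log_softmax(-1), targets, input_lens, target_lens)
    return asr_loss + alpha * se_loss                 # gradients flow through both networks
```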
arXiv Detail & Related papers (2022-03-25T15:04:51Z)
- AST: Audio Spectrogram Transformer [21.46018186487818]
We introduce the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification.
AST achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.
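AST's core idea is to treat the spectrogram as a sequence of patches and classify it with a pure transformer encoder, using no convolutions. A condensed sketch of that patch-embedding scheme follows; the patch size, embedding width, depth, and the omission of positional embeddings are simplifications and do not match the published configuration.

```python
import torch
import torch.nn as nn

class TinySpectrogramTransformer(nn.Module):
    """Convolution-free audio classifier: spectrogram patches -> transformer encoder -> [CLS] head."""

    def __init__(self, patch=16, d_model=192, depth=4, n_classes=50):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(patch * patch, d_model)        # linear patch projection
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))   # [CLS] token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(d_model, n_classes)
        # Positional embeddings are omitted here for brevity; AST adds them to every patch.

    def forward(self, spec):            # spec: (B, T, F) with T and F divisible by the patch size
        B, T, F = spec.shape
        p = self.patch
        patches = spec.reshape(B, T // p, p, F // p, p).permute(0, 1, 3, 2, 4)
        patches = patches.reshape(B, -1, p * p)               # (B, n_patches, p*p)
        tokens = torch.cat([self.cls.expand(B, -1, -1), self.embed(patches)], dim=1)
        return self.head(self.encoder(tokens)[:, 0])          # classify from the [CLS] token

logits = TinySpectrogramTransformer()(torch.randn(2, 128, 64))   # e.g. 128 frames x 64 mel bins
```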
arXiv Detail & Related papers (2021-04-05T05:26:29Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
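The fusion idea is that each spatial region gets its own dynamic weight that mixes the audio-based and video-based localization evidence. The snippet below sketches only that weighting step, with a small network predicting per-region weights from reliability features; the score shapes, feature inputs, and the sigmoid-gated mixing rule are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DynamicStreamWeightFusion(nn.Module):
    """Mix audio and video speaker-localization scores with learned, per-region weights."""

    def __init__(self, n_regions=36, feat_dim=16):
        super().__init__()
        # Predict one weight per region from reliability cues (e.g. local SNR, face-detector confidence).
        self.weight_net = nn.Sequential(
            nn.Linear(feat_dim, 32), nn.ReLU(),
            nn.Linear(32, n_regions), nn.Sigmoid(),
        )

    def forward(self, audio_scores, video_scores, reliability):
        # audio_scores, video_scores: (B, n_regions); reliability: (B, feat_dim)
        w = self.weight_net(reliability)                      # per-region weights in [0, 1]
        return w * audio_scores + (1.0 - w) * video_scores    # fused localization scores
```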
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Streaming Multi-speaker ASR with RNN-T [8.701566919381223]
This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T).
We show that guiding separation with speaker order labels in the former case enhances the high-level speaker tracking capability of RNN-T.
Our best model achieves a WER of 10.2% on simulated 2-speaker Libri data, which is competitive with the previously reported state-of-the-art nonstreaming model (10.3%).
arXiv Detail & Related papers (2020-11-23T19:10:40Z)
- A Two-Stage Approach to Device-Robust Acoustic Scene Classification [63.98724740606457]
A two-stage system based on fully convolutional neural networks (CNNs) is proposed to improve device robustness.
Our results show that the proposed ASC system attains a state-of-the-art accuracy on the development set.
Neural saliency analysis with class activation mapping gives new insights on the patterns learnt by our models.
arXiv Detail & Related papers (2020-11-03T03:27:18Z)
- Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions [19.691323658303435]
Conventional neural vocoders are adjusted to the training speaker and have poor generalization capabilities to unseen speakers.
We propose a variant of WaveRNN, referred to as speaker conditional WaveRNN (SC-WaveRNN).
In contrast to standard WaveRNN, SC-WaveRNN exploits additional information given in the form of speaker embeddings.
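The key point of the summary is that the vocoder is conditioned on a speaker embedding in addition to the acoustic features, which is what gives it unseen-speaker generalization. The toy sketch below only shows that conditioning pattern (embedding concatenated at every step of a small recurrent model); it is far simpler than WaveRNN itself and all sizes are invented.

```python
import torch
import torch.nn as nn

class SpeakerConditionedVocoderSketch(nn.Module):
    """Toy autoregressive vocoder conditioned on mel frames plus a speaker embedding."""

    def __init__(self, n_mels=80, spk_dim=256, hidden=512, n_bits=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels + spk_dim + 1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_bits)       # categorical distribution over quantized samples

    def forward(self, mels, spk_emb, prev_samples):
        # mels: (B, T, n_mels); spk_emb: (B, spk_dim); prev_samples: (B, T, 1)
        cond = spk_emb.unsqueeze(1).expand(-1, mels.size(1), -1)   # broadcast speaker identity over time
        h, _ = self.rnn(torch.cat([mels, cond, prev_samples], dim=-1))
        return self.out(h)                         # logits for the next sample at each step
```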
arXiv Detail & Related papers (2020-08-09T13:54:46Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts an activity of each speaker on each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
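TS-VAD's input/output contract is: acoustic features plus one embedding per enrolled speaker in, a per-frame activity value for every speaker out. The sketch below captures only that contract with a shared BiLSTM trunk and per-speaker sigmoid outputs; the real model's layer structure and i-vector handling differ.

```python
import torch
import torch.nn as nn

class TSVADSketch(nn.Module):
    """Per-frame, per-speaker activity prediction from features plus speaker embeddings."""

    def __init__(self, feat_dim=40, spk_dim=100, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim + spk_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, feats, spk_embs):
        # feats: (B, T, feat_dim); spk_embs: (B, n_speakers, spk_dim)
        B, T, _ = feats.shape
        activities = []
        for s in range(spk_embs.size(1)):                     # shared trunk, run once per speaker
            e = spk_embs[:, s].unsqueeze(1).expand(-1, T, -1)
            h, _ = self.rnn(torch.cat([feats, e], dim=-1))
            activities.append(torch.sigmoid(self.head(h)))    # (B, T, 1): activity of speaker s
        return torch.cat(activities, dim=-1)                  # (B, T, n_speakers)
```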
arXiv Detail & Related papers (2020-05-14T21:24:56Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 back-bones, while enjoying lower model complexity.
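The derivation step described here, stacking the searched cell several times to form the final CNN, can be sketched as follows. The cell below is a hand-written placeholder (in AutoSpeech the operations inside the cell are found by architecture search), and the channel count, depth, and output size are arbitrary.

```python
import torch
import torch.nn as nn

class PlaceholderCell(nn.Module):
    """Stand-in for a searched cell; NAS would choose this op combination instead."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))        # residual connection around the cell

def build_stacked_model(n_cells=8, channels=32, n_speakers=1000):
    """Derive a speaker-recognition CNN by repeating the (searched) cell n_cells times."""
    layers = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU()]
    layers += [PlaceholderCell(channels) for _ in range(n_cells)]
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, n_speakers)]
    return nn.Sequential(*layers)
```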
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from the multi-speaker mixture.
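One common way such direction-informed filters work is to compare the observed inter-channel phase difference against the phase difference expected from the target direction, and feed that directional feature to a mask estimator together with the spectrum. The snippet below is a loose sketch of that single ingredient, not the paper's actual temporal-spatial, waveform-level model; the shapes and the LSTM mask estimator are assumptions.

```python
import torch
import torch.nn as nn

class DirectionAwareMaskEstimator(nn.Module):
    """Estimate a target-speaker T-F mask from a reference spectrum plus a directional feature."""

    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(2 * n_freq, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, stft_ref, ipd, target_phase):
        # stft_ref: (B, T, F) complex STFT of a reference microphone
        # ipd: (B, T, F) observed inter-channel phase difference
        # target_phase: (B, 1, F) phase difference expected for the target direction (array geometry)
        direction_feat = torch.cos(ipd - target_phase)         # near 1 where the target direction dominates
        x = torch.cat([stft_ref.abs(), direction_feat], dim=-1)
        h, _ = self.rnn(x)
        return self.mask(h)                                    # mask applied to stft_ref before iSTFT
```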
arXiv Detail & Related papers (2020-01-02T11:12:50Z)