Stuttering Detection Using Speaker Representations and Self-supervised
Contextual Embeddings
- URL: http://arxiv.org/abs/2306.00689v1
- Date: Thu, 1 Jun 2023 14:00:47 GMT
- Title: Stuttering Detection Using Speaker Representations and Self-supervised
Contextual Embeddings
- Authors: Shakeel A. Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni
- Abstract summary: We introduce the application of speech embeddings extracted from pre-trained deep learning models trained on large audio datasets for different tasks.
In comparison to standard SD systems trained only on the limited SEP-28k dataset, we obtain relative improvements of 12.08%, 28.71%, and 37.9% in unweighted average recall (UAR) over the baselines.
- Score: 7.42741711946564
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The adoption of advanced deep learning architectures in stuttering detection
(SD) tasks is challenging due to the limited size of the available datasets. To
this end, this work introduces the application of speech embeddings extracted
from pre-trained deep learning models trained on large audio datasets for
different tasks. In particular, we explore audio representations obtained using
the emphasized channel attention, propagation, and aggregation time delay
neural network (ECAPA-TDNN) and Wav2Vec2.0 models, trained on the VoxCeleb and
LibriSpeech datasets, respectively. After extracting the embeddings, we
benchmark several traditional classifiers, such as K-nearest neighbours (KNN),
Gaussian naive Bayes, and a neural network, on the SD tasks. In comparison to
standard SD systems trained only on the limited SEP-28k dataset, we obtain
relative improvements of 12.08%, 28.71%, and 37.9% in unweighted average
recall (UAR) over the baselines. Finally, we show that combining the two
embeddings and concatenating multiple layers of Wav2Vec2.0 further improve
the UAR by up to 2.60% and 6.32%, respectively.
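A minimal sketch of the pipeline the abstract describes, assuming public checkpoints (facebook/wav2vec2-base-960h for Wav2Vec2.0 and speechbrain/spkrec-ecapa-voxceleb for ECAPA-TDNN) stand in for the paper's extractors; the pooled layers, the embedding-combination scheme, and the KNN's k are illustrative assumptions, not the authors' reported configuration:

```python
# Sketch: extract pre-trained embeddings, fit a traditional classifier,
# and score with UAR. Checkpoints, layer choice, and k are assumptions.
import numpy as np
import torch
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from speechbrain.pretrained import EncoderClassifier

SR = 16000  # both models expect 16 kHz audio

w2v_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()
ecapa = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def wav2vec2_embedding(waveform: torch.Tensor, layers=range(1, 4)) -> np.ndarray:
    """Mean-pool the hidden states of several Wav2Vec2.0 layers over time
    and concatenate them (the layer-concatenation idea from the abstract)."""
    inputs = w2v_fe(waveform.numpy(), sampling_rate=SR, return_tensors="pt")
    with torch.no_grad():
        out = w2v(**inputs, output_hidden_states=True)
    pooled = [out.hidden_states[i].mean(dim=1).squeeze(0) for i in layers]
    return torch.cat(pooled).numpy()

def ecapa_embedding(waveform: torch.Tensor) -> np.ndarray:
    """Fixed-dimensional speaker embedding from ECAPA-TDNN."""
    with torch.no_grad():
        return ecapa.encode_batch(waveform.unsqueeze(0)).squeeze().numpy()

def combined_embedding(waveform: torch.Tensor) -> np.ndarray:
    """Combine the two embeddings by simple vector concatenation."""
    return np.concatenate([wav2vec2_embedding(waveform), ecapa_embedding(waveform)])

def evaluate_uar(X_train, y_train, X_test, y_test, k=5) -> float:
    """Fit a KNN classifier and report UAR (macro-averaged recall)."""
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    return recall_score(y_test, clf.predict(X_test), average="macro")
```

Here X_train/X_test would be embeddings stacked over the SEP-28k clips and y_train/y_test their disfluency labels (data loading omitted). UAR, i.e. macro-averaged recall, is the natural metric given the class imbalance of stuttering labels, and "combining two embeddings" reduces to concatenating the two fixed-dimensional vectors before fitting the classifier.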
Related papers
- SparseVSR: Lightweight and Noise Robust Visual Speech Recognition [100.43280310123784]
We generate a lightweight model that achieves higher performance than its dense model equivalent.
Our results confirm that sparse networks are more resistant to noise than dense networks.
arXiv Detail & Related papers (2023-07-10T13:34:13Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Introducing ECAPA-TDNN and Wav2Vec2.0 Embeddings to Stuttering Detection [7.42741711946564]
This work introduces the application of speech embeddings extracted with pre-trained deep models trained on massive audio datasets for different tasks.
In comparison to the standard stuttering detection system trained only on the limited SEP-28k dataset, we obtain a relative improvement of 16.74% in terms of overall accuracy over baseline.
arXiv Detail & Related papers (2022-04-04T15:12:25Z)
- Training speaker recognition systems with limited data [2.3148470932285665]
This work considers training neural networks for speaker recognition with a much smaller dataset size compared to contemporary work.
We artificially restrict the amount of data by proposing three subsets of the popular VoxCeleb2 dataset.
We show that the self-supervised, pre-trained weights of wav2vec2 substantially improve performance when training data is limited.
arXiv Detail & Related papers (2022-03-28T12:41:41Z)
- Multi-turn RNN-T for streaming recognition of multi-party speech [2.899379040028688]
This work takes real-time applicability as the first priority in model design and addresses a few challenges in previous work on the multi-speaker recurrent neural network transducer (MS-RNN-T).
We introduce on-the-fly overlapping speech simulation during training, yielding a 14% relative word error rate (WER) improvement on the LibriSpeechMix test set.
We propose a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary number of speakers without changes in the model architecture.
arXiv Detail & Related papers (2021-12-19T17:22:58Z)
- STC speaker recognition systems for the NIST SRE 2021 [56.05258832139496]
This paper presents a description of STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation.
These systems consist of a number of diverse subsystems that use deep neural networks as feature extractors.
For the video modality, we developed our best solution with the RetinaFace face detector and a deep ResNet face-embedding extractor trained on large face image datasets.
arXiv Detail & Related papers (2021-11-03T15:31:01Z)
- BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition [126.5605160882849]
We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency.
We report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks.
arXiv Detail & Related papers (2021-09-27T17:59:19Z)
- Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation [63.98724740606457]
We present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge.
Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes.
Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions.
arXiv Detail & Related papers (2020-07-16T15:07:14Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)