Introducing ECAPA-TDNN and Wav2Vec2.0 Embeddings to Stuttering Detection
- URL: http://arxiv.org/abs/2204.01564v1
- Date: Mon, 4 Apr 2022 15:12:25 GMT
- Title: Introducing ECAPA-TDNN and Wav2Vec2.0 Embeddings to Stuttering Detection
- Authors: Shakeel Ahmad Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni
- Abstract summary: This work introduces the application of speech embeddings extracted with deep models pre-trained on massive audio datasets for different tasks.
In comparison to the standard stuttering detection system trained only on the limited SEP-28k dataset, we obtain a relative improvement of 16.74% in terms of overall accuracy over the baseline.
- Score: 7.42741711946564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The adoption of advanced deep learning (DL) architecture in stuttering
detection (SD) tasks is challenging due to the limited size of the available
datasets. To this end, this work introduces the application of speech
embeddings extracted with pre-trained deep models trained on massive audio
datasets for different tasks. In particular, we explore audio representations
obtained using the emphasized channel attention, propagation and aggregation
time-delay neural network (ECAPA-TDNN) and the Wav2Vec2.0 model, trained on the
VoxCeleb and LibriSpeech datasets, respectively. After extracting the
embeddings, we benchmark several traditional classifiers, such as k-nearest
neighbors, Gaussian naive Bayes, and a neural network, on the
stuttering detection tasks. In comparison to the standard SD system trained
only on the limited SEP-28k dataset, we obtain a relative improvement of 16.74%
in terms of overall accuracy over the baseline. Finally, we show that
combining the two embeddings and concatenating multiple layers of Wav2Vec2.0 can
further improve SD performance by up to 1% and 2.64%, respectively.
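The pipeline in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the SpeechBrain VoxCeleb ECAPA-TDNN checkpoint and the HuggingFace LibriSpeech Wav2Vec2.0 checkpoint stand in for the paper's models, mean-pooling over time and the choice of the last three hidden layers are assumptions, and the paths and labels are hypothetical placeholders rather than SEP-28k itself.

```python
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.neighbors import KNeighborsClassifier

# Pre-trained extractors: ECAPA-TDNN (speaker verification, VoxCeleb) and
# Wav2Vec2.0 (self-supervised speech representation, LibriSpeech).
ecapa = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
processor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
wav2vec2 = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

def combined_embedding(wav_path: str) -> torch.Tensor:
    """Concatenate an ECAPA-TDNN speaker embedding with pooled Wav2Vec2.0 layers."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000)
    with torch.no_grad():
        # Fixed-dimensional speaker embedding (192-d for this checkpoint).
        spk = ecapa.encode_batch(wav).squeeze()
        # Wav2Vec2.0 contextual embeddings; keep all hidden layers so several
        # of them can be concatenated, as the abstract reports helps SD.
        inputs = processor(wav.squeeze().numpy(), sampling_rate=16_000,
                           return_tensors="pt")
        out = wav2vec2(**inputs, output_hidden_states=True)
    # Mean-pool each kept layer over time, then combine the two embeddings.
    layers = [h.mean(dim=1).squeeze() for h in out.hidden_states[-3:]]
    return torch.cat([spk, *layers])

# Benchmark a traditional classifier (here k-NN) on the extracted embeddings.
train_paths, train_labels = ["clip_0001.wav"], [0]  # hypothetical placeholders
X = torch.stack([combined_embedding(p) for p in train_paths]).numpy()
clf = KNeighborsClassifier(n_neighbors=5).fit(X, train_labels)
```

Because the heavy lifting happens in the frozen pre-trained extractors, only the small downstream classifier is fit on the limited stuttering data, which is the point of the approach.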
Related papers
- Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning [50.809769498312434]
We propose a novel dataset pruning method termed Temporal Dual-Depth Scoring (TDDS).
Our method achieves 54.51% accuracy with only 10% training data, surpassing random selection by 7.83% and other comparison methods by at least 12.69%.
arXiv Detail & Related papers (2023-11-22T03:45:30Z)
- Stuttering Detection Using Speaker Representations and Self-supervised Contextual Embeddings [7.42741711946564]
We introduce the application of speech embeddings extracted from pre-trained deep learning models trained on large audio datasets for different tasks.
In comparison to the standard SD systems trained only on the limited SEP-28k dataset, we obtain relative improvements of 12.08%, 28.71%, and 37.9% in terms of unweighted average recall (UAR) over the baselines.
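The UAR reported here is the macro-average of per-class recalls, so every class counts equally regardless of dataset imbalance. A minimal sketch of the computation, with toy labels rather than the paper's data:

```python
from sklearn.metrics import recall_score

y_true = [0, 0, 0, 1, 1, 2]  # hypothetical ground-truth stutter classes
y_pred = [0, 0, 1, 1, 0, 2]  # hypothetical predictions
uar = recall_score(y_true, y_pred, average="macro")  # macro recall == UAR
print(f"UAR = {uar:.3f}")
```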
arXiv Detail & Related papers (2023-06-01T14:00:47Z)
- LEAN: Light and Efficient Audio Classification Network [1.5070398746522742]
We propose LEAN, a lightweight on-device deep learning model for audio classification.
LEAN consists of a raw waveform-based temporal feature extractor, called the wave encoder, and a log-mel-based pretrained YAMNet.
We show that combining the trainable wave encoder and pretrained YAMNet with cross-attention-based temporal realignment yields competitive performance on downstream audio classification tasks with a smaller memory footprint.
arXiv Detail & Related papers (2023-05-22T04:45:04Z)
- BiFSMNv2: Pushing Binary Neural Networks for Keyword Spotting to Real-Network Performance [54.214426436283134]
Deep neural networks, such as the Deep-FSMN, have been widely studied for keyword spotting (KWS) applications.
We present a strong yet efficient binary neural network for KWS, namely BiFSMNv2, pushing it to real-network accuracy.
We highlight that, benefiting from its compact architecture and optimized hardware kernel, BiFSMNv2 achieves an impressive 25.1x speedup and 20.2x storage savings on edge hardware.
arXiv Detail & Related papers (2022-11-13T18:31:45Z)
- Training speaker recognition systems with limited data [2.3148470932285665]
This work considers training neural networks for speaker recognition with a much smaller dataset size compared to contemporary work.
We artificially restrict the amount of data by proposing three subsets of the popular VoxCeleb2 dataset.
We show that the self-supervised, pre-trained weights of wav2vec2 substantially improve performance when training data is limited.
arXiv Detail & Related papers (2022-03-28T12:41:41Z)
- STC speaker recognition systems for the NIST SRE 2021 [56.05258832139496]
This paper presents a description of STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation.
These systems consist of a number of diverse subsystems based on deep neural networks used as feature extractors.
For the video modality, we developed our best solution with the RetinaFace face detector and a deep ResNet face-embedding extractor trained on large face image datasets.
arXiv Detail & Related papers (2021-11-03T15:31:01Z)
- Self-supervised Audiovisual Representation Learning for Remote Sensing Data [96.23611272637943]
We propose a self-supervised approach for pre-training deep neural networks in remote sensing.
This is done in a completely label-free manner by exploiting the correspondence between geo-tagged audio recordings and remote sensing imagery.
We show that our approach outperforms existing pre-training strategies for remote sensing imagery.
arXiv Detail & Related papers (2021-08-02T07:50:50Z)
- A Deep Neural Network for SSVEP-based Brain-Computer Interfaces [3.0595138995552746]
Target identification in brain-computer interface (BCI) spellers refers to the electroencephalogram (EEG) classification for predicting the target character that the subject intends to spell.
In this setting, we address target identification and propose a novel deep neural network (DNN) architecture.
The proposed DNN processes the multi-channel SSVEP with convolutions across the sub-bands of harmonics, channels, and time, and classifies at the fully connected layer.
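The shape of such a network can be sketched roughly as below; the layer widths, kernel sizes, and input dimensions are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

n_subbands, n_channels, n_time, n_classes = 3, 9, 250, 40  # hypothetical sizes

model = nn.Sequential(
    # Combine harmonic sub-bands: (B, subbands, C, T) -> (B, 1, C, T)
    nn.Conv2d(n_subbands, 1, kernel_size=1),
    # Combine EEG channels: (B, 1, C, T) -> (B, 120, 1, T)
    nn.Conv2d(1, 120, kernel_size=(n_channels, 1)),
    # Convolve across time, then classify at a fully connected layer.
    nn.Conv2d(120, 120, kernel_size=(1, 2), stride=(1, 2)),
    nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(n_classes),
)

logits = model(torch.randn(8, n_subbands, n_channels, n_time))  # (8, n_classes)
```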
arXiv Detail & Related papers (2020-11-17T11:11:19Z)
- Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation [63.98724740606457]
We present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge.
Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes.
Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions.
arXiv Detail & Related papers (2020-07-16T15:07:14Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)