Whisper in Focus: Enhancing Stuttered Speech Classification with Encoder
Layer Optimization
- URL: http://arxiv.org/abs/2311.05203v1
- Date: Thu, 9 Nov 2023 08:32:49 GMT
- Title: Whisper in Focus: Enhancing Stuttered Speech Classification with Encoder
Layer Optimization
- Authors: Huma Ameer, Seemab Latif, Rabia Latif, Sana Mukhtar
- Abstract summary: This study investigates the capabilities of Whisper for the classification of disfluency types in stuttered speech.
We make notable contributions in three pivotal areas: enhancing the quality of the SEP-28k benchmark dataset, exploring Whisper for classification, and introducing an efficient encoder layer freezing strategy.
- Score: 0.16385815610837165
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, advancements in the field of speech processing have led to
cutting-edge deep learning algorithms with immense potential for real-world
applications. The automated identification of stuttered speech is one such
application that researchers are addressing with deep learning techniques.
Recently, researchers have utilized Wav2vec2.0, a speech recognition model, to
classify disfluency types in stuttered speech. Although Wav2vec2.0 has shown
commendable results, its ability to generalize across all disfluency types is
limited. In addition, since its base model uses 12 encoder layers, it is
considered a resource-intensive model. Our study investigates the capabilities
of Whisper for the classification of disfluency types in stuttered speech. We
make notable contributions in three pivotal areas: enhancing the quality of the
SEP-28k benchmark dataset, exploring Whisper for classification, and
introducing an efficient encoder layer freezing strategy. The optimized Whisper
model achieves an average F1-score of 0.81, demonstrating its effectiveness.
This study also highlights the significance of the deeper encoder layers in
identifying disfluency types, as the results show that they contribute more
than the initial layers. This research shifts the emphasis towards an efficient
solution, paving the way for future innovation.
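The encoder layer freezing strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the model below is a generic Transformer-encoder classifier standing in for Whisper's encoder, and the layer counts, dimensions, and class count are illustrative assumptions. It shows the core idea of freezing the initial encoder layers so that only the deeper layers (which the paper finds contribute most to disfluency identification) and the classification head are fine-tuned.

```python
import torch.nn as nn


class FrozenEncoderClassifier(nn.Module):
    """Hypothetical sketch of encoder-layer freezing for disfluency
    classification. Not the paper's implementation: a small Transformer
    encoder stands in for Whisper's encoder."""

    def __init__(self, num_layers=6, d_model=64, num_classes=5, freeze_up_to=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)
        # Freeze the first `freeze_up_to` encoder layers; only the deeper
        # layers and the classification head remain trainable.
        for frozen in self.encoder.layers[:freeze_up_to]:
            for p in frozen.parameters():
                p.requires_grad = False

    def forward(self, x):
        # x: (batch, time, d_model) audio features; mean-pool over time,
        # then map to disfluency-class logits.
        return self.head(self.encoder(x).mean(dim=1))


model = FrozenEncoderClassifier()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.2f}")
```

Because the frozen layers receive no gradient updates, the optimizer only needs state for the trainable subset, which is what makes this kind of strategy computationally efficient to fine-tune.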
Related papers
- An Energy-based Model for Word-level AutoCompletion in Computer-aided Translation [97.3797716862478]
Word-level AutoCompletion (WLAC) is a rewarding yet challenging task in Computer-aided Translation.
Existing work addresses this task through a classification model based on a neural network that maps the hidden vector of the input context into its corresponding label.
This work proposes an energy-based model for WLAC, which enables the context hidden vector to capture crucial information from the source sentence.
arXiv Detail & Related papers (2024-07-29T15:07:19Z) - Optimizing Multi-Stuttered Speech Classification: Leveraging Whisper's Encoder for Efficient Parameter Reduction in Automated Assessment [0.14999444543328289]
This research study unveils the contribution of the last encoder layer to the identification of disfluencies in stuttered speech.
It leads to a computationally efficient approach with 83.7% fewer trainable parameters, making the proposed approach more adaptable to various dialects and languages.
arXiv Detail & Related papers (2024-06-09T13:42:51Z) - SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection [31.464227593768324]
We introduce Semantic Hierarchy Nexus (SHiNe), a novel classifier that uses semantic knowledge from class hierarchies.
SHiNe enhances robustness across diverse vocabulary granularities, achieving up to +31.9% mAP50 with ground truth hierarchies.
SHiNe is training-free and can be seamlessly integrated with any off-the-shelf OvOD detector.
arXiv Detail & Related papers (2024-05-16T12:42:06Z) - Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z) - What to Remember: Self-Adaptive Continual Learning for Audio Deepfake
Detection [53.063161380423715]
Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types.
We propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection.
arXiv Detail & Related papers (2023-12-15T09:52:17Z) - Deep Feature Learning for Medical Acoustics [78.56998585396421]
The purpose of this paper is to compare different learnables in medical acoustics tasks.
A framework has been implemented to classify human respiratory sounds and heartbeats into two categories, i.e., healthy or affected by pathologies.
arXiv Detail & Related papers (2022-08-05T10:39:37Z) - E2S2: Encoding-Enhanced Sequence-to-Sequence Pretraining for Language
Understanding and Generation [95.49128988683191]
Sequence-to-sequence (seq2seq) learning is a popular fashion for large-scale pretraining language models.
We propose an encoding-enhanced seq2seq pretraining strategy, namely E2S2.
E2S2 improves the seq2seq models via integrating more efficient self-supervised information into the encoders.
arXiv Detail & Related papers (2022-05-30T08:25:36Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - FluentNet: End-to-End Detection of Speech Disfluency with Deep Learning [23.13972240042859]
We propose an end-to-end deep neural network, FluentNet, capable of detecting a number of different disfluency types.
FluentNet consists of a Squeeze-and-Excitation Residual convolutional neural network, which facilitates the learning of strong spectral frame-level representations.
We present a disfluency dataset based on the public LibriSpeech dataset with synthesized stutters.
arXiv Detail & Related papers (2020-09-23T21:51:29Z) - End-to-End Auditory Object Recognition via Inception Nucleus [7.22898229765707]
We propose a novel end-to-end deep neural network to map the raw waveform inputs to sound class labels.
Our network includes an "inception nucleus" that optimizes the size of convolutional filters on the fly.
arXiv Detail & Related papers (2020-05-25T16:08:41Z) - Decoding Imagined Speech using Wavelet Features and Deep Neural Networks [2.4063592468412267]
This paper proposes a novel approach that uses deep neural networks for classifying imagined speech.
The proposed approach employs only the EEG channels over specific areas of the brain for classification, and derives distinct feature vectors from each of those channels.
The proposed architecture and the approach of treating the data have resulted in an average classification accuracy of 57.15%, which is an improvement of around 35% over the state-of-the-art results.
arXiv Detail & Related papers (2020-03-19T00:36:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.