Investigation of Self-supervised Pre-trained Models for Classification
of Voice Quality from Speech and Neck Surface Accelerometer Signals
- URL: http://arxiv.org/abs/2308.03226v1
- Date: Sun, 6 Aug 2023 23:16:54 GMT
- Authors: Sudarsana Reddy Kadiri, Farhad Javanmardi, Paavo Alku
- Abstract summary: This study examines simultaneously-recorded speech and NSA signals in the classification of voice quality.
The effectiveness of pre-trained models is compared in feature extraction between glottal source waveforms and raw signal waveforms for both speech and NSA inputs.
- Score: 27.398425786898223
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prior studies in the automatic classification of voice quality have mainly
studied the use of the acoustic speech signal as input. Recently, a few studies
have been carried out by jointly using both speech and neck surface
accelerometer (NSA) signals as inputs, and by extracting MFCCs and glottal
source features. This study examines simultaneously-recorded speech and NSA
signals in the classification of voice quality (breathy, modal, and pressed)
using features derived from three self-supervised pre-trained models
(wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) and using an SVM as well as CNNs as
classifiers. Furthermore, the effectiveness of the pre-trained models in
feature extraction is compared between glottal source waveforms and raw signal
waveforms for both speech and NSA inputs. Using two signal processing methods
(quasi-closed phase (QCP) glottal inverse filtering and zero frequency
filtering (ZFF)), glottal source waveforms are estimated from both speech and
NSA signals. The study has three main goals: (1) to study whether features
derived from pre-trained models improve classification accuracy compared to
conventional features (spectrogram, mel-spectrogram, MFCCs, i-vector, and
x-vector), (2) to investigate which of the two modalities (speech vs. NSA) is
more effective in the classification task with pre-trained model-based
features, and (3) to evaluate whether the deep learning-based CNN classifier
can enhance the classification accuracy in comparison to the SVM classifier.
The results revealed that the NSA input yielded better classification
performance than the speech signal. Among the features, the pre-trained
model-based features gave better classification accuracies than the
conventional features for both speech and NSA inputs. It was also
found that the HuBERT features performed better than the wav2vec2-BASE and
wav2vec2-LARGE features.
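As a rough illustration of the feature-extraction-plus-classifier recipe described above (not the authors' exact pipeline), the sketch below mean-pools frame-level embeddings from torchaudio's pre-trained HuBERT model into utterance-level vectors and fits an RBF-kernel SVM on them. The helper names, file-path handling, label strings, and pooling choice are assumptions made for illustration; the same idea applies to wav2vec2 features, or to NSA and estimated glottal source waveforms fed in place of the speech signal.

```python
# A minimal sketch, assuming torchaudio and scikit-learn are available; this is
# an illustrative recipe, not the authors' exact implementation or settings.
import numpy as np
import torch
import torchaudio
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

bundle = torchaudio.pipelines.HUBERT_BASE        # pre-trained HuBERT (BASE) weights
model = bundle.get_model().eval()

def utterance_embedding(path: str) -> np.ndarray:
    """Mean-pool the last-layer HuBERT features of one utterance."""
    wav, sr = torchaudio.load(path)              # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)          # force mono
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    with torch.inference_mode():
        features, _ = model(wav)                 # (1, frames, feature_dim)
    return features.mean(dim=1).squeeze(0).numpy()

def train_voice_quality_svm(paths, labels):
    """Fit an RBF-kernel SVM on pooled embeddings.
    `paths`: audio files (speech, NSA, or estimated glottal source waveforms);
    `labels`: e.g. 'breathy' / 'modal' / 'pressed' (hypothetical label strings)."""
    X = np.stack([utterance_embedding(p) for p in paths])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    return clf.fit(X, labels)
```

A CNN classifier, as compared against the SVM in the study, would instead consume the frame-level feature matrix rather than the pooled utterance vector.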
Related papers
- Deepfake Audio Detection Using Spectrogram-based Feature and Ensemble of Deep Learning Models [42.39774323584976]
We propose a deep learning based system for the task of deepfake audio detection.
In particular, the raw input audio is first transformed into various spectrograms.
We leverage the state-of-the-art audio pre-trained models of Whisper, Seamless, Speechbrain, and Pyannote to extract audio embeddings.
arXiv Detail & Related papers (2024-07-01T20:10:43Z)
- Comparative Analysis of the wav2vec 2.0 Feature Extractor [42.18541127866435]
We study the capability of the wav2vec 2.0 feature extractor to replace the standard feature extraction methods in a connectionist temporal classification (CTC) ASR model.
We show that both are competitive with traditional feature extractors on the LibriSpeech benchmark and analyze the effect of the individual components.
arXiv Detail & Related papers (2023-08-08T14:29:35Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities allows speech to be better recognized in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - Non-Intrusive Binaural Speech Intelligibility Prediction from Discrete
Latent Representations [1.1472707084860878]
Speech intelligibility (SI) prediction from signals is useful in many applications.
Measures specifically designed to take into account the properties of the signal are often intrusive.
This paper proposes a non-intrusive SI measure that computes features from an input signal using a combination of vector quantization (VQ) and contrastive predictive coding (CPC) methods.
arXiv Detail & Related papers (2021-11-24T14:55:04Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application [63.2243126704342]
This study presents a deep learning-based speech signal-processing mobile application known as CITISEN.
CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC).
Compared with the noisy speech signals, the enhanced speech signals achieved improvements of about 6% and 33%.
arXiv Detail & Related papers (2020-08-21T02:04:12Z)
- Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one framework by a joint optimisation using deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism is employed to highlight the speaker related features learned from context information in time and frequency domain.
The obtained results show that the proposed approach using speech enhancement and multi-stage attention models outperforms two strong baselines not using them in most acoustic conditions in our experiments.
arXiv Detail & Related papers (2020-01-14T20:03:07Z)