Voice Quality and Pitch Features in Transformer-Based Speech Recognition
- URL: http://arxiv.org/abs/2112.11391v1
- Date: Tue, 21 Dec 2021 17:49:06 GMT
- Title: Voice Quality and Pitch Features in Transformer-Based Speech Recognition
- Authors: Guillermo Cámbara, Jordi Luque, Mireia Farrús
- Abstract summary: We study the effects of incorporating voice quality and pitch features, both jointly and separately, into a Transformer-based ASR model.
We find mean Word Error Rate relative reductions of up to 5.6% on the LibriSpeech benchmark.
- Score: 3.921076451326107
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Jitter and shimmer measurements have been shown to carry voice quality
and prosodic information that enhances the performance of tasks like speaker
recognition, diarization or automatic speech recognition (ASR). However, such
features have seldom been used in the context of neural-based ASR, where
spectral features often prevail. In this work, we study the effects of
incorporating voice quality and pitch features, both jointly and separately, into a
Transformer-based ASR model, with the intuition that the attention mechanisms
might exploit latent prosodic traits. To do so, we propose separate
convolutional front-ends for prosodic and spectral features, showing that this
architectural choice yields better results than simply concatenating such
pitch and voice quality features to the mel-spectrogram filterbanks. Furthermore,
we find mean Word Error Rate relative reductions of up to 5.6% on the
LibriSpeech benchmark. These findings motivate further research on the
application of prosodic knowledge to increase the robustness of
Transformer-based ASR.
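As an illustration of the separate-front-end idea described in the abstract, the sketch below pairs a convolutional stack over mel filterbanks with a second stack over a small prosodic stream (pitch, jitter, shimmer) and concatenates their outputs before a Transformer encoder. This is a minimal PyTorch sketch, not the authors' implementation; all dimensions and layer choices are assumptions.

```python
# Minimal sketch (not the authors' code) of separate convolutional front-ends
# for spectral and prosodic features feeding a Transformer encoder.
import torch
import torch.nn as nn

class DualFrontEndEncoder(nn.Module):
    def __init__(self, n_mels=80, n_pros=3, d_model=256, n_layers=6, n_heads=4):
        super().__init__()
        # Spectral front-end: mel filterbanks -> d_model features (4x subsampling)
        self.spec_frontend = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Prosodic front-end: pitch / jitter / shimmer -> d_model features
        self.pros_frontend = nn.Sequential(
            nn.Conv1d(n_pros, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.proj = nn.Linear(2 * d_model, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, mels, prosody):
        # mels: (batch, time, n_mels), prosody: (batch, time, n_pros)
        s = self.spec_frontend(mels.transpose(1, 2)).transpose(1, 2)
        p = self.pros_frontend(prosody.transpose(1, 2)).transpose(1, 2)
        x = self.proj(torch.cat([s, p], dim=-1))
        return self.encoder(x)

# Usage sketch:
# enc = DualFrontEndEncoder()
# out = enc(torch.randn(2, 400, 80), torch.randn(2, 400, 3))
```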
Related papers
- Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models [21.85677682584916]
We propose a model which performs speculative speech recognition (SSR) by combining an RNN-Transducer-based ASR system with an audio-prefixed language model (LM).
arXiv Detail & Related papers (2024-07-05T16:52:55Z) - Exploring the Integration of Speech Separation and Recognition with
Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z) - Audio-Visual Speech Enhancement with Score-Based Generative Models [22.559617939136505]
This paper introduces an audio-visual speech enhancement system that leverages score-based generative models.
We exploit audio-visual embeddings obtained from a self-supervised learning model that has been fine-tuned on lipreading.
Experimental evaluations show that the proposed audio-visual speech enhancement system yields improved speech quality.
arXiv Detail & Related papers (2023-06-02T10:43:42Z) - PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech
Enhancement [41.872384434583466]
We propose a learning objective that formalizes differences in perceptual quality.
We identify temporal acoustic parameters that are non-differentiable.
We develop a neural network estimator that can accurately predict their time-series values.
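A hedged sketch of how a differentiable proxy for non-differentiable acoustic parameters could serve as an auxiliary loss, in the spirit of the PAAPLoss summary above; the estimator architecture, tensor shapes, and the L1 distance are assumptions, not the paper's exact recipe.

```python
# Illustrative sketch only: a frozen neural estimator predicts frame-level
# acoustic parameters, and an L1 distance between the parameters predicted
# for enhanced and clean speech is added to the enhancement loss.
import torch
import torch.nn as nn

class AcousticParamEstimator(nn.Module):
    """Predicts n_params acoustic-parameter values per frame from spectra."""
    def __init__(self, n_freq=257, n_params=25):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_freq, 256), nn.ReLU(),
            nn.Linear(256, n_params),
        )

    def forward(self, spec):  # spec: (batch, time, n_freq)
        return self.net(spec)

def paap_style_loss(estimator, enhanced_spec, clean_spec):
    # Keep the target branch out of the graph so gradients only shape
    # the enhancement model, not the estimator.
    with torch.no_grad():
        target = estimator(clean_spec)
    pred = estimator(enhanced_spec)
    return nn.functional.l1_loss(pred, target)
```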
arXiv Detail & Related papers (2023-02-16T05:17:06Z) - Similarity and Content-based Phonetic Self Attention for Speech
Recognition [16.206467862132012]
The proposed phonetic self-attention (phSA) is composed of two different types of phonetic attention.
We identify which parts of the original dot product are related to two different attention patterns and improve each part by simple modifications.
Our experiments on phoneme classification and speech recognition show that replacing SA with phSA for lower layers improves the recognition performance without increasing the latency and the parameter size.
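The sketch below shows one way to split the attention score into a query-key similarity term and a key-only content term, loosely following the phonetic self-attention description above; the exact decomposition used in the paper may differ, and the learned content vector `u` is an assumption.

```python
# Rough sketch of an attention score composed of a similarity term (q.k)
# plus a content term depending only on the key (u.k); hypothetical variant.
import torch
import torch.nn as nn

class SplitScoreAttention(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.u = nn.Parameter(torch.zeros(d_model))  # content-only bias vector
        self.scale = d_model ** -0.5

    def forward(self, x):  # x: (batch, time, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        sim = torch.einsum("btd,bsd->bts", q, k)        # similarity term
        content = torch.einsum("d,bsd->bs", self.u, k)  # content term
        scores = (sim + content.unsqueeze(1)) * self.scale
        attn = scores.softmax(dim=-1)
        return attn @ v
```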
arXiv Detail & Related papers (2022-03-19T05:35:26Z) - Advanced Long-context End-to-end Speech Recognition Using
Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z) - High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train the model to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z) - Gated Recurrent Fusion with Joint Training Framework for Robust
End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with joint training framework for robust end-to-end ASR.
The GRF algorithm is used to dynamically combine the noisy and enhanced features.
The proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
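A minimal sketch of gated fusion between noisy and enhanced feature streams, illustrating the idea behind GRF; the actual method uses recurrent gating trained jointly with the ASR loss, so the layer below is a simplified, hypothetical variant.

```python
# Sketch: a sigmoid gate mixes noisy and enhanced features per dimension.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, feat_dim=80):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.Sigmoid())

    def forward(self, noisy, enhanced):  # both: (batch, time, feat_dim)
        g = self.gate(torch.cat([noisy, enhanced], dim=-1))
        # Per-dimension convex combination of the two streams.
        return g * enhanced + (1.0 - g) * noisy
```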
arXiv Detail & Related papers (2020-11-09T08:52:05Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - Weak-Attention Suppression For Transformer Based Speech Recognition [33.30436927415777]
We propose Weak-Attention Suppression (WAS), a method that dynamically induces sparsity in attention probabilities.
We demonstrate that WAS leads to consistent Word Error Rate (WER) improvement over strong transformer baselines.
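The snippet below sketches weak-attention suppression on a matrix of attention probabilities: entries below a dynamic, per-query threshold are zeroed and the remainder renormalized. The threshold form (mean minus a scaled standard deviation) and tensor shapes are assumptions based on the summary above.

```python
# Sketch of suppressing weak attention probabilities and renormalizing.
import torch

def suppress_weak_attention(attn, gamma=0.5, eps=1e-8):
    # attn: (batch, heads, query_len, key_len), each row sums to 1 after softmax
    mean = attn.mean(dim=-1, keepdim=True)
    std = attn.std(dim=-1, keepdim=True)
    threshold = mean - gamma * std
    kept = torch.where(attn >= threshold, attn, torch.zeros_like(attn))
    return kept / (kept.sum(dim=-1, keepdim=True) + eps)
```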
arXiv Detail & Related papers (2020-05-18T23:49:40Z) - Audio Impairment Recognition Using a Correlation-Based Feature
Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs.
We show superior performance in terms of compact feature dimensionality and improved computational speed in the test stage.
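A small sketch of a correlation-based representation: given a set of hand-crafted feature trajectories, keep only the pairwise correlations as a compact fixed-size vector. This mirrors the idea in the summary above rather than the paper's exact pipeline.

```python
# Sketch: compress hand-crafted feature trajectories into their pairwise
# Pearson correlations (upper triangle only).
import numpy as np

def correlation_features(feats):
    # feats: (n_features, n_frames) array of hand-crafted feature trajectories
    corr = np.corrcoef(feats)             # (n_features, n_features)
    iu = np.triu_indices_from(corr, k=1)  # unique feature pairs
    return corr[iu]                       # compact correlation vector
```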
arXiv Detail & Related papers (2020-03-22T13:34:37Z)