Acoustic-to-articulatory Inversion based on Speech Decomposition and
Auxiliary Feature
- URL: http://arxiv.org/abs/2204.00873v1
- Date: Sat, 2 Apr 2022 14:47:19 GMT
- Title: Acoustic-to-articulatory Inversion based on Speech Decomposition and
Auxiliary Feature
- Authors: Jianrong Wang, Jinyu Liu, Longxuan Zhao, Shanyu Wang, Ruiguo Yu, Li
Liu
- Abstract summary: We pre-train a speech decomposition network to decompose audio speech into a speaker embedding and a content embedding.
We then propose a novel auxiliary feature network to estimate lip auxiliary features from the personalized speech features.
Experimental results show that, compared with the state-of-the-art method that uses only the audio speech feature, the proposed method reduces the average RMSE by 0.25 and increases the average correlation coefficient by 2.0% in the speaker-dependent case.
- Score: 7.363994037183394
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Acoustic-to-articulatory inversion (AAI) aims to recover the movements
of the articulators from speech signals. Achieving speaker-independent AAI
remains a challenge given the limited data available. Moreover, most current
works use only audio speech as input, causing an inevitable performance
bottleneck. To solve these problems, we first pre-train a speech decomposition
network that decomposes audio speech into a speaker embedding and a content
embedding, which serve as new personalized speech features suited to the
speaker-independent case. Second, to further improve AAI, we propose a novel
auxiliary feature network that estimates lip auxiliary features from these
personalized speech features. Experimental results on three public datasets
show that, compared with the state-of-the-art method that uses only the audio
speech feature, the proposed method reduces the average RMSE by 0.25 and
increases the average correlation coefficient by 2.0% in the speaker-dependent
case. More importantly, the average RMSE decreases by 0.29 and the average
correlation coefficient increases by 5.0% in the speaker-independent case.
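The two-stage pipeline described in the abstract can be pictured as three cooperating modules. Below is a minimal, hypothetical PyTorch sketch, not the authors' implementation: every module name, layer choice, and dimension is an assumption made for illustration.

```python
# Illustrative sketch only: all module names, dimensions, and wiring below
# are assumptions; the paper does not publish this code.
import torch
import torch.nn as nn

class SpeechDecompositionNet(nn.Module):
    """Decomposes acoustic features into a speaker embedding and a
    content embedding (pre-trained, per the paper's first stage)."""
    def __init__(self, n_mels=80, spk_dim=128, content_dim=256):
        super().__init__()
        self.speaker_encoder = nn.GRU(n_mels, spk_dim, batch_first=True)
        self.content_encoder = nn.GRU(n_mels, content_dim, batch_first=True)

    def forward(self, mel):                      # mel: (B, T, n_mels)
        _, spk = self.speaker_encoder(mel)       # last hidden state
        content, _ = self.content_encoder(mel)   # per-frame content
        return spk.squeeze(0), content           # (B, spk_dim), (B, T, content_dim)

class AuxiliaryFeatureNet(nn.Module):
    """Estimates lip auxiliary features from the personalized speech features."""
    def __init__(self, spk_dim=128, content_dim=256, lip_dim=20):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(spk_dim + content_dim, 256), nn.ReLU(),
            nn.Linear(256, lip_dim),
        )

    def forward(self, spk, content):
        spk_tiled = spk.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.proj(torch.cat([spk_tiled, content], dim=-1))  # (B, T, lip_dim)

class InversionNet(nn.Module):
    """Regresses articulator trajectories (e.g. EMA channels) from the
    personalized speech features plus the estimated lip features."""
    def __init__(self, spk_dim=128, content_dim=256, lip_dim=20, ema_dim=12):
        super().__init__()
        self.rnn = nn.GRU(spk_dim + content_dim + lip_dim, 256,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(512, ema_dim)

    def forward(self, spk, content, lip):
        spk_tiled = spk.unsqueeze(1).expand(-1, content.size(1), -1)
        h, _ = self.rnn(torch.cat([spk_tiled, content, lip], dim=-1))
        return self.out(h)                        # (B, T, ema_dim)

# Forward pass on dummy data.
mel = torch.randn(4, 200, 80)                     # batch of 4 utterances
decomp, aux, inv = SpeechDecompositionNet(), AuxiliaryFeatureNet(), InversionNet()
spk, content = decomp(mel)
lip = aux(spk, content)
ema = inv(spk, content, lip)
print(ema.shape)                                  # torch.Size([4, 200, 12])
```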
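The reported numbers are the average RMSE and average correlation coefficient over articulatory trajectories. A minimal sketch of how such metrics are commonly computed; the paper's exact per-channel and per-speaker averaging protocol is an assumption here.

```python
# Sketch of the two reported AAI metrics, computed per articulatory
# channel and then averaged (averaging protocol assumed, not from the paper).
import numpy as np

def aai_metrics(pred, target):
    """pred, target: (T, n_channels) articulator trajectories (e.g. EMA)."""
    rmse = np.sqrt(np.mean((pred - target) ** 2, axis=0))        # per channel
    cc = np.array([np.corrcoef(pred[:, c], target[:, c])[0, 1]
                   for c in range(pred.shape[1])])               # Pearson CC
    return rmse.mean(), cc.mean()

# Dummy trajectories: a noisy prediction of two sinusoidal channels.
t = np.linspace(0, 2 * np.pi, 500)
target = np.stack([np.sin(t), np.cos(t)], axis=1)
pred = target + 0.1 * np.random.randn(*target.shape)
print(aai_metrics(pred, target))
```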
Related papers
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
The first is speaker-regularized spectral basis embedding (SBE) features, which exploit a special regularization term to enforce homogeneity of speaker features during adaptation.
The second is feature-based learning hidden unit contributions (f-LHUC), conditioned on VR-LH features shown to be insensitive to the quantity of speaker-level data in test-time adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
- Convoifilter: A case study of doing cocktail party speech recognition [59.80042864360884]
The model can decrease ASR's word error rate (WER) from 80% to 26.4% through this approach.
We openly share our pre-trained model to foster further research: hf.co/nguyenvulebinh/voice-filter.
arXiv Detail & Related papers (2023-08-22T12:09:30Z)
- Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
The benefit of video input is consistently demonstrated in mask-based MVDR speech separation and in DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-ends.
Experiments were conducted on overlapped and reverberant speech data constructed by simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z)
- Use of Speech Impairment Severity for Dysarthric Speech Recognition [37.93801885333925]
This paper proposes a novel set of techniques to use both severity and speaker-identity in dysarthric speech recognition.
Experiments conducted on UASpeech suggest that incorporating speech impairment severity benefits state-of-the-art hybrid DNN, E2E Conformer, and pre-trained Wav2vec 2.0 ASR systems.
arXiv Detail & Related papers (2023-05-18T02:42:59Z)
- Learning from human perception to improve automatic speaker verification in style-mismatched conditions [21.607777746331998]
Our prior experiments show that humans and machines seem to employ different approaches to speaker discrimination.
We use insights learnt from human perception to design a new training loss function that we refer to as "CllrCE loss".
CllrCE loss uses both speaker-specific idiosyncrasies and relative acoustic distances between speakers to train the ASV system.
arXiv Detail & Related papers (2022-06-28T01:24:38Z)
- Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z)
- Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to a 2.92% absolute word error rate (WER) reduction (a minimal speed-perturbation sketch appears after this list).
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
- Bayesian Learning for Deep Neural Network Adaptation [57.70991105736059]
A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences.
Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness.
This paper proposes a full Bayesian learning based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty.
arXiv Detail & Related papers (2020-12-14T12:30:41Z)
- Data augmentation using prosody and false starts to recognize non-native children's speech [12.911954427107977]
This paper describes AaltoASR's speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition.
The task is to recognize non-native speech from children of various age groups given a limited amount of speech.
arXiv Detail & Related papers (2020-08-29T05:32:32Z)
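As referenced in the data-augmentation entry above, speed perturbation resamples the waveform so that both duration and pitch change. A minimal sketch under the usual Kaldi-style convention; the perturbation factors and library choice here are assumptions, not details from that paper.

```python
# Sketch of Kaldi-style speed perturbation: resampling the waveform by a
# factor changes both duration and pitch. Factors 0.9/1.0/1.1 are the
# conventional choice (assumed here).
import numpy as np
from scipy.signal import resample_poly

def speed_perturb(wav, factor):
    """Speed up (factor > 1) or slow down (factor < 1) a waveform."""
    # Playing the signal 'factor' times faster = resampling by 1/factor.
    up, down = 100, int(round(100 * factor))
    return resample_poly(wav, up, down)

sr = 16000
wav = np.random.randn(sr)          # one second of dummy audio
fast = speed_perturb(wav, 1.1)     # ~0.91 s, higher pitch
slow = speed_perturb(wav, 0.9)     # ~1.11 s, lower pitch
print(len(fast), len(slow))
```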