Tracking Articulatory Dynamics in Speech with a Fixed-Weight BiLSTM-CNN Architecture
- URL: http://arxiv.org/abs/2504.18099v1
- Date: Fri, 25 Apr 2025 05:57:22 GMT
- Title: Tracking Articulatory Dynamics in Speech with a Fixed-Weight BiLSTM-CNN Architecture
- Authors: Leena G Pillai, D. Muhammad Noorul Mubarak, Elizabeth Sherly,
- Abstract summary: This paper presents a novel approach for predicting tongue and lip articulatory features involved in a given speech acoustics.<n>The proposed network is trained with two datasets consisting of simultaneously recorded speech and Electromagnetic Articulography (EMA) datasets.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech production is a complex sequential process which involve the coordination of various articulatory features. Among them tongue being a highly versatile active articulator responsible for shaping airflow to produce targeted speech sounds that are intellectual, clear, and distinct. This paper presents a novel approach for predicting tongue and lip articulatory features involved in a given speech acoustics using a stacked Bidirectional Long Short-Term Memory (BiLSTM) architecture, combined with a one-dimensional Convolutional Neural Network (CNN) for post-processing with fixed weights initialization. The proposed network is trained with two datasets consisting of simultaneously recorded speech and Electromagnetic Articulography (EMA) datasets, each introducing variations in terms of geographical origin, linguistic characteristics, phonetic diversity, and recording equipment. The performance of the model is assessed in Speaker Dependent (SD), Speaker Independent (SI), corpus dependent (CD) and cross corpus (CC) modes. Experimental results indicate that the proposed model with fixed weights approach outperformed the adaptive weights initialization with in relatively minimal number of training epochs. These findings contribute to the development of robust and efficient models for articulatory feature prediction, paving the way for advancements in speech production research and applications.
Related papers
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z) - In-Context Language Learning: Architectures and Algorithms [73.93205821154605]
We study ICL through the lens of a new family of model problems we term in context language learning (ICLL)
We evaluate a diverse set of neural sequence models on regular ICLL tasks.
arXiv Detail & Related papers (2024-01-23T18:59:21Z) - Cross-modal Audio-visual Co-learning for Text-independent Speaker
Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement.
arXiv Detail & Related papers (2023-02-22T10:06:37Z) - Self-supervised models of audio effectively explain human cortical
responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
We show that these results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z) - Repeat after me: Self-supervised learning of acoustic-to-articulatory
mapping by vocal imitation [9.416401293559112]
We propose a computational model of speech production combining a pre-trained neural articulatory synthesizer able to reproduce complex speech stimuli from a limited set of interpretable articulatory parameters.
Both forward and inverse models are jointly trained in a self-supervised way from raw acoustic-only speech data from different speakers.
The imitation simulations are evaluated objectively and subjectively and display quite encouraging performances.
arXiv Detail & Related papers (2022-04-05T15:02:49Z) - A Complementary Joint Training Approach Using Unpaired Speech and Text
for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model.
Inspired by the complementarity of speech-PseudoLabel pair and SynthesizedAudio-text pair, we propose a complementary joint training(CJT) method.
arXiv Detail & Related papers (2022-04-05T07:02:53Z) - Correlation based Multi-phasal models for improved imagined speech EEG
recognition [22.196642357767338]
This work aims to profit from the parallel information contained in multi-phasal EEG data recorded while speaking, imagining and performing articulatory movements corresponding to specific speech units.
A bi-phase common representation learning module using neural networks is designed to model the correlation and between an analysis phase and a support phase.
The proposed approach further handles the non-availability of multi-phasal data during decoding.
arXiv Detail & Related papers (2020-11-04T09:39:53Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - Multi-modal Automated Speech Scoring using Attention Fusion [46.94442359735952]
We propose a novel multi-modal end-to-end neural approach for automated assessment of non-native English speakers' spontaneous speech.
We employ Bi-directional Recurrent Convolutional Neural Networks and Bi-directional Long Short-Term Memory Neural Networks to encode acoustic and lexical cues from spectrograms and transcriptions.
We find combined attention to both lexical and acoustic cues significantly improves the overall performance of the system.
arXiv Detail & Related papers (2020-05-17T07:53:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.