Ultra2Speech -- A Deep Learning Framework for Formant Frequency
Estimation and Tracking from Ultrasound Tongue Images
- URL: http://arxiv.org/abs/2006.16367v1
- Date: Mon, 29 Jun 2020 20:42:11 GMT
- Title: Ultra2Speech -- A Deep Learning Framework for Formant Frequency
Estimation and Tracking from Ultrasound Tongue Images
- Authors: Pramit Saha, Yadong Liu, Bryan Gick, Sidney Fels
- Abstract summary: This work addresses the articulatory-to-acoustic mapping problem based on ultrasound (US) tongue images.
We use a novel deep learning architecture, the Ultrasound2Formant (U2F) Net, to map US tongue images from a US probe placed beneath a subject's chin to formants.
- Score: 5.606679908174784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Thousands of individuals need surgical removal of their larynx due to
critical diseases every year and therefore require an alternative form of
communication to articulate speech sounds after the loss of their voice box.
This work addresses the articulatory-to-acoustic mapping problem based on
ultrasound (US) tongue images for the development of a silent-speech interface
(SSI) that can provide them with assistance in their daily interactions. Our
approach targets automatically extracting tongue movement information by
selecting an optimal feature set from US images and mapping these features to
the acoustic space. We use a novel deep learning architecture, which we call the
Ultrasound2Formant (U2F) Net, to map US tongue images from the US probe placed
beneath a subject's chin to formants. It uses hybrid spatio-temporal 3D
convolutions followed by feature shuffling, for the estimation and tracking of
vowel formants from US images. The formant values are then utilized to
synthesize continuous time-varying vowel trajectories via the Klatt synthesizer.
Our best model achieves an R-squared (R^2) value of 99.96% for the regression
task. Our network lays the foundation for an SSI as it successfully tracks the
tongue contour automatically as an internal representation without any explicit
annotation.
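For concreteness, the mapping described above can be sketched in code. The following is a minimal, illustrative PyTorch model in the spirit of U2F Net: hybrid spatio-temporal 3D convolutions over a short clip of ultrasound frames, a channel (feature) shuffling step between convolutional blocks, and a small regression head that outputs formant values. The layer sizes, kernel shapes, shuffle placement, and names (`U2FSketch`, `channel_shuffle`) are assumptions for illustration, not the authors' published configuration.

```python
# Minimal sketch of a 3D-CNN formant regressor with feature shuffling.
# All hyperparameters here are illustrative assumptions, not the paper's.
import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups (ShuffleNet-style feature shuffling)."""
    n, c, d, h, w = x.shape
    x = x.view(n, groups, c // groups, d, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, d, h, w)


class U2FSketch(nn.Module):
    """Maps a short clip of ultrasound frames to vowel formants (e.g. F1, F2)."""

    def __init__(self, n_formants: int = 2, groups: int = 4):
        super().__init__()
        self.groups = groups
        self.block1 = nn.Sequential(            # spatio-temporal 3D conv block
            nn.Conv3d(1, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.BatchNorm3d(16), nn.ReLU(), nn.MaxPool3d((1, 2, 2)),
        )
        self.block2 = nn.Sequential(
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm3d(32), nn.ReLU(), nn.MaxPool3d((1, 2, 2)),
        )
        self.head = nn.Sequential(              # regression head
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, n_formants),          # predicted formant values (Hz)
        )

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, 1, frames, height, width)
        x = self.block1(clips)
        x = channel_shuffle(x, self.groups)     # feature shuffling between blocks
        x = self.block2(x)
        return self.head(x)


if __name__ == "__main__":
    model = U2FSketch()
    dummy = torch.randn(2, 1, 8, 64, 128)       # two clips of 8 ultrasound frames
    print(model(dummy).shape)                   # torch.Size([2, 2]) -> (F1, F2)
```

The final step (formant values to audible vowels) can likewise be illustrated with a toy cascade of Klatt-style second-order digital resonators driven by an impulse-train source. This is a deliberately simplified stand-in for the full Klatt synthesizer referenced in the abstract; the bandwidths, F0, and sampling rate below are assumed values.

```python
# Toy vowel synthesis from formant values using Klatt-style resonators.
import numpy as np
from scipy.signal import lfilter


def resonator(x, f_hz, bw_hz, fs):
    """Two-pole resonator: y[n] = A*x[n] + B*y[n-1] + C*y[n-2] (Klatt 1980 form)."""
    c = -np.exp(-2.0 * np.pi * bw_hz / fs)
    b = 2.0 * np.exp(-np.pi * bw_hz / fs) * np.cos(2.0 * np.pi * f_hz / fs)
    a = 1.0 - b - c                             # unity gain at DC
    return lfilter([a], [1.0, -b, -c], x)


def synth_vowel(f1, f2, fs=16000, dur=0.5, f0=120.0):
    """Synthesize a static vowel from (F1, F2) in Hz with an impulse-train source."""
    n = int(fs * dur)
    source = np.zeros(n)
    source[::int(fs / f0)] = 1.0                # crude glottal pulse train
    y = resonator(source, f1, 80.0, fs)         # assumed F1 bandwidth: 80 Hz
    y = resonator(y, f2, 120.0, fs)             # assumed F2 bandwidth: 120 Hz
    return y / (np.max(np.abs(y)) + 1e-9)


audio = synth_vowel(f1=700.0, f2=1200.0)        # roughly an /a/-like vowel
```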
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
- Disentanglement in a GAN for Unconditional Speech Synthesis [28.998590651956153]
We propose AudioStyleGAN -- a generative adversarial network for unconditional speech synthesis tailored to learn a disentangled latent space.
ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer.
We apply it on the small-vocabulary Google Speech Commands digits dataset, where it achieves state-of-the-art results in unconditional speech synthesis.
arXiv Detail & Related papers (2023-07-04T12:06:07Z)
- RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations [13.995231731152462]
We propose RobustL2S, a modularized framework for Lip-to-Speech synthesis.
A non-autoregressive sequence-to-sequence model maps self-supervised visual features to a representation of disentangled speech content.
A vocoder then converts the speech features into raw waveforms.
arXiv Detail & Related papers (2023-07-03T09:13:57Z)
- Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition [55.25565305101314]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems.
This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training.
Experiments conducted on three tasks suggest that incorporating the generated articulatory features consistently outperforms the baseline TDNN and Conformer ASR systems.
arXiv Detail & Related papers (2022-06-15T07:20:28Z)
- Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE [36.50265124324876]
We propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs.
The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference.
Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations.
arXiv Detail & Related papers (2022-06-06T11:51:22Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging [48.7576911714538]
This paper experiments with transfer learning and adaptation of a Tacotron2 text-to-speech model to improve articulatory-to-acoustic mapping.
We use a multi-speaker pre-trained Tacotron2 TTS model and a pre-trained WaveGlow neural vocoder.
arXiv Detail & Related papers (2021-07-26T09:19:20Z)
- Improving Ultrasound Tongue Image Reconstruction from Lip Images Using Self-supervised Learning and Attention Mechanism [1.52292571922932]
Given an observable image sequence of the lips, can we picture the corresponding tongue motion?
We formulate this problem as a self-supervised learning problem and employ a two-stream convolutional network and a long short-term memory network, together with an attention mechanism, for the learning task.
The results show that our model is able to generate images close to the real ultrasound tongue images and achieves a match between the two imaging modalities.
arXiv Detail & Related papers (2021-06-20T10:51:23Z)
- Convolutional Neural Network-Based Age Estimation Using B-Mode Ultrasound Tongue Image [10.100437437151621]
We explore the feasibility of age estimation using ultrasound tongue images of the speakers.
Motivated by the success of deep learning, this paper leverages deep learning on this task.
The developed method can be used as a tool to evaluate the performance of speech therapy sessions.
arXiv Detail & Related papers (2021-01-27T08:00:47Z)
- Multi-view Temporal Alignment for Non-parallel Articulatory-to-Acoustic Speech Synthesis [59.623780036359655]
Articulatory-to-acoustic (A2A) synthesis refers to the generation of audible speech from captured movement of the speech articulators.
This technique has numerous applications, such as restoring oral communication to people who can no longer speak due to illness or injury.
We propose a solution to this problem based on the theory of multi-view learning.
arXiv Detail & Related papers (2020-12-30T15:09:02Z)