Synthesizing audio from tongue motion during speech using tagged MRI via transformer
- URL: http://arxiv.org/abs/2302.07203v1
- Date: Tue, 14 Feb 2023 17:27:55 GMT
- Title: Synthesizing audio from tongue motion during speech using tagged MRI via transformer
- Authors: Xiaofeng Liu, Fangxu Xing, Jerry L. Prince, Maureen Stone, Georges El Fakhri, Jonghye Woo
- Abstract summary: We present an efficient encoder-decoder translation network for exploring the predictive information inherent in 4D motion fields via 2D spectrograms.
Our framework has the potential to improve our understanding of the relationship between these two modalities and inform the development of treatments for speech disorders.
- Score: 13.442093381065268
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Investigating the relationship between internal tissue point motion of the
tongue and oropharyngeal muscle deformation measured from tagged MRI and
intelligible speech can aid in advancing speech motor control theories and
developing novel treatment methods for speech-related disorders. However,
elucidating the relationship between these two sources of information is
challenging, due in part to the disparity in data structure between
spatiotemporal motion fields (i.e., 4D motion fields) and one-dimensional audio
waveforms. In this work, we present an efficient encoder-decoder translation
network for exploring the predictive information inherent in 4D motion fields
via 2D spectrograms as a surrogate of the audio data. Specifically, our encoder
is based on 3D convolutional spatial modeling and transformer-based temporal
modeling. The extracted features are processed by an asymmetric 2D convolution
decoder to generate spectrograms that correspond to 4D motion fields.
Furthermore, we incorporate a generative adversarial training approach into our
framework to further improve the synthesis quality of the generated spectrograms.
We experiment on 63 paired motion field sequences and speech waveforms,
demonstrating that our framework enables the generation of clear audio
waveforms from a sequence of motion fields. Thus, our framework has the
potential to improve our understanding of the relationship between these two
modalities and inform the development of treatments for speech disorders.
Related papers
- Simulating Articulatory Trajectories with Phonological Feature Interpolation [15.482738311360972]
We investigate the forward mapping between pseudo-motor commands and articulatory trajectories.
Two phonological feature sets, based respectively on generative and articulatory phonology, are used to encode a phonetic target sequence.
We discuss the implications of our results for our understanding of the dynamics of biological motion.
arXiv Detail & Related papers (2024-08-08T10:51:16Z)
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
- Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
Co-speech gestures can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
arXiv Detail & Related papers (2024-04-02T11:40:34Z)
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
- Speech Audio Synthesis from Tagged MRI and Non-Negative Matrix Factorization via Plastic Transformer [11.91784203088159]
We develop an end-to-end deep learning framework for translating weighting maps to their corresponding audio waveforms.
Our framework is able to synthesize speech audio waveforms from weighting maps, outperforming conventional convolution and transformer models.
arXiv Detail & Related papers (2023-09-26T00:21:17Z)
- Pathology Synthesis of 3D-Consistent Cardiac MR Images using 2D VAEs and GANs [0.5039813366558306]
We propose a method for generating labeled data for supervised deep-learning (DL) training.
The image synthesis consists of label deformation and label-to-image translation tasks.
We demonstrate that such an approach could provide a solution to diversify and enrich an available database of cardiac MR images.
arXiv Detail & Related papers (2022-09-09T10:17:49Z)
- Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention Guided Heterogeneous Translator [12.685817926272161]
We develop an end-to-end deep learning framework to translate from a sequence of tagged-MRI to its corresponding audio waveform with limited dataset size.
Our framework is based on a novel fully convolutional asymmetric translator guided by a self residual attention strategy.
Our experimental results, carried out with a total of 63 tagged-MRI sequences alongside speech acoustics, showed that our framework enabled the generation of clear audio waveforms.
arXiv Detail & Related papers (2022-06-05T23:08:34Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging [48.7576911714538]
This paper experiments with transfer learning and adaptation of a Tacotron2 text-to-speech model to improve articulatory-to-acoustic mapping.
We use a multi-speaker pre-trained Tacotron2 TTS model and a pre-trained WaveGlow neural vocoder.
arXiv Detail & Related papers (2021-07-26T09:19:20Z)
- Attention and Encoder-Decoder based models for transforming articulatory movements at different speaking rates [60.02121449986413]
We propose an encoder-decoder architecture using LSTMs which generates smoother predicted articulatory trajectories.
We analyze the amplitude of the transformed articulatory movements at different rates compared to their original counterparts.
We observe that AstNet could model both duration and extent of articulatory movements better than the existing transformation techniques.
arXiv Detail & Related papers (2020-06-04T19:33:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.