Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention
Guided Heterogeneous Translator
- URL: http://arxiv.org/abs/2206.02284v2
- Date: Thu, 9 Jun 2022 16:27:16 GMT
- Title: Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention
Guided Heterogeneous Translator
- Authors: Xiaofeng Liu, Fangxu Xing, Jerry L. Prince, Jiachen Zhuo, Maureen
Stone, Georges El Fakhri, Jonghye Woo
- Abstract summary: We develop an end-to-end deep learning framework to translate from a sequence of tagged-MRI to its corresponding audio waveform with limited dataset size.
Our framework is based on a novel fully convolutional asymmetry translator guided by a self residual attention strategy.
Our experimental results, carried out with a total of 63 tagged-MRI sequences alongside speech acoustics, showed that our framework enabled the generation of clear audio waveforms.
- Score: 12.685817926272161
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the underlying relationship between tongue and oropharyngeal
muscle deformation seen in tagged-MRI and intelligible speech plays an
important role in advancing speech motor control theories and treatment of
speech-related disorders. Because of their heterogeneous representations,
however, direct mapping between the two modalities -- i.e., a two-dimensional
(mid-sagittal slice) plus time tagged-MRI sequence and its corresponding
one-dimensional waveform -- is not straightforward. Instead, we resort to
two-dimensional spectrograms as an intermediate representation, which captures
both pitch and resonance, and develop an end-to-end deep learning framework to
translate a sequence of tagged-MRI to its corresponding audio waveform with a
limited dataset size. Our framework is based on a novel fully convolutional
asymmetry translator guided by a self residual attention strategy to
specifically exploit the moving muscular structures during speech. In addition,
we leverage the pairwise correlation of samples with the same utterance through
a latent space representation disentanglement strategy. Furthermore, we
incorporate adversarial training with generative adversarial networks to
improve the realism of the generated spectrograms. Our experimental results,
carried out with a total of 63 tagged-MRI sequences alongside speech acoustics,
showed that our framework enabled the generation of clear audio waveforms from
a sequence of tagged-MRI, surpassing competing methods. Thus, our framework
holds great potential to help better understand the relationship between the
two modalities.
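
As a concrete illustration of the kind of translator the abstract describes, the sketch below maps a tagged-MRI frame sequence to a 2D spectrogram with a fully convolutional encoder-decoder and a residual self-attention gate. It is a minimal PyTorch sketch, not the authors' implementation: the frame count, channel widths, spectrogram size, and the names ResidualSelfAttention and Mri2SpecTranslator are illustrative assumptions, and the adversarial, pairwise-correlation, and disentanglement losses mentioned in the abstract are omitted.

# Minimal sketch (assumed shapes and widths, not the authors' exact architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualSelfAttention(nn.Module):
    """Self-attention over spatial positions, added back as a residual."""

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, hw, c//8)
        k = self.key(x).flatten(2)                      # (b, c//8, hw)
        v = self.value(x).flatten(2)                    # (b, c, hw)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)   # (b, hw, hw)
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        return x + self.gamma * out                     # residual attention


class Mri2SpecTranslator(nn.Module):
    """Tagged-MRI frames (time treated as channels) -> 2D spectrogram."""

    def __init__(self, num_frames: int = 26, spec_size=(128, 128)):
        super().__init__()
        self.spec_size = spec_size
        self.encoder = nn.Sequential(
            nn.Conv2d(num_frames, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            ResidualSelfAttention(128),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1),
        )

    def forward(self, frames):
        spec = self.decoder(self.encoder(frames))
        # resize to the target spectrogram resolution
        return F.interpolate(spec, size=self.spec_size, mode="bilinear",
                             align_corners=False)


if __name__ == "__main__":
    model = Mri2SpecTranslator(num_frames=26)
    mri = torch.randn(2, 26, 128, 128)    # batch of mid-sagittal frame stacks
    target = torch.randn(2, 1, 128, 128)  # paired (log-)spectrograms
    loss = F.l1_loss(model(mri), target)  # reconstruction term only
    loss.backward()
    print(loss.item())

At inference time the predicted spectrogram would still need to be inverted to a waveform, e.g., with a Griffin-Lim style phase-reconstruction step or a neural vocoder; the abstract does not specify which inversion is used, so that choice is left open here.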
Related papers
- Multimodal Segmentation for Vocal Tract Modeling [4.95865031722089]
Real-time magnetic resonance imaging (RT-MRI) allows measuring precise movements of internal articulators during speech.
We first present a deep labeling strategy for the RT-MRI video using a vision-only segmentation approach.
We then introduce a multimodal algorithm using audio to improve segmentation of vocal articulators.
arXiv Detail & Related papers (2024-06-22T06:44:38Z) - Speech motion anomaly detection via cross-modal translation of 4D motion
fields from tagged MRI [12.515470808059666]
We aim to develop a framework for detecting speech motion anomalies in conjunction with their corresponding speech acoustics.
This is achieved through the use of a deep cross-modal translator trained on data from healthy individuals only.
A one-class SVM is then used to distinguish the spectrograms of healthy individuals from those of patients.
arXiv Detail & Related papers (2024-02-10T16:16:24Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Speech Audio Synthesis from Tagged MRI and Non-Negative Matrix
Factorization via Plastic Transformer [11.91784203088159]
We develop an end-to-end deep learning framework for translating weighting maps to their corresponding audio waveforms.
Our framework is able to synthesize speech audio waveforms from weighting maps, outperforming conventional convolution and transformer models.
arXiv Detail & Related papers (2023-09-26T00:21:17Z) - Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology
Report Generation [48.723504098917324]
We propose an Unify, Align and then Refine (UAR) approach to learn multi-level cross-modal alignments.
We introduce three novel modules: Latent Space Unifier, Cross-modal Representation Aligner and Text-to-Image Refiner.
Experiments and analyses on IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of our UAR against varied state-of-the-art methods.
arXiv Detail & Related papers (2023-03-28T12:42:12Z) - Joint fMRI Decoding and Encoding with Latent Embedding Alignment [77.66508125297754]
We introduce a unified framework that addresses both fMRI decoding and encoding.
Our model concurrently recovers visual stimuli from fMRI signals and predicts brain activity from images within a unified framework.
arXiv Detail & Related papers (2023-03-26T14:14:58Z) - Synthesizing audio from tongue motion during speech using tagged MRI via
transformer [13.442093381065268]
We present an efficient deformation-decoder translation network for exploring the predictive information inherent in 4D motion fields via 2D spectrograms.
Our framework has the potential to improve our understanding of the relationship between these two modalities and inform the development of treatments for speech disorders.
arXiv Detail & Related papers (2023-02-14T17:27:55Z) - Audio-visual multi-channel speech separation, dereverberation and
recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z) - Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations.
We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
arXiv Detail & Related papers (2021-05-28T14:25:49Z) - Continuous Speech Separation with Conformer [60.938212082732775]
We use transformer and conformer in lieu of recurrent neural networks in the separation system.
We believe capturing global information with the self-attention based method is crucial for the speech separation.
arXiv Detail & Related papers (2020-08-13T09:36:05Z) - Learning Joint Articulatory-Acoustic Representations with Normalizing
Flows [7.183132975698293]
We find a joint latent representation between the articulatory and acoustic domain for vowel sounds via invertible neural network models.
Our approach achieves both articulatory-to-acoustic and acoustic-to-articulatory mapping, demonstrating a successful joint encoding of the two domains.
arXiv Detail & Related papers (2020-05-16T04:34:36Z)