Speech Audio Synthesis from Tagged MRI and Non-Negative Matrix
Factorization via Plastic Transformer
- URL: http://arxiv.org/abs/2309.14586v1
- Date: Tue, 26 Sep 2023 00:21:17 GMT
- Title: Speech Audio Synthesis from Tagged MRI and Non-Negative Matrix
Factorization via Plastic Transformer
- Authors: Xiaofeng Liu, Fangxu Xing, Maureen Stone, Jiachen Zhuo, Sidney Fels,
Jerry L. Prince, Georges El Fakhri, Jonghye Woo
- Abstract summary: We develop an end-to-end deep learning framework for translating weighting maps to their corresponding audio waveforms.
Our framework is able to synthesize speech audio waveforms from weighting maps, outperforming conventional convolution and transformer models.
- Score: 11.91784203088159
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The tongue's intricate 3D structure, comprising localized functional units,
plays a crucial role in the production of speech. When measured using tagged
MRI, these functional units exhibit cohesive displacements and derived
quantities that facilitate the complex process of speech production.
Non-negative matrix factorization-based approaches have been shown to estimate
the functional units through motion features, yielding a set of building blocks
and a corresponding weighting map. Investigating the link between weighting
maps and speech acoustics can offer significant insights into the intricate
process of speech production. To this end, in this work, we utilize
two-dimensional spectrograms as a proxy representation, and develop an
end-to-end deep learning framework for translating weighting maps to their
corresponding audio waveforms. Our proposed plastic light transformer (PLT)
framework is based on directional product relative position bias and
single-level spatial pyramid pooling, enabling flexible mapping of
variable-size weighting maps to fixed-size spectrograms without input
information loss or dimension expansion. Additionally, our PLT framework
efficiently models the global correlation of wide-matrix inputs. To improve the
realism of our generated spectrograms with relatively limited training samples,
we apply pair-wise utterance consistency with Maximum Mean Discrepancy
constraint and adversarial training. Experimental results on a dataset of 29
subjects speaking two utterances demonstrated that our framework is able to
synthesize speech audio waveforms from weighting maps, outperforming
conventional convolution and transformer models.
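
The abstract describes NMF-based estimation of functional units from motion features, yielding a set of building blocks and a corresponding weighting map. The following minimal sketch shows such a factorization with scikit-learn; the matrix sizes, the number of units, and the variable names are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch (not the paper's pipeline): factorizing a non-negative
# motion-feature matrix into building blocks and a weighting map via NMF.
import numpy as np
from sklearn.decomposition import NMF

# Illustrative dimensions (assumed): tongue points x motion features per point.
n_points, n_features, n_units = 5000, 20, 4

rng = np.random.default_rng(0)
motion_features = rng.random((n_points, n_features))  # must be non-negative

nmf = NMF(n_components=n_units, init="nndsvda", max_iter=500, random_state=0)
weighting_map = nmf.fit_transform(motion_features)   # (n_points, n_units)
building_blocks = nmf.components_                     # (n_units, n_features)

# Each point can be assigned to the functional unit with the largest weight.
unit_labels = weighting_map.argmax(axis=1)
print(weighting_map.shape, building_blocks.shape, unit_labels[:10])
```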
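The PLT framework maps variable-size weighting maps to fixed-size spectrograms via single-level spatial pyramid pooling. The sketch below illustrates only that variable-to-fixed idea, using adaptive average pooling in PyTorch; the channel counts, pooled grid, and 128x128 output size are assumptions, and the transformer blocks and directional position bias of the actual model are omitted.

```python
# Sketch of variable-size input -> fixed-size output via a single
# spatial-pyramid-pooling level (adaptive average pooling). Sizes are assumed.
import torch
import torch.nn as nn

class VariableToFixed(nn.Module):
    def __init__(self, in_channels: int = 4, pooled: int = 16):
        super().__init__()
        self.encode = nn.Conv2d(in_channels, 64, kernel_size=3, padding=1)
        # Single pyramid level: any HxW feature map is pooled to pooled x pooled.
        self.spp = nn.AdaptiveAvgPool2d((pooled, pooled))
        self.decode = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * pooled * pooled, 128 * 128),
        )

    def forward(self, weighting_map: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.encode(weighting_map))
        x = self.spp(x)                        # fixed size regardless of input HxW
        return self.decode(x).view(-1, 1, 128, 128)  # spectrogram-shaped output

model = VariableToFixed()
for h, w in [(40, 90), (56, 120)]:             # two different weighting-map sizes
    print(model(torch.rand(1, 4, h, w)).shape)  # always (1, 1, 128, 128)
```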
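Training additionally uses a pair-wise utterance consistency constraint based on Maximum Mean Discrepancy (MMD). Below is a generic biased RBF-kernel MMD estimate between two feature batches, the standard form such a distribution-level constraint often takes; the kernel bandwidth and feature shapes are assumptions.

```python
# Generic (biased) RBF-kernel MMD^2 between two feature batches, one common way
# to impose a distribution-level consistency constraint. Bandwidth is assumed.
import torch

def mmd_rbf(x: torch.Tensor, y: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    """x: (n, d), y: (m, d) -> scalar MMD^2 estimate with an RBF kernel."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * bandwidth ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

# Example: features of spectrograms generated for the same utterance.
feat_a, feat_b = torch.randn(8, 256), torch.randn(8, 256)
consistency_loss = mmd_rbf(feat_a, feat_b)
print(consistency_loss.item())
```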
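Since the model predicts 2D spectrograms as a proxy representation, a final step recovers an audio waveform from the predicted magnitude spectrogram. The abstract does not state the inversion method, so the sketch below uses Griffin-Lim from librosa purely as a common illustration; the STFT parameters are assumptions.

```python
# Recovering a waveform from a magnitude spectrogram with Griffin-Lim.
# This stands in for whatever inversion/vocoder the paper actually uses;
# the STFT parameters below are assumptions for illustration.
import numpy as np
import librosa

sr, n_fft, hop = 16000, 512, 128
wave = librosa.tone(220, sr=sr, duration=1.0)                   # stand-in audio
mag = np.abs(librosa.stft(wave, n_fft=n_fft, hop_length=hop))   # "predicted" magnitude

recovered = librosa.griffinlim(mag, n_iter=60, hop_length=hop, n_fft=n_fft)
print(recovered.shape)
```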
Related papers
- Phononic materials with effectively scale-separated hierarchical features using interpretable machine learning [57.91994916297646]
Architected hierarchical phononic materials have shown promising tunability of elastodynamic waves and vibrations over multiple frequency ranges.
In this article, hierarchical unit-cells are obtained, where features at each length scale result in a band gap within a targeted frequency range.
Our approach offers a flexible and efficient method for the exploration of new regions in the hierarchical design space.
arXiv Detail & Related papers (2024-08-15T21:35:06Z) - Simulating Articulatory Trajectories with Phonological Feature Interpolation [15.482738311360972]
We investigate the forward mapping between pseudo-motor commands and articulatory trajectories.
Two phonological feature sets, based respectively on generative and articulatory phonology, are used to encode a phonetic target sequence.
We discuss the implications of our results for our understanding of the dynamics of biological motion.
arXiv Detail & Related papers (2024-08-08T10:51:16Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Cross-modal Audio-visual Co-learning for Text-independent Speaker
Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement.
arXiv Detail & Related papers (2023-02-22T10:06:37Z) - Synthesizing audio from tongue motion during speech using tagged MRI via
transformer [13.442093381065268]
We present an efficient deformation-decoder translation network for exploring the predictive information inherent in 4D motion fields via 2D spectrograms.
Our framework has the potential to improve our understanding of the relationship between these two modalities and inform the development of treatments for speech disorders.
arXiv Detail & Related papers (2023-02-14T17:27:55Z) - Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention
Guided Heterogeneous Translator [12.685817926272161]
We develop an end-to-end deep learning framework to translate a sequence of tagged MRI into its corresponding audio waveform with a limited dataset size.
Our framework is based on a novel fully convolutional asymmetry translator with guidance of a self residual attention strategy.
Our experimental results, carried out with a total of 63 tagged-MRI sequences alongside speech acoustics, showed that our framework enabled the generation of clear audio waveforms.
arXiv Detail & Related papers (2022-06-05T23:08:34Z) - A microstructure estimation Transformer inspired by sparse
representation for diffusion MRI [11.761543033212797]
We present a learning-based framework based on Transformer for dMRI-based microstructure estimation with downsampled q-space data.
The proposed method achieved up to 11.25 folds of acceleration in scan time and outperformed the other state-of-the-art learning-based methods.
arXiv Detail & Related papers (2022-05-13T05:14:22Z) - How to See Hidden Patterns in Metamaterials with Interpretable Machine
Learning [82.67551367327634]
We develop a new interpretable, multi-resolution machine learning framework for finding patterns in the unit-cells of materials.
Specifically, we propose two new interpretable representations of metamaterials, called shape-frequency features and unit-cell templates.
arXiv Detail & Related papers (2021-11-10T21:19:02Z) - End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - A Deep Joint Sparse Non-negative Matrix Factorization Framework for
Identifying the Common and Subject-specific Functional Units of Tongue Motion
During Speech [7.870139900799612]
We develop a new deep learning framework to identify common and subject-specific functional units of tongue motion during speech.
We transform NMF with sparse and graph regularizations into modular architectures akin to deep neural networks.
arXiv Detail & Related papers (2020-07-09T15:05:44Z)