Multimodal Segmentation for Vocal Tract Modeling
- URL: http://arxiv.org/abs/2406.15754v1
- Date: Sat, 22 Jun 2024 06:44:38 GMT
- Title: Multimodal Segmentation for Vocal Tract Modeling
- Authors: Rishi Jain, Bohan Yu, Peter Wu, Tejas Prabhune, Gopala Anumanchipalli
- Abstract summary: Real-time magnetic resonance imaging (RT-MRI) allows measuring precise movements of internal articulators during speech.
We first present a deep labeling strategy for the RT-MRI video using a vision-only segmentation approach.
We then introduce a multimodal algorithm using audio to improve segmentation of vocal articulators.
- Score: 4.95865031722089
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate modeling of the vocal tract is necessary to construct articulatory representations for interpretable speech processing and linguistics. However, vocal tract modeling is challenging because many internal articulators are occluded from external motion capture technologies. Real-time magnetic resonance imaging (RT-MRI) allows measuring precise movements of internal articulators during speech, but annotated datasets of MRI are limited in size due to time-consuming and computationally expensive labeling methods. We first present a deep labeling strategy for the RT-MRI video using a vision-only segmentation approach. We then introduce a multimodal algorithm using audio to improve segmentation of vocal articulators. Together, we set a new benchmark for vocal tract modeling in MRI video segmentation and use this to release labels for a 75-speaker RT-MRI dataset, increasing the amount of labeled public RT-MRI data of the vocal tract by over a factor of 9. The code and dataset labels can be found at rishiraij.github.io/multimodal-mri-avatar/.
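As an illustration of the general idea of audio-guided segmentation (not the authors' architecture), the sketch below fuses a time-aligned audio feature vector with an RT-MRI frame before decoding per-pixel articulator classes; all module names and dimensions are assumptions.

```python
# Minimal sketch (not the paper's model): fusing per-frame audio features
# with RT-MRI video frames for articulator segmentation.
import torch
import torch.nn as nn


class MultimodalSegmenter(nn.Module):
    """Predicts per-pixel articulator classes from an MRI frame plus
    a time-aligned audio feature vector (e.g. from a speech encoder)."""

    def __init__(self, num_classes: int = 4, audio_dim: int = 256):
        super().__init__()
        # Lightweight image encoder for the single-channel MRI frame.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        # Project the audio feature so it can be broadcast over the image grid.
        self.audio_proj = nn.Linear(audio_dim, 64)
        # Decoder maps fused features to per-class logits.
        self.decoder = nn.Conv2d(128, num_classes, 1)

    def forward(self, frame: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        # frame: (B, 1, H, W); audio_feat: (B, audio_dim)
        img = self.image_encoder(frame)                       # (B, 64, H, W)
        aud = self.audio_proj(audio_feat)                     # (B, 64)
        aud = aud[:, :, None, None].expand(-1, -1, img.shape[2], img.shape[3])
        fused = torch.cat([img, aud], dim=1)                  # (B, 128, H, W)
        return self.decoder(fused)                            # (B, num_classes, H, W)


if __name__ == "__main__":
    model = MultimodalSegmenter()
    frame = torch.randn(2, 1, 84, 84)   # RT-MRI frames are typically low resolution
    audio = torch.randn(2, 256)         # one feature vector per video frame
    print(model(frame, audio).shape)    # torch.Size([2, 4, 84, 84])
```

Broadcasting the audio embedding over the image grid is only one simple fusion choice; attention-based fusion is another common option.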
Related papers
- ContextMRI: Enhancing Compressed Sensing MRI through Metadata Conditioning [51.26601171361753]
We propose ContextMRI, a text-conditioned diffusion model for MRI that integrates granular metadata into the reconstruction process.
We show that increasing the fidelity of metadata, ranging from slice location and contrast to patient age, sex, and pathology, systematically boosts reconstruction performance.
arXiv Detail & Related papers (2025-01-08T05:15:43Z)
- Open-Source Manually Annotated Vocal Tract Database for Automatic Segmentation from 3D MRI Using Deep Learning: Benchmarking 2D and 3D Convolutional and Transformer Networks [1.0177118388531323]
Manual segmentation is time intensive and susceptible to errors.
This study aimed to evaluate the efficacy of deep learning algorithms for automatic vocal tract segmentation from 3D MRI.
arXiv Detail & Related papers (2025-01-08T00:19:52Z)
- MRI2Speech: Speech Synthesis from Articulatory Movements Recorded by Real-time MRI [23.54023878857057]
We introduce a novel approach that adapts the multi-modal self-supervised AV-HuBERT model for text prediction from rtMRI.
The predicted text and durations are then used by a speech decoder to synthesize aligned speech in any novel voice.
Our method achieves a 15.18% Word Error Rate (WER) on the USC-TIMIT MRI corpus, a substantial improvement over the current state of the art.
arXiv Detail & Related papers (2024-12-25T08:49:43Z)
- MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities [59.61465292965639]
This paper investigates a new paradigm for leveraging generative models in medical applications.
We propose a diffusion-based data engine, termed MRGen, which enables generation conditioned on text prompts and masks.
arXiv Detail & Related papers (2024-12-04T16:34:22Z)
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
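The frame-level joint space can be pictured with a generic contrastive alignment objective; the sketch below is an InfoNCE-style loss between time-aligned text and speech embeddings, not the exact VQ-CTAP objective (which, per its name, additionally involves quantization).

```python
# Generic frame-level contrastive alignment sketch (InfoNCE-style), shown only to
# illustrate pulling paired text and speech frame embeddings into a joint space.
import torch
import torch.nn.functional as F


def frame_contrastive_loss(text_emb: torch.Tensor,
                           speech_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    # text_emb, speech_emb: (T, D) time-aligned frame embeddings.
    text_emb = F.normalize(text_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)
    logits = text_emb @ speech_emb.t() / temperature   # (T, T) similarity matrix
    targets = torch.arange(text_emb.shape[0])          # frame i matches frame i
    # Symmetric loss: text-to-speech and speech-to-text retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    t, s = torch.randn(50, 192), torch.randn(50, 192)
    print(frame_contrastive_loss(t, s).item())
```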
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- MindFormer: Semantic Alignment of Multi-Subject fMRI for Brain Decoding [50.55024115943266]
We introduce a novel semantic alignment method for multi-subject fMRI signals, called MindFormer.
This model is specifically designed to generate fMRI-conditioned feature vectors that can be used to condition a Stable Diffusion model for fMRI-to-image generation or a large language model (LLM) for fMRI-to-text generation.
Our experimental results demonstrate that MindFormer generates semantically consistent images and text across different subjects.
arXiv Detail & Related papers (2024-05-28T00:36:25Z)
- Improve Cross-Modality Segmentation by Treating MRI Images as Inverted CT Scans [0.4867169878981935]
We show that a simple image inversion technique can significantly improve the segmentation quality of CT segmentation models on MRI data.
Image inversion is straightforward to implement and does not require dedicated graphics processing units (GPUs).
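A minimal sketch of the inversion step is given below, under the assumption of a simple percentile-based intensity window; the paper's exact preprocessing may differ.

```python
# Sketch of the intensity-inversion idea: flip MRI intensities so structures
# roughly mimic their appearance in CT before running a CT-trained segmentation
# model. The percentile window and [0, 1] range are assumptions.
import numpy as np


def invert_mri_like_ct(mri: np.ndarray) -> np.ndarray:
    """Rescale an MRI volume to [0, 1] and invert its intensities."""
    lo, hi = np.percentile(mri, [1, 99])               # robust intensity window
    scaled = np.clip((mri - lo) / (hi - lo + 1e-8), 0.0, 1.0)
    return 1.0 - scaled                                 # inverted volume for the CT model


if __name__ == "__main__":
    volume = np.random.rand(16, 128, 128).astype(np.float32)
    inverted = invert_mri_like_ct(volume)
    print(inverted.min(), inverted.max())
```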
arXiv Detail & Related papers (2024-05-04T14:02:52Z)
- NeuroPictor: Refining fMRI-to-Image Reconstruction via Multi-individual Pretraining and Multi-level Modulation [55.51412454263856]
This paper proposes to directly modulate the generation process of diffusion models using fMRI signals.
By training with about 67,000 fMRI-image pairs from various individuals, our model enjoys superior fMRI-to-image decoding capacity.
arXiv Detail & Related papers (2024-03-27T02:42:52Z)
- SegmentAnyBone: A Universal Model that Segments Any Bone at Any Location on MRI [13.912230325828943]
We propose a versatile, publicly available deep-learning model for bone segmentation in MRI across multiple standard MRI locations.
The proposed model can operate in two modes: fully automated segmentation and prompt-based segmentation.
Our contributions include collecting and annotating a new MRI dataset across various MRI protocols, encompassing over 300 annotated volumes and 8485 annotated slices across diverse anatomic regions.
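The two operating modes can be pictured as a single interface that accepts optional point prompts; the class below is a hypothetical placeholder (not the SegmentAnyBone API) meant only to show the automatic-versus-prompted control flow.

```python
# Hypothetical two-mode segmentation interface (illustrative names, toy logic).
from typing import Optional, Sequence, Tuple
import numpy as np


class TwoModeSegmenter:
    def segment(self,
                image: np.ndarray,
                points: Optional[Sequence[Tuple[int, int]]] = None) -> np.ndarray:
        """Return a binary mask; run fully automatic mode when `points` is None,
        otherwise condition the prediction on the user-provided point prompts."""
        if points is None:
            return self._automatic(image)
        return self._prompted(image, points)

    def _automatic(self, image: np.ndarray) -> np.ndarray:
        # Placeholder: a threshold stands in for the learned automatic mode.
        return (image > image.mean()).astype(np.uint8)

    def _prompted(self, image: np.ndarray, points) -> np.ndarray:
        # Placeholder: grow a small region around each prompt point.
        mask = np.zeros(image.shape, dtype=np.uint8)
        for r, c in points:
            mask[max(r - 5, 0):r + 5, max(c - 5, 0):c + 5] = 1
        return mask


if __name__ == "__main__":
    seg = TwoModeSegmenter()
    slice_2d = np.random.rand(256, 256)
    print(seg.segment(slice_2d).sum(), seg.segment(slice_2d, points=[(100, 120)]).sum())
```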
arXiv Detail & Related papers (2024-01-23T18:59:25Z)
- Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation [48.723504098917324]
We propose a Unify, Align and then Refine (UAR) approach to learn multi-level cross-modal alignments.
We introduce three novel modules: Latent Space Unifier, Cross-modal Representation Aligner and Text-to-Image Refiner.
Experiments and analyses on IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of our UAR against varied state-of-the-art methods.
arXiv Detail & Related papers (2023-03-28T12:42:12Z)
- Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention Guided Heterogeneous Translator [12.685817926272161]
We develop an end-to-end deep learning framework to translate from a sequence of tagged-MRI to its corresponding audio waveform with limited dataset size.
Our framework is based on a novel fully convolutional asymmetry translator guided by a self residual attention strategy.
Our experiments, carried out with a total of 63 tagged-MRI sequences alongside speech acoustics, showed that our framework can generate clear audio waveforms.
arXiv Detail & Related papers (2022-06-05T23:08:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.