Multimodal Segmentation for Vocal Tract Modeling
- URL: http://arxiv.org/abs/2406.15754v1
- Date: Sat, 22 Jun 2024 06:44:38 GMT
- Title: Multimodal Segmentation for Vocal Tract Modeling
- Authors: Rishi Jain, Bohan Yu, Peter Wu, Tejas Prabhune, Gopala Anumanchipalli
- Abstract summary: Real-time magnetic resonance imaging (RT-MRI) allows measuring precise movements of internal articulators during speech.
We first present a deep labeling strategy for the RT-MRI video using a vision-only segmentation approach.
We then introduce a multimodal algorithm using audio to improve segmentation of vocal articulators.
- Score: 4.95865031722089
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate modeling of the vocal tract is necessary to construct articulatory representations for interpretable speech processing and linguistics. However, vocal tract modeling is challenging because many internal articulators are occluded from external motion capture technologies. Real-time magnetic resonance imaging (RT-MRI) allows measuring precise movements of internal articulators during speech, but annotated datasets of MRI are limited in size due to time-consuming and computationally expensive labeling methods. We first present a deep labeling strategy for the RT-MRI video using a vision-only segmentation approach. We then introduce a multimodal algorithm using audio to improve segmentation of vocal articulators. Together, we set a new benchmark for vocal tract modeling in MRI video segmentation and use this to release labels for a 75-speaker RT-MRI dataset, increasing the amount of labeled public RT-MRI data of the vocal tract by over a factor of 9. The code and dataset labels can be found at https://rishiraij.github.io/multimodal-mri-avatar/.
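As a rough illustration of the paper's multimodal idea, the sketch below fuses a per-frame audio embedding into a convolutional frame segmenter. This is a minimal sketch, not the authors' architecture: the module shapes, the concatenation-based fusion, and all names are illustrative assumptions.

```python
# Hypothetical audio-conditioned frame segmenter. NOT the paper's model;
# the fusion scheme (channel concatenation) and all shapes are assumptions.
import torch
import torch.nn as nn

class AudioConditionedSegmenter(nn.Module):
    def __init__(self, n_classes: int = 8, audio_dim: int = 128):
        super().__init__()
        # Vision encoder: downsample each RT-MRI frame to a feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Project the per-frame audio embedding to the visual channel width.
        self.audio_proj = nn.Linear(audio_dim, 64)
        # Decoder: upsample fused features back to a per-pixel label map.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, n_classes, 4, stride=2, padding=1),
        )

    def forward(self, frame: torch.Tensor, audio_emb: torch.Tensor):
        # frame: (B, 1, H, W) RT-MRI frame; audio_emb: (B, audio_dim)
        feat = self.encoder(frame)                          # (B, 64, H/4, W/4)
        a = self.audio_proj(audio_emb)[:, :, None, None]    # (B, 64, 1, 1)
        a = a.expand(-1, -1, feat.shape[2], feat.shape[3])  # broadcast spatially
        fused = torch.cat([feat, a], dim=1)                 # (B, 128, H/4, W/4)
        return self.decoder(fused)                          # (B, n_classes, H, W)

model = AudioConditionedSegmenter()
logits = model(torch.randn(2, 1, 84, 84), torch.randn(2, 128))
print(logits.shape)  # torch.Size([2, 8, 84, 84])
```

The intuition is that acoustics constrain which articulator configurations are plausible in a given frame, so conditioning the visual decoder on audio can help disambiguate occluded or low-contrast boundaries.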
Related papers
- A multimodal LLM for the non-invasive decoding of spoken text from brain recordings [0.4187344935012482]
We propose an end-to-end multimodal LLM for decoding spoken text from fMRI signals.
The proposed architecture is founded on an encoder derived from a specific transformer, incorporating an augmented embedding layer and an attention mechanism better adjusted than that present in the state of the art.
A benchmark is performed on a corpus consisting of human-human and human-robot interactions where fMRI and conversational signals are recorded synchronously.
arXiv Detail & Related papers (2024-09-29T14:03:39Z) - VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
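For intuition only, a generic frame-level cross-modal contrastive objective in the spirit of token-acoustic pre-training is sketched below; this InfoNCE-style loss and its parameters are stand-ins, not the exact VQ-CTAP objective.

```python
# Generic frame-level cross-modal contrastive loss (InfoNCE style).
# A stand-in for illustration, not VQ-CTAP's exact objective.
import torch
import torch.nn.functional as F

def frame_contrastive_loss(text_emb, speech_emb, temperature=0.07):
    # text_emb, speech_emb: (N, D) time-aligned frame-level embeddings.
    t = F.normalize(text_emb, dim=-1)
    s = F.normalize(speech_emb, dim=-1)
    logits = t @ s.T / temperature         # (N, N) similarity matrix
    targets = torch.arange(t.shape[0])     # matched frames lie on the diagonal
    # Symmetric loss: text-to-speech and speech-to-text retrieval.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = frame_contrastive_loss(torch.randn(32, 256), torch.randn(32, 256))
```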
arXiv Detail & Related papers (2024-08-11T12:24:23Z) - MindFormer: Semantic Alignment of Multi-Subject fMRI for Brain Decoding [50.55024115943266]
We introduce a novel semantic alignment method of multi-subject fMRI signals using so-called MindFormer.
This model is specifically designed to generate fMRI-conditioned feature vectors that can be used for conditioning a Stable Diffusion model for fMRI-to-image generation or a large language model (LLM) for fMRI-to-text generation.
Our experimental results demonstrate that MindFormer generates semantically consistent images and text across different subjects.
arXiv Detail & Related papers (2024-05-28T00:36:25Z) - Improve Cross-Modality Segmentation by Treating MRI Images as Inverted CT Scans [0.4867169878981935]
We show that a simple image inversion technique can significantly improve the segmentation quality of CT segmentation models on MRI data.
Image inversion is straightforward to implement and does not require dedicated graphics processing units (GPUs).
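A minimal sketch of such an inversion, assuming a simple min-max normalization (the paper's exact preprocessing may differ):

```python
# Illustrative intensity inversion so a CT-trained segmentation model can
# be applied to MRI. The min-max normalization is an assumption, not
# necessarily the paper's exact preprocessing.
import numpy as np

def invert_mri(volume: np.ndarray) -> np.ndarray:
    """Flip MRI intensities so bright and dark structures roughly swap,
    mimicking CT-like contrast."""
    v = volume.astype(np.float32)
    v = (v - v.min()) / (v.max() - v.min() + 1e-8)  # normalize to [0, 1]
    return 1.0 - v                                   # invert contrast

mri = np.random.rand(16, 64, 64).astype(np.float32)  # dummy volume
ct_like = invert_mri(mri)
```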
arXiv Detail & Related papers (2024-05-04T14:02:52Z) - NeuroPictor: Refining fMRI-to-Image Reconstruction via Multi-individual Pretraining and Multi-level Modulation [55.51412454263856]
This paper proposes to directly modulate the generation process of diffusion models using fMRI signals.
By training with about 67,000 fMRI-image pairs from various individuals, our model enjoys superior fMRI-to-image decoding capacity.
arXiv Detail & Related papers (2024-03-27T02:42:52Z) - SegmentAnyBone: A Universal Model that Segments Any Bone at Any Location on MRI [13.912230325828943]
We propose a versatile, publicly available deep-learning model for bone segmentation in MRI across multiple standard MRI locations.
The proposed model can operate in two modes: fully automated segmentation and prompt-based segmentation.
Our contributions include collecting and annotating a new MRI dataset across various MRI protocols, encompassing over 300 annotated volumes and 8485 annotated slices across diverse anatomic regions.
arXiv Detail & Related papers (2024-01-23T18:59:25Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning [53.00683059396803]
Masked image modeling (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images.
We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for optimal image masking ratio and masking strategy.
Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
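As a deliberately simplified stand-in for the paper's RL formulation, the toy bandit-style loop below searches over candidate masking ratios; the candidate set and the reward function are made up for illustration.

```python
# Toy epsilon-greedy search over MIM masking ratios. The paper trains a
# full RL agent; this bandit loop and the dummy reward are simplifications.
import random

ratios = [0.25, 0.5, 0.75]           # hypothetical candidate masking ratios
value = {r: 0.0 for r in ratios}     # running reward estimate per ratio
count = {r: 0 for r in ratios}

def reward(ratio: float) -> float:
    # Stand-in for a real signal such as downstream segmentation quality.
    return 1.0 - abs(ratio - 0.6) + random.gauss(0, 0.05)

for step in range(200):
    if random.random() < 0.1:                        # explore
        r = random.choice(ratios)
    else:                                            # exploit
        r = max(ratios, key=lambda x: value[x])
    count[r] += 1
    value[r] += (reward(r) - value[r]) / count[r]    # incremental mean

print(max(ratios, key=lambda x: value[x]))  # best-scoring ratio
```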
arXiv Detail & Related papers (2023-10-06T10:40:46Z) - Explainable unsupervised multi-modal image registration using deep networks [2.197364252030876]
MRI image registration aims to geometrically 'pair' diagnoses from different modalities, time points and slices.
In this work, we show that our DL model becomes fully explainable, setting the framework to generalise our approach on further medical imaging data.
arXiv Detail & Related papers (2023-08-03T19:13:48Z) - Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation [48.723504098917324]
We propose an Unify, Align and then Refine (UAR) approach to learn multi-level cross-modal alignments.
We introduce three novel modules: Latent Space Unifier, Cross-modal Representation Aligner and Text-to-Image Refiner.
Experiments and analyses on IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of our UAR against varied state-of-the-art methods.
arXiv Detail & Related papers (2023-03-28T12:42:12Z) - Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention Guided Heterogeneous Translator [12.685817926272161]
We develop an end-to-end deep learning framework to translate from a sequence of tagged-MRI to its corresponding audio waveform with limited dataset size.
Our framework is based on a novel fully convolutional asymmetry translator with guidance of a self residual attention strategy.
Our experimental results, carried out with a total of 63 tagged-MRI sequences alongside speech acoustics, showed that our framework enabled the generation of clear audio waveforms.
arXiv Detail & Related papers (2022-06-05T23:08:34Z)