MRI2Speech: Speech Synthesis from Articulatory Movements Recorded by Real-time MRI
- URL: http://arxiv.org/abs/2412.18836v2
- Date: Fri, 17 Jan 2025 12:18:44 GMT
- Title: MRI2Speech: Speech Synthesis from Articulatory Movements Recorded by Real-time MRI
- Authors: Neil Shah, Ayan Kashyap, Shirish Karande, Vineet Gandhi
- Abstract summary: We introduce a novel approach that adapts the multi-modal self-supervised AV-HuBERT model for text prediction from rtMRI.
The predicted text and durations are then used by a speech decoder to synthesize aligned speech in any novel voice.
Our method achieves a $15.18\%$ Word Error Rate (WER) on the USC-TIMIT MRI corpus, a substantial improvement over the current state-of-the-art.
- Score: 23.54023878857057
- Abstract: Previous real-time MRI (rtMRI)-based speech synthesis models depend heavily on noisy ground-truth speech. Applying loss directly over ground truth mel-spectrograms entangles speech content with MRI noise, resulting in poor intelligibility. We introduce a novel approach that adapts the multi-modal self-supervised AV-HuBERT model for text prediction from rtMRI and incorporates a new flow-based duration predictor for speaker-specific alignment. The predicted text and durations are then used by a speech decoder to synthesize aligned speech in any novel voice. We conduct thorough experiments on two datasets and demonstrate our method's generalization ability to unseen speakers. We assess our framework's performance by masking parts of the rtMRI video to evaluate the impact of different articulators on text prediction. Our method achieves a $15.18\%$ Word Error Rate (WER) on the USC-TIMIT MRI corpus, marking a huge improvement over the current state-of-the-art. Speech samples are available at https://mri2speech.github.io/MRI2Speech/
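The abstract describes a three-stage pipeline: an adapted AV-HuBERT encoder predicts text from rtMRI video, a flow-based duration predictor supplies speaker-specific alignment, and a speech decoder renders the aligned speech in a target voice. The sketch below is a minimal structural illustration of that flow, assuming simplified stand-in modules; the names, dimensions, and internals are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TextPredictor(nn.Module):
    """Stand-in for the AV-HuBERT video encoder fine-tuned for text prediction."""
    def __init__(self, vocab_size=32, dim=256):
        super().__init__()
        # Per-frame featurizer: a real system would use the pre-trained
        # AV-HuBERT transformer here, not a linear layer.
        self.encode = nn.Sequential(nn.Flatten(2), nn.LazyLinear(dim), nn.ReLU())
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, video):                  # video: (B, T, H, W) rtMRI frames
        feats = self.encode(video)             # (B, T, dim) per-frame features
        return self.head(feats)                # (B, T, vocab) token logits

class DurationPredictor(nn.Module):
    """Stand-in for the flow-based duration predictor (speaker-specific alignment)."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.LazyLinear(dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, text_feats):
        # exp() keeps predicted per-token frame counts positive
        return self.proj(text_feats).squeeze(-1).exp()

# Inference order: text and durations are predicted from rtMRI, then handed to
# a speech decoder that synthesizes aligned speech in a novel voice (not shown).
video = torch.randn(1, 50, 68, 68)             # 50 rtMRI frames
text_logits = TextPredictor()(video)
durations = DurationPredictor()(text_logits)
print(text_logits.shape, durations.shape)      # (1, 50, 32) and (1, 50)
```

The reported $15.18\%$ WER is the standard word-level edit-distance metric. A quick sanity check of the metric itself with the jiwer library (the paper's exact text normalization and evaluation script are not given here, so this is only illustrative):

```python
import jiwer

reference = "please call stella ask her to bring these things"
hypothesis = "please call stella asked her to bring this things"

# 2 substitutions over 9 reference words -> WER of about 22%
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```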
Related papers
- Long-Form Speech Generation with Spoken Language Models [64.29591880693468]
SpeechSSM learns from and samples long-form spoken audio in a single decoding session without text intermediates.
It also introduces new embedding-based and LLM-judged metrics, quality measurements over length and time, and LibriSpeech-Long, a new benchmark for long-form speech processing and generation.
arXiv Detail & Related papers (2024-12-24T18:56:46Z)
- Speech2rtMRI: Speech-Guided Diffusion Model for Real-time MRI Video of the Vocal Tract during Speech [29.510756530126837]
We introduce a data-driven method to visually represent articulator motion in MRI videos of the human vocal tract during speech.
We leverage large pre-trained speech models, which encode prior knowledge, to generalize to unseen data in the visual domain.
arXiv Detail & Related papers (2024-09-23T20:19:24Z)
- Multimodal Segmentation for Vocal Tract Modeling [4.95865031722089]
Real-time magnetic resonance imaging (RT-MRI) allows measuring precise movements of internal articulators during speech.
We first present a deep labeling strategy for the RT-MRI video using a vision-only segmentation approach.
We then introduce a multimodal algorithm using audio to improve segmentation of vocal articulators.
arXiv Detail & Related papers (2024-06-22T06:44:38Z)
- Toward Joint Language Modeling for Speech Units and Text [89.32163954508489]
We explore joint language modeling for speech units and text.
We introduce automatic metrics to evaluate how well the joint LM mixes speech and text.
Our results show that by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on SLU tasks.
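One way to picture the mixing: discrete speech units are folded into the text vocabulary as extra token ids, so a single LM can train on sequences that switch modality mid-utterance. The offset scheme and toy ids below are illustrative assumptions, not the paper's exact recipe.

```python
TEXT_VOCAB_SIZE = 10_000   # e.g. BPE text tokens occupy ids [0, 10000)
NUM_SPEECH_UNITS = 500     # e.g. k-means cluster ids over SSL speech features

def unit_to_token(unit_id: int) -> int:
    """Map a discrete speech unit into the shared vocabulary, after the text range."""
    assert 0 <= unit_id < NUM_SPEECH_UNITS
    return TEXT_VOCAB_SIZE + unit_id

def mix_sequence(text_ids, unit_ids, boundary):
    """Text prefix up to a word boundary, then the spoken continuation as units."""
    return text_ids[:boundary] + [unit_to_token(u) for u in unit_ids]

text_ids = [17, 942, 8, 3051]        # hypothetical text token ids
unit_ids = [73, 73, 210, 4, 4, 391]  # hypothetical speech unit ids
print(mix_sequence(text_ids, unit_ids, boundary=2))
# -> [17, 942, 10073, 10073, 10210, 10004, 10004, 10391]
```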
arXiv Detail & Related papers (2023-10-12T20:53:39Z)
- RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations [13.995231731152462]
We propose RobustL2S, a modularized framework for Lip-to-Speech synthesis.
A non-autoregressive sequence-to-sequence model maps self-supervised visual features to a representation of disentangled speech content.
A vocoder then converts the speech features into raw waveforms.
arXiv Detail & Related papers (2023-07-03T09:13:57Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Learning Speaker-specific Lip-to-Speech Generation [28.620557933595585]
This work aims to understand the correlation/mapping between speech and the sequence of lip movements of individual speakers.
We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with input lip movements.
We train our model on the Grid and Lip2Wav Chemistry lecture datasets to evaluate single-speaker natural speech generation.
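As a rough sketch of the synchronization objective: deep metric learning here typically pulls embeddings of temporally aligned lip-video and speech windows together while pushing time-shifted pairs apart. The hinge-style contrastive loss below is a generic stand-in for that idea, an assumption about the flavor of metric learning used rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def sync_loss(video_emb, audio_emb, margin=0.2):
    """video_emb, audio_emb: (T, D) embeddings of aligned time windows."""
    pos = F.cosine_similarity(video_emb, audio_emb)             # aligned pairs
    neg = F.cosine_similarity(video_emb, audio_emb.roll(3, 0))  # time-shifted pairs
    return F.relu(margin - pos + neg).mean()                    # hinge on the gap

print(sync_loss(torch.randn(10, 64), torch.randn(10, 64)))
```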
arXiv Detail & Related papers (2022-06-04T19:40:02Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, achieving a speedup of up to 21.4x over the autoregressive technique.
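Mask-predict style non-autoregressive decoding generally predicts all target units in parallel, then re-masks and re-predicts the least-confident positions over a few refinement rounds. The loop below is a generic sketch of that scheme with a hypothetical stand-in model, not TranSpeech's actual decoder.

```python
import torch

def mask_predict(model, src, tgt_len, num_iters=4, mask_id=0):
    units = torch.full((tgt_len,), mask_id)      # start fully masked
    for t in range(num_iters):
        logits = model(src, units)               # (tgt_len, vocab), one parallel pass
        scores, units = logits.softmax(-1).max(-1)
        n_mask = int(tgt_len * (num_iters - t - 1) / num_iters)
        if n_mask == 0:
            break
        low_conf = scores.topk(n_mask, largest=False).indices
        units[low_conf] = mask_id                # redo least-confident positions
    return units

# Demo with a random stand-in predictor over a 100-unit vocabulary:
dummy = lambda src, units: torch.randn(units.shape[0], 100)
print(mask_predict(dummy, src=None, tgt_len=8))
```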
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Silent Speech and Emotion Recognition from Vocal Tract Shape Dynamics in Real-Time MRI [9.614694312155798]
We propose a novel deep neural network-based learning framework that understands acoustic information in the variable-length sequence of vocal tract shaping during speech production.
The proposed framework comprises convolutions, a recurrent network, and a connectionist temporal classification (CTC) loss, trained entirely end-to-end.
To the best of our knowledge, this is the first study to demonstrate recognition of entire spoken sentences from an individual's articulatory motions captured by rtMRI video.
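A minimal PyTorch sketch of the described recipe (per-frame convolutional features, a recurrent network over the variable-length sequence, and a CTC loss over character targets) is below; the dimensions and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class RtMRISentenceRecognizer(nn.Module):
    def __init__(self, num_chars=28, dim=128):    # 26 letters + space + CTC blank
        super().__init__()
        self.conv = nn.Sequential(                # per-frame spatial features
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.rnn = nn.GRU(32 * 16, dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * dim, num_chars)

    def forward(self, video):                     # (B, T, H, W) rtMRI frames
        b, t, h, w = video.shape
        f = self.conv(video.reshape(b * t, 1, h, w)).reshape(b, t, -1)
        f, _ = self.rnn(f)                        # temporal context over frames
        return self.head(f).log_softmax(-1)       # (B, T, num_chars)

model = RtMRISentenceRecognizer()
video = torch.randn(2, 40, 68, 68)                # two 40-frame clips
log_probs = model(video).transpose(0, 1)          # CTC expects (T, B, C)
targets = torch.randint(1, 28, (2, 12))           # dummy character ids (0 = blank)
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((2,), 40),
                           target_lengths=torch.full((2,), 12))
print(loss.item())
```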
arXiv Detail & Related papers (2021-06-16T11:20:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.