Improving Ultrasound Tongue Image Reconstruction from Lip Images Using
Self-supervised Learning and Attention Mechanism
- URL: http://arxiv.org/abs/2106.11769v1
- Date: Sun, 20 Jun 2021 10:51:23 GMT
- Title: Improving Ultrasound Tongue Image Reconstruction from Lip Images Using
Self-supervised Learning and Attention Mechanism
- Authors: Haiyang Liu, Jihan Zhang
- Abstract summary: Given an observable image sequence of the lips, can we picture the corresponding tongue motion?
We formulate this problem as a self-supervised learning task and employ a two-stream convolutional network and a long short-term memory (LSTM) network, together with an attention mechanism.
The results show that our model generates images close to the real ultrasound tongue images and establishes a match between the two imaging modalities.
- Score: 1.52292571922932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech production is a dynamic process involving multiple human organs,
including the tongue, jaw, and lips. Modeling the dynamics of vocal tract
deformation is a fundamental problem in understanding speech, the most common
mode of daily human communication. Researchers employ several sensory streams
to describe the process simultaneously; these streams are statistically
related to one another. In this paper, we address the following question:
given an observable image sequence of the lips, can we picture the
corresponding tongue motion? We formulate this problem as a self-supervised
learning task and employ a two-stream convolutional network and a long
short-term memory (LSTM) network, together with an attention mechanism. We
evaluate the proposed method by using unlabeled lip videos to predict an
upcoming ultrasound tongue image sequence. The results show that our model
generates images close to the real ultrasound tongue images and establishes a
match between the two imaging modalities.
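The pipeline described above, a two-stream encoder over lip frames, an LSTM, and an attention step that lets the decoder weight encoder time steps, can be sketched at its core in NumPy. The paper does not specify its exact attention formulation; scaled dot-product attention, the function name `attend`, and the toy feature dimensions below are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Minimal sketch of the attention step: decoder hidden states (queries)
# attend over per-frame lip features (keys/values) pooled from the
# two-stream encoder. Scaled dot-product attention is assumed here.

def attend(queries, keys, values):
    """queries: (T_q, d); keys, values: (T_k, d). Returns (context, weights)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (T_q, T_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over lip frames
    return weights @ values, weights

rng = np.random.default_rng(0)
lip_feats = rng.standard_normal((5, 8))   # 5 lip frames, 8-dim fused features
dec_states = rng.standard_normal((3, 8))  # 3 decoder steps (tongue frames)
context, weights = attend(dec_states, lip_feats, lip_feats)
print(context.shape, weights.shape)       # (3, 8) (3, 5)
```

Each row of `weights` is a distribution over lip frames, so every predicted tongue frame is a learned mixture of the encoder's per-frame features rather than a fixed last-state summary.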
Related papers
- Speech2rtMRI: Speech-Guided Diffusion Model for Real-time MRI Video of the Vocal Tract during Speech [29.510756530126837]
We introduce a data-driven method to visually represent articulator motion in MRI videos of the human vocal tract during speech.
We leverage large pre-trained speech models, which are embedded with prior knowledge, to generalize the visual domain to unseen data.
arXiv Detail & Related papers (2024-09-23T20:19:24Z) - Autoregressive Sequence Modeling for 3D Medical Image Representation [48.706230961589924]
We introduce a pioneering method for learning 3D medical image representations through an autoregressive sequence pre-training framework.
Our approach organizes various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence.
arXiv Detail & Related papers (2024-09-13T10:19:10Z) - High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish the less ambiguous mapping from audio to landmark motion of lip and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z) - From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z) - Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a
Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z) - TExplain: Explaining Learned Visual Features via Pre-trained (Frozen) Language Models [14.019349267520541]
We propose a novel method that leverages the capabilities of language models to interpret the learned features of pre-trained image classifiers.
Our approach generates a vast number of sentences to explain the features learned by the classifier for a given image.
Our method, for the first time, utilizes the frequent words from these sentences, which correspond to a visual representation, to provide insights into the decision-making process.
arXiv Detail & Related papers (2023-09-01T20:59:46Z) - Multimodal Neurons in Pretrained Text-Only Transformers [52.20828443544296]
We identify "multimodal neurons" that convert visual representations into corresponding text.
We show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning.
arXiv Detail & Related papers (2023-08-03T05:27:12Z) - Robust One Shot Audio to Video Generation [10.957973845883162]
OneShotA2V is a novel approach to synthesizing a talking-person video of arbitrary length using as input an audio signal and a single unseen image of a person.
OneShotA2V leverages curriculum learning to learn movements of expressive facial components and hence generates a high-quality talking-head video of the given person.
arXiv Detail & Related papers (2020-12-14T10:50:05Z) - Self-supervised Contrastive Video-Speech Representation Learning for
Ultrasound [15.517484333872277]
In medical imaging, manual annotations can be expensive to acquire and sometimes infeasible to access.
We propose to address the problem of self-supervised representation learning with multi-modal ultrasound video-speech raw data.
arXiv Detail & Related papers (2020-08-14T23:58:23Z) - Towards Unsupervised Learning for Instrument Segmentation in Robotic
Surgery with Cycle-Consistent Adversarial Networks [54.00217496410142]
We propose an unpaired image-to-image translation where the goal is to learn the mapping between an input endoscopic image and a corresponding annotation.
Our approach allows training image segmentation models without the need to acquire expensive annotations.
We test our proposed method on Endovis 2017 challenge dataset and show that it is competitive with supervised segmentation methods.
arXiv Detail & Related papers (2020-07-09T01:39:39Z) - Deep Learning for Automatic Tracking of Tongue Surface in Real-time
Ultrasound Videos, Landmarks instead of Contours [0.6853165736531939]
This paper presents a novel approach to automatic, real-time tongue contour tracking using deep neural networks.
In the proposed method, instead of the two-step procedure, landmarks of the tongue surface are tracked.
Our experiments demonstrated the outstanding performance of the proposed technique in terms of generalization, performance, and accuracy.
arXiv Detail & Related papers (2020-03-16T00:38:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.