Convolutional Neural Network-Based Age Estimation Using B-Mode Ultrasound Tongue Image
- URL: http://arxiv.org/abs/2101.11245v1
- Date: Wed, 27 Jan 2021 08:00:47 GMT
- Title: Convolutional Neural Network-Based Age Estimation Using B-Mode Ultrasound Tongue Image
- Authors: Kele Xu and Tamás Gábor Csapó and Ming Feng
- Abstract summary: We explore the feasibility of age estimation using the ultrasound tongue image of the speakers.
Motivated by the success of deep learning, this paper applies it to the task.
The developed method can be used as a tool to evaluate the performance of speech therapy sessions.
- Score: 10.100437437151621
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Ultrasound tongue imaging is widely used for speech production research, and
it has attracted increasing attention as its potential applications seem to be
evident in many different fields, such as the visual biofeedback tool for
second language acquisition and silent speech interface. Unlike previous
studies, here we explore the feasibility of age estimation using the ultrasound
tongue images of the speakers. Motivated by the success of deep learning, this
paper applies it to the task. We train a deep convolutional
neural network model on the UltraSuite dataset. The deep model achieves mean
absolute error (MAE) of 2.03 for the data from typically developing children,
while the MAE is 4.87 for the data from children with speech sound disorders,
which suggests that age estimation using ultrasound is more challenging for
children with speech sound disorders. The developed method can be used as a
tool to evaluate the performance of speech therapy sessions. It is also worth
noting that, although we leverage ultrasound tongue imaging in our study, the
proposed methods may also be extended to other imaging modalities (e.g., MRI)
to assist studies of speech production.
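As a rough illustration of the pipeline described in the abstract, below is a minimal PyTorch sketch of a CNN that regresses speaker age from single B-mode ultrasound frames, trained with L1 loss so the training criterion matches the reported MAE metric. This is not the authors' released code: the layer sizes, input resolution, and placeholder data are all illustrative assumptions, and real use would load frames and ages from the UltraSuite dataset.

```python
# Minimal sketch, not the authors' implementation: a small CNN regressing
# speaker age (in years) from single grayscale ultrasound frames.
import torch
import torch.nn as nn

class AgeRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 1)  # single scalar output: predicted age

    def forward(self, x):
        return self.head(self.features(x).flatten(1)).squeeze(1)

model = AgeRegressor()
criterion = nn.L1Loss()  # L1 loss is exactly the MAE reported in the paper
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Placeholder batch standing in for UltraSuite frames: 8 grayscale images
# at an assumed 64x128 resolution, with illustrative child ages in years.
frames = torch.randn(8, 1, 64, 128)
ages = torch.empty(8).uniform_(5.0, 12.0)

optimizer.zero_grad()
pred = model(frames)
loss = criterion(pred, ages)
loss.backward()
optimizer.step()
print(f"batch MAE: {loss.item():.2f} years")
```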
Related papers
- Exploring Multimodal Approaches for Alzheimer's Disease Detection Using Patient Speech Transcript and Audio Data [10.782153332144533]
Alzheimer's disease (AD) is a common form of dementia that severely impacts patient health.
This study investigates various methods for detecting AD using patients' speech and transcripts data from the DementiaBank Pitt database.
arXiv Detail & Related papers (2023-07-05T12:40:11Z)
- Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
- Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition [55.25565305101314]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems.
This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training.
Experiments conducted on three tasks suggest that systems incorporating the generated articulatory features consistently outperform the baseline TDNN and Conformer ASR systems.
arXiv Detail & Related papers (2022-06-15T07:20:28Z)
- Toward a realistic model of speech processing in the brain with self-supervised learning [67.7130239674153]
Self-supervised algorithms trained on the raw waveform constitute a promising candidate for modelling speech processing in the brain.
We show that Wav2Vec 2.0 learns brain-like representations with as little as 600 hours of unlabelled speech.
arXiv Detail & Related papers (2022-06-03T17:01:46Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Voice-assisted Image Labelling for Endoscopic Ultrasound Classification using Neural Networks [48.732863591145964]
We propose a multi-modal convolutional neural network architecture that labels endoscopic ultrasound (EUS) images from raw verbal comments provided by a clinician during the procedure.
Our results show a prediction accuracy of 76% at image level on a dataset with 5 different labels.
arXiv Detail & Related papers (2021-10-12T21:22:24Z)
- Improving Ultrasound Tongue Image Reconstruction from Lip Images Using Self-supervised Learning and Attention Mechanism [1.52292571922932]
Given an observable image sequence of the lips, can we picture the corresponding tongue motion?
We formulate this as a self-supervised learning problem and employ a two-stream convolutional network and a long short-term memory network, with an attention mechanism, for the learning task (a minimal sketch follows the list below).
The results show that our model can generate images close to the real ultrasound tongue images, achieving a matching between the two imaging modalities.
arXiv Detail & Related papers (2021-06-20T10:51:23Z)
- Self-supervised Contrastive Video-Speech Representation Learning for Ultrasound [15.517484333872277]
In medical imaging, manual annotations can be expensive to acquire and sometimes infeasible to access.
We propose to address the problem of self-supervised representation learning with multi-modal ultrasound video-speech raw data (a contrastive-loss sketch follows the list below).
arXiv Detail & Related papers (2020-08-14T23:58:23Z)
- Ultra2Speech -- A Deep Learning Framework for Formant Frequency Estimation and Tracking from Ultrasound Tongue Images [5.606679908174784]
This work addresses the articulatory-to-acoustic mapping problem based on ultrasound (US) tongue images.
We use a novel deep learning architecture, which we call Ultrasound2Formant (U2F) Net, to map US tongue images from a probe placed beneath a subject's chin to formant values.
arXiv Detail & Related papers (2020-06-29T20:42:11Z)
- Deep Learning for Automatic Tracking of Tongue Surface in Real-time Ultrasound Videos, Landmarks instead of Contours [0.6853165736531939]
This paper presents a novel approach to automatic, real-time tongue contour tracking using deep neural networks.
In the proposed method, instead of the usual two-step procedure, landmarks of the tongue surface are tracked directly (see the sketch after this list).
Our experiments demonstrate the outstanding generalization, performance, and accuracy of the proposed technique.
arXiv Detail & Related papers (2020-03-16T00:38:13Z)
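As referenced in the entry above on reconstructing ultrasound tongue images from lip images, here is a minimal sketch of that style of architecture: a per-frame CNN encoder, an LSTM over the frame sequence, additive attention pooling over time, and a transposed-convolution decoder that emits a tongue-image-like output. The paper uses a two-stream convolutional network; for brevity this sketch collapses the streams into a single frame encoder, and every size here is an illustrative assumption.

```python
# Minimal sketch, not the authors' code: CNN frame encoder -> LSTM ->
# additive attention pooling -> deconvolutional decoder.
import torch
import torch.nn as nn

class Lip2Tongue(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(  # per-frame lip encoder
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)  # additive attention scores
        self.decoder = nn.Sequential(     # pooled state -> tongue image
            nn.Linear(hidden, 32 * 8 * 16), nn.ReLU(),
            nn.Unflatten(1, (32, 8, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, lips):                     # lips: (B, T, 1, H, W)
        b, t = lips.shape[:2]
        feats = self.encoder(lips.flatten(0, 1)).view(b, t, -1)
        hs, _ = self.lstm(feats)                  # (B, T, hidden)
        w = torch.softmax(self.attn(hs), dim=1)   # attention weights over T
        ctx = (w * hs).sum(dim=1)                 # weighted pooling over time
        return self.decoder(ctx)                  # (B, 1, 32, 64)

model = Lip2Tongue()
lips = torch.randn(4, 10, 1, 32, 64)  # 4 sequences of 10 lip frames (assumed)
tongue = model(lips)
print(tongue.shape)                   # torch.Size([4, 1, 32, 64])
```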
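Similarly, for the self-supervised contrastive video-speech entry, a minimal sketch of one contrastive objective that fits the description (an assumption; the paper's exact encoders and losses may differ): toy encoders map ultrasound video clips and raw speech into a shared space, and an InfoNCE-style loss pulls matching video-speech pairs together within a batch.

```python
# Minimal sketch, not the paper's method: InfoNCE-style contrastive loss
# between ultrasound-video and raw-speech embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

video_enc = nn.Sequential(nn.Flatten(), nn.Linear(10 * 32 * 64, 128))  # toy encoder
audio_enc = nn.Sequential(nn.Flatten(), nn.Linear(16000, 128))          # toy encoder

video = torch.randn(8, 10, 32, 64)  # 8 clips of 10 ultrasound frames (assumed)
audio = torch.randn(8, 16000)       # 8 one-second waveforms at 16 kHz (assumed)

v = F.normalize(video_enc(video), dim=1)
a = F.normalize(audio_enc(audio), dim=1)
logits = v @ a.t() / 0.07                 # cosine similarities with temperature
targets = torch.arange(8)                 # the i-th video matches the i-th audio
loss = F.cross_entropy(logits, targets)   # InfoNCE: pull matching pairs together
print(loss.item())
```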
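Finally, the sketch referenced in the landmark-tracking entry: instead of segmenting the tongue and then extracting a contour, a CNN directly regresses normalized (x, y) coordinates for K tongue-surface landmarks per frame. The network, the value of K, and the frame size are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch, an assumption rather than the paper's code: direct
# landmark-coordinate regression from a single ultrasound frame.
import torch
import torch.nn as nn

K = 10  # number of tongue-surface landmarks (illustrative)

landmark_net = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2 * K), nn.Sigmoid(),  # normalized (x, y) in [0, 1]
)

frame = torch.randn(1, 1, 64, 128)          # one B-mode frame (assumed size)
points = landmark_net(frame).view(1, K, 2)  # K landmarks per frame
print(points.shape)                          # torch.Size([1, 10, 2])
```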
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.