Adaptation of Tongue Ultrasound-Based Silent Speech Interfaces Using
Spatial Transformer Networks
- URL: http://arxiv.org/abs/2305.19130v3
- Date: Tue, 17 Oct 2023 08:01:34 GMT
- Title: Adaptation of Tongue Ultrasound-Based Silent Speech Interfaces Using
Spatial Transformer Networks
- Authors: László Tóth, Amin Honarmandi Shandiz, Gábor Gosztolya, Csapó Tamás Gábor
- Abstract summary: Silent speech interfaces (SSI) are able to synthesize intelligible speech from articulatory movement data under certain conditions.
The resulting models are speaker-specific, making a quick switch between users troublesome.
We extend our deep networks with a spatial transformer network (STN) module, capable of performing an affine transformation on the input images.
- Score: 0.24466725954625895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Thanks to the latest deep learning algorithms, silent speech interfaces (SSI)
are now able to synthesize intelligible speech from articulatory movement data
under certain conditions. However, the resulting models are rather
speaker-specific, making a quick switch between users troublesome. Even for the
same speaker, these models perform poorly cross-session, i.e. after dismounting
and re-mounting the recording equipment. To aid quick speaker and session
adaptation of ultrasound tongue imaging-based SSI models, we extend our deep
networks with a spatial transformer network (STN) module, capable of performing
an affine transformation on the input images. Although the STN part takes up
only about 10% of the network, our experiments show that adapting just the STN
module may reduce the MSE by 88% on average, compared to retraining
the whole network. The improvement is even larger (around 92%) when adapting
the network to different recording sessions from the same speaker.
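To make the adaptation idea concrete, the sketch below shows a minimal STN front-end in PyTorch: a small localization network predicts a 2x3 affine matrix that warps each ultrasound frame before it enters the synthesis network, and only the STN parameters are unfrozen during speaker or session adaptation. The layer sizes, the helper name adapt_stn_only, and the assumption that the full model exposes its STN as model.stn are illustrative choices, not the authors' exact implementation.

    # Minimal STN front-end sketch for ultrasound tongue images (PyTorch).
    # Layer sizes and names are illustrative assumptions, not the paper's exact model.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class STNFrontEnd(nn.Module):
        """Predicts a 2x3 affine matrix and warps the input image with it."""

        def __init__(self):
            super().__init__()
            # Small localization network operating on single-channel ultrasound frames.
            self.localization = nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
                nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)),
            )
            self.fc_loc = nn.Sequential(
                nn.Linear(10 * 4 * 4, 32), nn.ReLU(), nn.Linear(32, 6)
            )
            # Start from the identity transform so the untrained STN is a no-op.
            self.fc_loc[-1].weight.data.zero_()
            self.fc_loc[-1].bias.data.copy_(
                torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float)
            )

        def forward(self, x):  # x: (batch, 1, height, width)
            theta = self.fc_loc(self.localization(x).flatten(1)).view(-1, 2, 3)
            grid = F.affine_grid(theta, x.size(), align_corners=False)
            return F.grid_sample(x, grid, align_corners=False)

    def adapt_stn_only(model, lr=1e-4):
        """Freeze the pre-trained backbone and fine-tune only the STN parameters.
        Assumes the full SSI model exposes its STN module as model.stn."""
        for p in model.parameters():
            p.requires_grad = False
        for p in model.stn.parameters():
            p.requires_grad = True
        return torch.optim.Adam(model.stn.parameters(), lr=lr)

Initializing the affine parameters to the identity keeps the warped frame equal to the input before adaptation, so the pre-trained backbone sees unchanged data until the STN is fine-tuned on the new speaker or session.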
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation [23.758202121043805]
We propose a novel network to unify speech enhancement and separation with gradient modulation to improve noise-robustness.
Experimental results show that our approach achieves the state-of-the-art on large-scale Libri2Mix- and Libri3Mix-noisy datasets.
arXiv Detail & Related papers (2023-02-22T03:54:50Z)
- Speech Enhancement for Virtual Meetings on Cellular Networks [1.487576938041254]
We study speech enhancement using deep learning (DL) for virtual meetings on cellular devices.
We collect a transmitted DNS (t-DNS) dataset using Zoom Meetings over T-Mobile network.
The goal of this project is to enhance the speech transmitted over the cellular networks using deep learning models.
arXiv Detail & Related papers (2023-02-02T04:35:48Z)
- BayesSpeech: A Bayesian Transformer Network for Automatic Speech Recognition [0.0]
Recent developments using end-to-end deep learning models have been shown to achieve performance near or better than state-of-the-art recurrent neural networks (RNNs) on automatic speech recognition tasks.
We show how the introduction of variance in the weights leads to faster training time and near state-of-the-art performance on LibriSpeech-960.
arXiv Detail & Related papers (2023-01-16T16:19:04Z)
- CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models [62.60723685118747]
Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data.
We propose an efficient tuning method designed specifically for SSL speech models, applying CNN adapters at the feature extractor.
We empirically found that adding CNN to the feature extractor can help the adaptation on emotion and speaker tasks.
arXiv Detail & Related papers (2022-12-01T08:50:12Z)
- Dynamic Slimmable Denoising Network [64.77565006158895]
Dynamic slimmable denoising network (DDSNet) is a general method to achieve good denoising quality with less computational complexity.
DDSNet is equipped with a dynamic gate that enables dynamic inference.
Our experiments demonstrate that DDSNet consistently outperforms state-of-the-art individually trained static denoising networks.
arXiv Detail & Related papers (2021-10-17T22:45:33Z)
- AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data [115.38309338462588]
We develop AdaSpeech 2, an adaptive TTS system that only leverages untranscribed speech data for adaptation.
Specifically, we introduce a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction.
In adaptation, we use untranscribed speech data for speech reconstruction and only fine-tune the TTS decoder.
arXiv Detail & Related papers (2021-04-20T01:53:30Z)
- Ultra2Speech -- A Deep Learning Framework for Formant Frequency Estimation and Tracking from Ultrasound Tongue Images [5.606679908174784]
This work addresses the articulatory-to-acoustic mapping problem based on ultrasound (US) tongue images.
We use a novel deep learning architecture, which we call Ultrasound2Formant (U2F) Net, to map US tongue images, captured by an ultrasound probe placed beneath the subject's chin, to formant frequencies.
arXiv Detail & Related papers (2020-06-29T20:42:11Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- Sparse Mixture of Local Experts for Efficient Speech Enhancement [19.645016575334786]
We investigate a deep learning approach for speech denoising through an efficient ensemble of specialist neural networks.
By splitting up the speech denoising task into non-overlapping subproblems, we are able to improve denoising performance while also reducing computational complexity.
Our findings demonstrate that a fine-tuned ensemble network is able to exceed the speech denoising capabilities of a generalist network.
arXiv Detail & Related papers (2020-05-16T23:23:22Z)
- Multi-task self-supervised learning for Robust Speech Recognition [75.11748484288229]
This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments.
We employ an online speech distortion module that contaminates the input signals with a variety of random disturbances.
We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks.
arXiv Detail & Related papers (2020-01-25T00:24:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.