End-to-End Learning of Speech 2D Feature-Trajectory for Prosthetic Hands
- URL: http://arxiv.org/abs/2009.10283v1
- Date: Tue, 22 Sep 2020 02:31:00 GMT
- Title: End-to-End Learning of Speech 2D Feature-Trajectory for Prosthetic Hands
- Authors: Mohsen Jafarzadeh, Yonas Tadesse
- Abstract summary: We propose an end-to-end convolutional neural network (CNN) that maps speech 2D features directly to trajectories for prosthetic hands.
The network is written in Python with the Keras library on a TensorFlow backend.
We optimized the CNN for the NVIDIA Jetson TX2 developer kit.
- Score: 0.48951183832371004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech is one of the most common forms of communication in humans. Speech
commands are an essential part of multimodal control of prosthetic hands. In
the past decades, researchers have used automatic speech recognition systems for
controlling prosthetic hands by using speech commands. Automatic speech
recognition systems learn to map human speech to text; natural language
processing or a look-up table then maps the estimated text to a trajectory.
However, the performance of conventional speech-controlled
prosthetic hands is still unsatisfactory. Recent advancements in
general-purpose graphics processing units (GPGPUs) enable intelligent devices
to run deep neural networks in real-time. Thus, architectures of intelligent
systems have rapidly transformed from the paradigm of composite subsystems
optimization to the paradigm of end-to-end optimization. In this paper, we
propose an end-to-end convolutional neural network (CNN) that maps speech 2D
features directly to trajectories for prosthetic hands. The proposed
convolutional neural network is lightweight, and thus it runs in real time on
an embedded GPGPU. The proposed method can use any type of speech 2D feature
that has local correlations in each dimension such as spectrogram, MFCC, or
PNCC. We omit the speech-to-text step in controlling the prosthetic hand in
this paper. The network is written in Python with the Keras library on a
TensorFlow backend. We optimized the CNN for the NVIDIA Jetson TX2 developer kit.
Our experiment on this CNN demonstrates a root-mean-square error of 0.119 and
a 20 ms running time to produce trajectory outputs from the voice input data.
To achieve a lower error in real time, a similar CNN can be optimized for a
more powerful embedded GPGPU such as the NVIDIA AGX Xavier.
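To make the end-to-end idea concrete, below is a minimal Keras/TensorFlow sketch of a CNN regressor that maps a 2D speech feature (e.g., an MFCC patch) directly to a normalized trajectory vector. The input shape, layer sizes, and number of trajectory outputs are illustrative assumptions, not the architecture reported in the paper.

```python
# A minimal sketch of the end-to-end idea: a small CNN regressor that maps a
# 2D speech feature directly to a trajectory vector. All shapes, layer sizes,
# and the output dimension are illustrative assumptions, not the architecture
# reported in the paper.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N_FRAMES, N_COEFFS = 98, 40   # assumed time x coefficient size of the 2D feature
N_OUTPUTS = 5                 # assumed trajectory values (e.g., one per finger)

model = keras.Sequential([
    keras.Input(shape=(N_FRAMES, N_COEFFS, 1)),    # 2D feature as a 1-channel image
    layers.Conv2D(16, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(N_OUTPUTS, activation="sigmoid"),  # normalized trajectory outputs
])

# Regression objective; the paper reports root-mean-square error, so MSE is a
# natural training loss with RMSE tracked as a metric.
model.compile(optimizer="adam", loss="mse",
              metrics=[keras.metrics.RootMeanSquaredError()])

# Smoke test on random arrays standing in for extracted MFCC/spectrogram data.
x = np.random.rand(8, N_FRAMES, N_COEFFS, 1).astype("float32")
y = np.random.rand(8, N_OUTPUTS).astype("float32")
model.fit(x, y, epochs=1, verbose=0)
print(model.predict(x[:1], verbose=0).shape)        # (1, N_OUTPUTS)
```

Deploying such a model on a Jetson TX2 would additionally involve platform-specific optimization (e.g., reduced precision or TensorRT conversion), which this sketch omits.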
Related papers
- EfficientSpeech: An On-Device Text to Speech Model [15.118059441365343]
State of the art (SOTA) neural text to speech (TTS) models can generate natural-sounding synthetic voices.
This work proposes EfficientSpeech, an efficient neural TTS model that synthesizes speech on an ARM CPU in real time.
arXiv Detail & Related papers (2023-05-23T10:28:41Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The recent text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech [7.476901945542385]
We present an end-to-end text-to-speech (E2E-TTS) model that has a simplified training pipeline and outperforms a cascade of separately learned models.
Our proposed model jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations of ESPNet2-TTS.
arXiv Detail & Related papers (2022-03-31T07:25:11Z)
- DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encodings from the input speech, emulating those defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
arXiv Detail & Related papers (2021-10-13T01:39:57Z)
- On-device neural speech synthesis [3.716815259884143]
Tacotron and WaveRNN have made it possible to construct fully neural-network-based TTS systems.
We present key modeling improvements and optimization strategies that enable deploying these models on GPU servers and on mobile devices.
The proposed system can generate high-quality 24 kHz speech 5x faster than real time on a server and 3x faster than real time on mobile devices.
arXiv Detail & Related papers (2021-09-17T18:31:31Z)
- SpeechBrain: A General-Purpose Speech Toolkit [73.0404642815335]
SpeechBrain is an open-source and all-in-one speech toolkit.
It is designed to facilitate the research and development of neural speech processing technologies.
It achieves competitive or state-of-the-art performance in a wide range of speech benchmarks.
arXiv Detail & Related papers (2021-06-08T18:22:56Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- 3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces [0.0]
Silent speech interfaces (SSI) aim to reconstruct the speech signal from a recording of the articulatory movement, such as an ultrasound video of the tongue.
Deep neural networks are the most successful technology for this task.
One option for this is to apply recurrent neural structures such as the long short-term memory network (LSTM) in combination with 2D convolutional neural networks (CNNs); a minimal sketch of this combination appears after this list.
arXiv Detail & Related papers (2021-04-23T10:56:34Z)
- Applying GPGPU to Recurrent Neural Network Language Model based Fast Network Search in the Real-Time LVCSR [5.0555627833288]
Recurrent Neural Network Language Models (RNNLMs) have started to be used in various fields of speech recognition.
The high computational complexity of RNNLMs has been a hurdle in applying them to real-time Large Vocabulary Continuous Speech Recognition (LVCSR).
arXiv Detail & Related papers (2020-07-23T05:15:14Z)
- An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos [64.91614454412257]
We propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs).
Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN.
arXiv Detail & Related papers (2020-02-12T15:33:59Z)
- Neural Human Video Rendering by Learning Dynamic Textures and Rendering-to-Video Translation [99.64565200170897]
We propose a novel human video synthesis method by explicitly disentangling the learning of time-coherent fine-scale details from the embedding of the human in 2D screen space.
We show several applications of our approach, such as human reenactment and novel view synthesis from monocular video, where we show significant improvement over the state of the art both qualitatively and quantitatively.
arXiv Detail & Related papers (2020-01-14T18:06:27Z)
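The silent-speech-interface entry above mentions combining recurrent structures with 2D CNNs. Below is a minimal, hedged sketch of that combination, written in Keras only for consistency with the main paper's toolkit; every shape and layer size is an illustrative assumption, not taken from that paper.

```python
# A hedged sketch of the LSTM + 2D-CNN combination mentioned in the
# silent-speech-interface entry: a TimeDistributed 2D CNN encodes each
# ultrasound frame, and an LSTM models the frame sequence. All shapes and
# layer sizes are illustrative assumptions, not taken from that paper.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN, H, W = 10, 64, 64   # assumed: 10 ultrasound frames of 64x64 pixels
N_SPEC_BINS = 80             # assumed size of the predicted spectral vector

model = keras.Sequential([
    keras.Input(shape=(SEQ_LEN, H, W, 1)),
    # Per-frame 2D convolutional encoder
    layers.TimeDistributed(layers.Conv2D(16, (3, 3), activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
    layers.TimeDistributed(layers.Flatten()),
    # Temporal model over the encoded frame sequence
    layers.LSTM(128),
    layers.Dense(N_SPEC_BINS),   # regress a spectral frame for vocoder synthesis
])
model.compile(optimizer="adam", loss="mse")

# Smoke test on random arrays standing in for an ultrasound video clip.
frames = np.random.rand(2, SEQ_LEN, H, W, 1).astype("float32")
print(model(frames).shape)       # (2, N_SPEC_BINS)
```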