End-to-End Learning of Speech 2D Feature-Trajectory for Prosthetic Hands
- URL: http://arxiv.org/abs/2009.10283v1
- Date: Tue, 22 Sep 2020 02:31:00 GMT
- Title: End-to-End Learning of Speech 2D Feature-Trajectory for Prosthetic Hands
- Authors: Mohsen Jafarzadeh, Yonas Tadesse
- Abstract summary: We propose an end-to-end convolutional neural network (CNN) that maps speech 2D features directly to trajectories for prosthetic hands.
The network is written in Python with the Keras library on a TensorFlow backend.
We optimized the CNN for the NVIDIA Jetson TX2 developer kit.
- Score: 0.48951183832371004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech is one of the most common forms of communication in humans. Speech
commands are an essential part of multimodal control of prosthetic hands. In
the past decades, researchers have used automatic speech recognition systems for
controlling prosthetic hands by using speech commands. Automatic speech
recognition systems learn to map human speech to text; natural language
processing or a look-up table then maps the estimated text to a trajectory.
However, the performance of conventional speech-controlled
prosthetic hands is still unsatisfactory. Recent advancements in
general-purpose graphics processing units (GPGPUs) enable intelligent devices
to run deep neural networks in real-time. Thus, architectures of intelligent
systems have rapidly transformed from the paradigm of composite subsystems
optimization to the paradigm of end-to-end optimization. In this paper, we
propose an end-to-end convolutional neural network (CNN) that maps speech 2D
features directly to trajectories for prosthetic hands. The proposed
convolutional neural network is lightweight, and thus it runs in real time on
an embedded GPGPU. The proposed method can use any type of speech 2D feature
that has local correlations in each dimension such as spectrogram, MFCC, or
PNCC. We omit the speech-to-text step in controlling the prosthetic hand in
this paper. The network is written in Python with the Keras library on a
TensorFlow backend. We optimized the CNN for the NVIDIA Jetson TX2 developer kit.
Our experiment on this CNN demonstrates a root-mean-square error of 0.119 and
a 20 ms running time to produce trajectory outputs from the voice input data.
To achieve a lower error in real time, a similar CNN can be optimized for a
more powerful embedded GPGPU such as the NVIDIA AGX Xavier.
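To make the end-to-end idea concrete, below is a minimal Keras/TensorFlow sketch of a CNN regressor that maps a 2D speech feature (e.g., an MFCC patch) directly to a normalized trajectory vector. The input shape, layer sizes, and number of trajectory outputs are illustrative assumptions, not the architecture reported in the paper.

```python
# A minimal sketch of the end-to-end idea: a small CNN regressor that maps a
# 2D speech feature directly to a trajectory vector. All shapes, layer sizes,
# and the output dimension are illustrative assumptions, not the architecture
# reported in the paper.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N_FRAMES, N_COEFFS = 98, 40   # assumed time x coefficient size of the 2D feature
N_OUTPUTS = 5                 # assumed trajectory values (e.g., one per finger)

model = keras.Sequential([
    keras.Input(shape=(N_FRAMES, N_COEFFS, 1)),    # 2D feature as a 1-channel image
    layers.Conv2D(16, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(N_OUTPUTS, activation="sigmoid"),  # normalized trajectory outputs
])

# Regression objective; the paper reports root-mean-square error, so MSE is a
# natural training loss with RMSE tracked as a metric.
model.compile(optimizer="adam", loss="mse",
              metrics=[keras.metrics.RootMeanSquaredError()])

# Smoke test on random arrays standing in for extracted MFCC/spectrogram data.
x = np.random.rand(8, N_FRAMES, N_COEFFS, 1).astype("float32")
y = np.random.rand(8, N_OUTPUTS).astype("float32")
model.fit(x, y, epochs=1, verbose=0)
print(model.predict(x[:1], verbose=0).shape)        # (1, N_OUTPUTS)
```

Deploying such a model on a Jetson TX2 would additionally involve platform-specific optimization (e.g., reduced precision or TensorRT conversion), which this sketch omits.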
Related papers
- EfficientSpeech: An On-Device Text to Speech Model [15.118059441365343]
State of the art (SOTA) neural text to speech (TTS) models can generate natural-sounding synthetic voices.
This work proposes EfficientSpeech, an efficient neural TTS model that synthesizes speech on an ARM CPU in real time.
arXiv Detail & Related papers (2023-05-23T10:28:41Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The recent text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech [7.476901945542385]
We present an end-to-end text-to-speech (E2E-TTS) model that has a simplified training pipeline and outperforms a cascade of separately learned models.
Our proposed model jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations of ESPNet2-TTS.
arXiv Detail & Related papers (2022-03-31T07:25:11Z)
- DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encodings from the input speech, emulating those defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
arXiv Detail & Related papers (2021-10-13T01:39:57Z)
- On-device neural speech synthesis [3.716815259884143]
Tacotron and WaveRNN have made it possible to construct fully neural-network-based TTS systems.
We present key modeling improvements and optimization strategies that enable deploying these models on GPU servers and on mobile devices.
The proposed system can generate high-quality 24 kHz speech 5x faster than real time on a server and 3x faster than real time on mobile devices.
arXiv Detail & Related papers (2021-09-17T18:31:31Z)
- SpeechBrain: A General-Purpose Speech Toolkit [73.0404642815335]
SpeechBrain is an open-source and all-in-one speech toolkit.
It is designed to facilitate the research and development of neural speech processing technologies.
It achieves competitive or state-of-the-art performance in a wide range of speech benchmarks.
arXiv Detail & Related papers (2021-06-08T18:22:56Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- 3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces [0.0]
Silent speech interfaces (SSI) aim to reconstruct the speech signal from a recording of the articulatory movement, such as an ultrasound video of the tongue.
Deep neural networks are the most successful technology for this task.
One option for this is to apply recurrent neural structures such as the long short-term memory network (LSTM) in combination with 2D convolutional neural networks (CNNs); a minimal sketch of this combination appears after this list.
arXiv Detail & Related papers (2021-04-23T10:56:34Z)
- Applying GPGPU to Recurrent Neural Network Language Model based Fast Network Search in the Real-Time LVCSR [5.0555627833288]
Recurrent Neural Network Language Models (RNNLMs) have started to be used in various fields of speech recognition.
The high computational complexity of RNNLMs has been a hurdle in applying them to real-time Large Vocabulary Continuous Speech Recognition (LVCSR).
arXiv Detail & Related papers (2020-07-23T05:15:14Z)
- An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos [64.91614454412257]
We propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs).
Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN.
arXiv Detail & Related papers (2020-02-12T15:33:59Z)
- Neural Human Video Rendering by Learning Dynamic Textures and Rendering-to-Video Translation [99.64565200170897]
We propose a novel human video synthesis method by explicitly disentangling the learning of time-coherent fine-scale details from the embedding of the human in 2D screen space.
We show several applications of our approach, such as human reenactment and novel view synthesis from monocular video, where we show significant improvement over the state of the art both qualitatively and quantitatively.
arXiv Detail & Related papers (2020-01-14T18:06:27Z)
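The silent-speech-interface entry above mentions combining recurrent structures with 2D CNNs. Below is a minimal, hedged sketch of that combination, written in Keras only for consistency with the main paper's toolkit; every shape and layer size is an illustrative assumption, not taken from that paper.

```python
# A hedged sketch of the LSTM + 2D-CNN combination mentioned in the
# silent-speech-interface entry: a TimeDistributed 2D CNN encodes each
# ultrasound frame, and an LSTM models the frame sequence. All shapes and
# layer sizes are illustrative assumptions, not taken from that paper.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN, H, W = 10, 64, 64   # assumed: 10 ultrasound frames of 64x64 pixels
N_SPEC_BINS = 80             # assumed size of the predicted spectral vector

model = keras.Sequential([
    keras.Input(shape=(SEQ_LEN, H, W, 1)),
    # Per-frame 2D convolutional encoder
    layers.TimeDistributed(layers.Conv2D(16, (3, 3), activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
    layers.TimeDistributed(layers.Flatten()),
    # Temporal model over the encoded frame sequence
    layers.LSTM(128),
    layers.Dense(N_SPEC_BINS),   # regress a spectral frame for vocoder synthesis
])
model.compile(optimizer="adam", loss="mse")

# Smoke test on random arrays standing in for an ultrasound video clip.
frames = np.random.rand(2, SEQ_LEN, H, W, 1).astype("float32")
print(model(frames).shape)       # (2, N_SPEC_BINS)
```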