SpikeVoice: High-Quality Text-to-Speech Via Efficient Spiking Neural Network
- URL: http://arxiv.org/abs/2408.00788v1
- Date: Wed, 17 Jul 2024 15:22:52 GMT
- Title: SpikeVoice: High-Quality Text-to-Speech Via Efficient Spiking Neural Network
- Authors: Kexin Wang, Jiahong Zhang, Yong Ren, Man Yao, Di Shang, Bo Xu, Guoqi Li
- Abstract summary: Spiking Neural Network (SNN) has demonstrated its effectiveness and efficiency in vision, natural language, and speech understanding tasks.
We design SpikeVoice, which performs high-quality Text-To-Speech (TTS) via SNN, to explore the potential of SNN to "speak".
- Score: 21.487450282438125
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Brain-inspired Spiking Neural Network (SNN) has demonstrated its effectiveness and efficiency in vision, natural language, and speech understanding tasks, indicating its capacity to "see", "listen", and "read". In this paper, we design SpikeVoice, which performs high-quality Text-To-Speech (TTS) via SNN, to explore the potential of SNN to "speak". A major obstacle to using SNN for such generative tasks lies in the demand for models to grasp long-term dependencies. The serial nature of spiking neurons, however, makes information at future spiking time steps invisible, limiting SNN models to capturing sequence dependencies only within the same time step. We term this phenomenon "partial-time dependency". To address this issue, we introduce Spiking Temporal-Sequential Attention (STSA) in SpikeVoice. To the best of our knowledge, SpikeVoice is the first TTS work in the SNN field. We perform experiments using four well-established datasets that cover both Chinese and English languages, encompassing scenarios with both single-speaker and multi-speaker configurations. The results demonstrate that SpikeVoice can achieve results comparable to Artificial Neural Networks (ANN) with only 10.5% of the energy consumption of ANN.
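As a rough illustration of the "temporal-sequential" attention idea, the PyTorch sketch below lets spike features attend jointly over spiking time steps and sequence positions, masking future spiking steps to respect the serial constraint the abstract describes. All names, shapes, and the identity Q/K/V projections are hypothetical; this is not the paper's STSA module.

```python
# Hypothetical sketch: joint attention over spiking time steps and sequence
# positions. Flattening the two axes lets every token attend across both,
# while a mask hides future spiking steps (the "partial-time" constraint).
import torch

def temporal_sequential_attention(spikes: torch.Tensor) -> torch.Tensor:
    """spikes: (B, T, L, D) binary spike tensor -> (B, T, L, D)."""
    B, T, L, D = spikes.shape
    x = spikes.reshape(B, T * L, D)               # merge time and sequence axes
    scores = x @ x.transpose(-2, -1) / D ** 0.5   # identity Q/K/V projections for brevity
    step = torch.arange(T * L, device=spikes.device) // L  # spiking step of each token
    visible = step[None, :] <= step[:, None]      # attend to current and past steps only
    scores = scores.masked_fill(~visible, float("-inf"))
    return (scores.softmax(dim=-1) @ x).reshape(B, T, L, D)

x = (torch.rand(2, 4, 8, 16) > 0.8).float()       # random binary spikes
print(temporal_sequential_attention(x).shape)     # torch.Size([2, 4, 8, 16])
```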
Related papers
- DPSNN: Spiking Neural Network for Low-Latency Streaming Speech Enhancement [3.409728296852651]
Speech enhancement improves communication in noisy environments, affecting areas such as automatic speech recognition, hearing aids, and telecommunications.
Neuromorphic algorithms in the form of spiking neural networks (SNNs) have great potential.
We develop a two-phase time-domain streaming SNN framework -- the Dual-Path Spiking Neural Network (DPSNN).
arXiv Detail & Related papers (2024-08-14T09:08:43Z)
- Spiking Convolutional Neural Networks for Text Classification [15.10637945787922]
Spiking neural networks (SNNs) offer a promising pathway to implement deep neural networks (DNNs) in a more energy-efficient manner.
This work presents a "conversion + fine-tuning" two-step method for training SNNs for text classification and proposes a simple but effective way to encode pre-trained word embeddings as spike trains.
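One common way to realize such an encoding is rate coding: rescale embedding values to [0, 1] and treat them as per-step Bernoulli firing probabilities. The sketch below is illustrative under that assumption; the paper's exact scheme may differ.

```python
# Illustrative rate coding of pre-trained embeddings (assumed scheme):
# rescale values to [0, 1], then sample Bernoulli spikes at each time step.
import torch

def embeddings_to_spike_trains(emb: torch.Tensor, num_steps: int = 8) -> torch.Tensor:
    """emb: (N, dim) real embeddings -> (num_steps, N, dim) binary spike trains."""
    p = (emb - emb.min()) / (emb.max() - emb.min() + 1e-8)   # values as firing probabilities
    return torch.bernoulli(p.expand(num_steps, *emb.shape))  # one sample per time step

emb = torch.randn(5, 300)                  # e.g., five 300-d GloVe-like vectors
spikes = embeddings_to_spike_trains(emb)   # shape (8, 5, 300), entries in {0, 1}
print(spikes.mean().item())                # mean firing rate tracks normalized values
```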
arXiv Detail & Related papers (2024-06-27T14:54:27Z)
- sVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks [51.516451451719654]
Spiking Neural Networks (SNNs) are known to be biologically plausible and power-efficient.
This paper introduces a novel SNN-based Voice Activity Detection model, referred to as sVAD.
It provides effective auditory feature representation through SincNet and 1D convolution, and improves noise robustness with attention mechanisms.
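For context, a SincNet-style filter is the difference of two windowed sinc low-pass kernels with cutoffs f1 < f2, applied by 1-D convolution. The sketch below fixes the cutoffs for clarity, whereas SincNet learns f1 and f2 per filter.

```python
# Illustrative SincNet-style band-pass filter (fixed cutoffs; SincNet learns them).
import numpy as np

def sinc_bandpass(f1: float, f2: float, kernel_size: int = 101, sr: int = 16000) -> np.ndarray:
    t = (np.arange(kernel_size) - kernel_size // 2) / sr      # time axis in seconds
    band = 2 * f2 * np.sinc(2 * f2 * t) - 2 * f1 * np.sinc(2 * f1 * t)
    return band * np.hamming(kernel_size)                     # window reduces spectral leakage

audio = np.random.randn(16000)                                # 1 s of noise at 16 kHz
filtered = np.convolve(audio, sinc_bandpass(300.0, 3400.0), mode="same")
print(filtered.shape)                                         # (16000,), band-limited
```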
arXiv Detail & Related papers (2024-03-09T02:55:44Z)
- LC-TTFS: Towards Lossless Network Conversion for Spiking Neural Networks with TTFS Coding [55.64533786293656]
We show that our algorithm can achieve a near-perfect mapping between the activation values of an ANN and the spike times of an SNN on a number of challenging AI tasks.
The study paves the way for deploying ultra-low-power TTFS-based SNNs on power-constrained edge computing platforms.
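A minimal sketch of TTFS coding under a common convention: larger activations spike earlier, each neuron fires at most once, and a zero activation never fires. This shows only the coding scheme, not the LC-TTFS conversion algorithm.

```python
# Minimal time-to-first-spike (TTFS) coding sketch.
import numpy as np

def ttfs_encode(activations: np.ndarray, num_steps: int = 16) -> np.ndarray:
    """Map activations in [0, 1] to at-most-one-spike trains of shape (num_steps, n)."""
    a = np.clip(activations, 0.0, 1.0)
    spike_times = np.where(a > 0, np.round((1.0 - a) * (num_steps - 1)).astype(int), num_steps)
    train = np.zeros((num_steps, a.size), dtype=np.uint8)
    for i, t in enumerate(spike_times):
        if t < num_steps:                 # time index num_steps means "never fires"
            train[t, i] = 1
    return train

print(ttfs_encode(np.array([1.0, 0.5, 0.0])).T)  # spikes at t=0, t=8, and never
```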
arXiv Detail & Related papers (2023-10-23T14:26:16Z)
- Co-learning synaptic delays, weights and adaptation in spiking neural networks [0.0]
Spiking neural networks (SNNs) distinguish themselves from artificial neural networks (ANNs) because of their inherent temporal processing and spike-based computations.
We show that data processing with spiking neurons can be enhanced by co-learning the connection weights with two other biologically inspired neuronal features.
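A toy illustration of those two extra features, synaptic delays and spike-frequency adaptation, in a single leaky integrate-and-fire neuron. Delays here are fixed integer shifts and all constants are hand-set; the paper co-learns delays and adaptation with the weights via gradients, which requires a differentiable relaxation not shown.

```python
# Toy LIF neuron with per-synapse delays and an adaptive firing threshold.
import numpy as np

rng = np.random.default_rng(0)
T, n_in = 50, 8
weights = rng.normal(0.0, 0.5, n_in)         # connection weights
delays = rng.integers(0, 5, n_in)            # per-synapse delays in time steps
inputs = rng.random((T, n_in)) > 0.8         # random input spike trains

v, theta, out_spikes = 0.0, 1.0, []
for t in range(T):
    # Each synapse sees the input spike that arrived `delays[i]` steps ago.
    delayed = np.array([inputs[t - d, i] if t - d >= 0 else 0.0
                        for i, d in enumerate(delays)])
    v = 0.9 * v + weights @ delayed          # leaky integration
    fired = v > theta
    out_spikes.append(int(fired))
    if fired:
        v = 0.0                              # reset membrane potential
        theta += 0.2                         # adaptation: threshold jumps after a spike
    theta = 1.0 + 0.95 * (theta - 1.0)       # threshold decays back toward baseline
print(sum(out_spikes), "output spikes")
```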
arXiv Detail & Related papers (2023-09-12T09:13:26Z)
- Single Channel Speech Enhancement Using U-Net Spiking Neural Networks [2.436681150766912]
Speech enhancement (SE) is crucial for reliable communication devices or robust speech recognition systems.
We propose a novel approach to SE using a spiking neural network (SNN) based on a U-Net architecture.
SNNs are suitable for processing data with a temporal dimension, such as speech, and are known for their energy-efficient implementation on neuromorphic hardware.
arXiv Detail & Related papers (2023-07-26T19:10:29Z)
- Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization [61.60501633397704]
We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering.
We design task-specific prompts, by either leveraging another large-scale model, or simply manipulating the special tokens in the default prompts.
Experiments show that our proposed prompts improve performance by 10% to 45% on the three zero-shot tasks, and even outperform SotA supervised models on some datasets.
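One simple, concrete form of such prompting with the open-source `openai-whisper` package is its `initial_prompt` argument, which injects text into the decoder context, while the `task` option swaps the task token. The snippet below is a plain usage example under those assumptions ("audio.wav" is a placeholder), not the paper's prompt designs.

```python
# Plain usage example with the open-source `openai-whisper` package.
import whisper

model = whisper.load_model("base")
result = model.transcribe(
    "audio.wav",
    initial_prompt="Vocabulary: SNN, LIF neuron, neuromorphic.",  # domain hint
    language="en",
    task="transcribe",   # "translate" would change the zero-shot task instead
)
print(result["text"])
```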
arXiv Detail & Related papers (2023-05-18T16:32:58Z)
- Uncovering the Representation of Spiking Neural Networks Trained with Surrogate Gradient [11.0542573074431]
Spiking Neural Networks (SNNs) are recognized as a candidate for next-generation neural networks due to their biological plausibility and energy efficiency.
Recently, researchers have demonstrated that SNNs are able to achieve nearly state-of-the-art performance in image recognition tasks using surrogate gradient training.
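The core trick in surrogate-gradient training is a spike function whose forward pass is a hard threshold but whose backward pass substitutes a smooth surrogate derivative. A standard PyTorch rendition, using a fast-sigmoid surrogate (one common choice, not necessarily this paper's), looks like this:

```python
# Standard surrogate-gradient spike function in PyTorch.
import torch

class SurrogateSpike(torch.autograd.Function):
    @staticmethod
    def forward(ctx, membrane: torch.Tensor) -> torch.Tensor:
        ctx.save_for_backward(membrane)
        return (membrane > 0).float()                         # Heaviside step

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor) -> torch.Tensor:
        (membrane,) = ctx.saved_tensors
        surrogate = 1.0 / (1.0 + 10.0 * membrane.abs()) ** 2  # fast-sigmoid derivative
        return grad_output * surrogate

u = torch.randn(4, requires_grad=True)
SurrogateSpike.apply(u).sum().backward()
print(u.grad)   # nonzero gradients despite the non-differentiable spike
```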
arXiv Detail & Related papers (2023-04-25T19:08:29Z)
- A journey in ESN and LSTM visualisations on a language task [77.34726150561087]
We trained ESNs and LSTMs on a Cross-Situational Learning (CSL) task.
The results are of three kinds: performance comparison, internal dynamics analyses and visualization of latent space.
arXiv Detail & Related papers (2020-12-03T08:32:01Z)
- Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning [60.20150317299749]
This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning.
To make full use of the training data, we propose a full data learning method for speech enhancement.
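TDNNs are commonly realized as stacks of dilated 1-D convolutions over spectral frames. The sketch below shows that standard construction; channel widths and context sizes are illustrative, not the paper's configuration.

```python
# Standard realization of a TDNN as stacked dilated 1-D convolutions.
import torch
import torch.nn as nn

tdnn = nn.Sequential(
    nn.Conv1d(40, 256, kernel_size=5, dilation=1),   # local context {-2..+2}
    nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=3, dilation=2),  # context {-2, 0, +2}
    nn.ReLU(),
    nn.Conv1d(256, 40, kernel_size=3, dilation=3),   # wider temporal context
)

frames = torch.randn(1, 40, 200)   # (batch, mel bins, frames)
print(tdnn(frames).shape)          # fewer frames because convolutions are unpadded
```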
arXiv Detail & Related papers (2020-11-11T06:32:37Z)
- Progressive Tandem Learning for Pattern Recognition with Deep Spiking Neural Networks [80.15411508088522]
Spiking neural networks (SNNs) have shown advantages over traditional artificial neural networks (ANNs) for low latency and high computational efficiency.
We propose a novel ANN-to-SNN conversion and layer-wise learning framework for rapid and efficient pattern recognition.
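Many ANN-to-SNN conversion pipelines start from classic data-based normalization: rescale each layer by its maximum calibration activation so that post-conversion firing rates stay within range (threshold assumed 1). A hedged sketch of that step follows; the progressive layer-wise fine-tuning stage of tandem learning is not shown.

```python
# Classic data-based weight/threshold normalization for ANN-to-SNN conversion.
import torch
import torch.nn as nn

@torch.no_grad()
def normalize_for_snn(layers, calibration_input):
    x, prev_scale = calibration_input, 1.0
    for layer in layers:
        x = torch.relu(layer(x))                # activations of the original ANN
        scale = x.max().item()                  # lambda_l: max activation of layer l
        layer.weight.mul_(prev_scale / scale)   # w_l <- w_l * lambda_{l-1} / lambda_l
        if layer.bias is not None:
            layer.bias.div_(scale)              # b_l <- b_l / lambda_l
        prev_scale = scale
    return layers

layers = [nn.Linear(16, 32), nn.Linear(32, 8)]
normalize_for_snn(layers, torch.rand(64, 16))   # random stand-in calibration batch
```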
arXiv Detail & Related papers (2020-07-02T15:38:44Z) - Characterizing Speech Adversarial Examples Using Self-Attention U-Net
Enhancement [102.48582597586233]
We present a U-Net-based attention model, U-Net_At, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)