WaDeNet: Wavelet Decomposition based CNN for Speech Processing
- URL: http://arxiv.org/abs/2011.05594v1
- Date: Wed, 11 Nov 2020 06:43:03 GMT
- Title: WaDeNet: Wavelet Decomposition based CNN for Speech Processing
- Authors: Prithvi Suresh and Abhijith Ragav
- Abstract summary: WaDeNet is an end-to-end model for mobile speech processing.
WaDeNet embeds wavelet decomposition of the speech signal within the architecture.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing speech processing systems consist of different modules, individually
optimized for a specific task such as acoustic modelling or feature extraction.
In addition to not guaranteeing optimality of the overall system, the disjoint nature of
current speech processing systems makes them unsuitable for ubiquitous health
applications. We propose WaDeNet, an end-to-end model for mobile speech
processing. In order to incorporate spectral features, WaDeNet embeds wavelet
decomposition of the speech signal within the architecture. This allows WaDeNet
to learn from spectral features in an end-to-end manner, thus alleviating the
need for feature extraction and successive modules that are currently present
in speech processing systems. WaDeNet outperforms the current state of the art
on datasets that involve speech for mobile health applications, such as
non-invasive emotion recognition, achieving an average accuracy increase of
6.36% over existing state-of-the-art models. Additionally, WaDeNet is
considerably lighter than a simple CNN with a similar architecture.
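As a rough illustration of the idea, the sketch below feeds each wavelet sub-band of a raw waveform into its own small convolutional branch. It is a minimal sketch assuming PyTorch and PyWavelets; the Daubechies-4 wavelet, three decomposition levels, and all layer sizes are illustrative assumptions, not WaDeNet's actual architecture, and the DWT here is a fixed (non-learned) transform.

```python
import numpy as np
import pywt
import torch
import torch.nn as nn

class WaveletCNN(nn.Module):
    """Toy model: multi-level wavelet decomposition feeding parallel 1-D CNNs.

    A sketch of the WaDeNet idea, not the paper's exact architecture: each
    decomposition sub-band gets its own convolutional branch, and the branch
    outputs are concatenated for classification. Note the DWT is computed in
    NumPy and is fixed, so no gradients flow through the decomposition itself.
    """

    def __init__(self, levels=3, wavelet="db4", n_classes=4):
        super().__init__()
        self.levels = levels
        self.wavelet = wavelet
        # One small conv branch per sub-band (1 approximation + `levels` details).
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(1, 8, kernel_size=9, padding=4),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(16),
            )
            for _ in range(levels + 1)
        )
        self.classifier = nn.Linear((levels + 1) * 8 * 16, n_classes)

    def forward(self, waveform):
        # waveform: (batch, samples) raw audio
        coeffs = [
            pywt.wavedec(x.cpu().numpy(), self.wavelet, level=self.levels)
            for x in waveform
        ]
        feats = []
        for band in range(self.levels + 1):
            # Stack the same sub-band across the batch: (batch, 1, band_len).
            band_batch = torch.tensor(
                np.stack([c[band] for c in coeffs]), dtype=torch.float32
            ).unsqueeze(1)
            feats.append(self.branches[band](band_batch).flatten(1))
        return self.classifier(torch.cat(feats, dim=1))

model = WaveletCNN()
logits = model(torch.randn(2, 16000))  # two 1-second clips at 16 kHz
print(logits.shape)  # torch.Size([2, 4])
```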
Related papers
- TDFNet: An Efficient Audio-Visual Speech Separation Model with Top-down Fusion [21.278294846228935]
Top-Down-Fusion Net (TDFNet) is a state-of-the-art (SOTA) model for audio-visual speech separation.
TDFNet achieves an increase of up to 10% across all performance metrics compared with the previous SOTA method, CTCNet.
arXiv Detail & Related papers (2024-01-25T13:47:22Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, yielding average reductions of 9.56% and 8.24% in EER and minDCF, respectively.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
- Leveraging Symmetrical Convolutional Transformer Networks for Speech to Singing Voice Style Transfer [49.01417720472321]
We develop a novel neural network architecture, called SymNet, which models the alignment of the input speech with the target melody.
Experiments are performed on the NUS and NHSS datasets, which consist of parallel speech and singing-voice data.
arXiv Detail & Related papers (2022-08-26T02:54:57Z)
- Speech-enhanced and Noise-aware Networks for Robust Speech Recognition [25.279902171523233]
A noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition.
The two proposed systems achieve word error rate (WER) of 3.90% and 3.55%, respectively, on the Aurora-4 task.
Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves a relative WER reduction of 15.20% and 33.53%, respectively.
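In code, such joint optimization typically reduces to a weighted sum of an enhancement loss and a recognition loss. The sketch below is a generic illustration, not the paper's system: `enhancer`, `recognizer`, and the weight `alpha` are hypothetical placeholders.

```python
import torch

def joint_loss(noisy, clean, targets, enhancer, recognizer, alpha=0.5):
    """Jointly optimize a speech enhancer and a recognizer (generic sketch).

    `enhancer` and `recognizer` are hypothetical nn.Modules; `alpha` trades
    off the enhancement (MSE) and recognition (CTC) objectives.
    """
    enhanced = enhancer(noisy)                      # denoised waveform/features
    enh_loss = torch.nn.functional.mse_loss(enhanced, clean)
    log_probs = recognizer(enhanced)                # (time, batch, vocab)
    asr_loss = torch.nn.functional.ctc_loss(
        log_probs, targets["labels"],
        targets["input_lengths"], targets["label_lengths"],
    )
    return alpha * enh_loss + (1.0 - alpha) * asr_loss
```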
arXiv Detail & Related papers (2022-03-25T15:04:51Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system that utilizes visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
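A generic lottery-ticket-style loop of this kind can be written with torch.nn.utils.prune. The sketch below shows a common iterative magnitude-pruning recipe, not the paper's exact LTH-IF procedure; `train_fn` is a hypothetical fine-tuning callable.

```python
import copy
import torch
import torch.nn.utils.prune as prune

def iterative_magnitude_prune(model, train_fn, rounds=5, amount=0.2):
    """Generic lottery-ticket loop: fine-tune, prune, rewind, repeat.

    `train_fn(model)` is a hypothetical callable that fine-tunes the model
    in place; `amount` is the fraction of remaining weights removed per round.
    """
    initial_state = copy.deepcopy(model.state_dict())  # weights to rewind to
    for _ in range(rounds):
        train_fn(model)
        # Remove the smallest-magnitude weights in each conv/linear layer.
        for module in model.modules():
            if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
                prune.l1_unstructured(module, name="weight", amount=amount)
        # Rewind surviving weights to their initial values (lottery ticket).
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name.endswith("weight_orig"):
                    key = name.replace("weight_orig", "weight")
                    param.copy_(initial_state[key])
    return model
```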
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) approach, Soft Actor-Critic for discrete actions (SAC-d), which generates the exit point, partition point, and compressing bits via soft policy iterations.
With a latency- and accuracy-aware reward design, the computation adapts well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
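A discrete actor-critic policy over these three decisions can be modelled with one categorical head per decision. The sketch below is only a toy policy network: all dimensions are invented, and the full SAC-d training loop (critics, entropy temperature) is omitted.

```python
import torch
import torch.nn as nn

class CoInferencePolicy(nn.Module):
    """Toy discrete policy head in the spirit of SAC-d (all sizes invented).

    Emits one categorical distribution per decision: exit point,
    partition point, and number of compressing bits.
    """

    def __init__(self, state_dim=16, n_exits=4, n_partitions=6, n_bit_levels=5):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.heads = nn.ModuleDict({
            "exit": nn.Linear(64, n_exits),
            "partition": nn.Linear(64, n_partitions),
            "bits": nn.Linear(64, n_bit_levels),
        })

    def forward(self, state):
        h = self.trunk(state)
        # Sample one discrete action per decision; keep log-probs for SAC updates.
        actions, log_probs = {}, {}
        for key, head in self.heads.items():
            dist = torch.distributions.Categorical(logits=head(h))
            actions[key] = dist.sample()
            log_probs[key] = dist.log_prob(actions[key])
        return actions, log_probs

policy = CoInferencePolicy()
acts, logps = policy(torch.randn(1, 16))  # one action + log-prob per decision
```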
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Shennong: a Python toolbox for audio speech features extraction [15.816237141746562]
Shennong is a Python toolbox and command-line utility for speech features extraction.
It implements a wide range of well-established state-of-the-art algorithms, including spectro-temporal filters, pre-trained neural networks, pitch estimators, and speaker normalization methods.
This paper illustrates its use in three applications: a comparison of speech feature performance on a phone discrimination task, an analysis of a Vocal Tract Length Normalization model as a function of the speech duration used for training, and a comparison of pitch estimation algorithms under various noise conditions.
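For a flavour of this kind of feature extraction, the snippet below computes MFCCs and a pitch track using librosa as a stand-in; it does not use Shennong's own API (see the toolbox documentation for that), and "speech.wav" is a placeholder path.

```python
import librosa

# Stand-in example with librosa, not Shennong's API; "speech.wav" is a placeholder.
waveform, sr = librosa.load("speech.wav", sr=16000)

# Spectral features: 13 Mel-frequency cepstral coefficients per frame.
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)

# Pitch estimation with the YIN algorithm over a typical speech F0 range.
pitch_f0 = librosa.yin(waveform, fmin=50, fmax=400, sr=sr)

print(mfcc.shape)  # (13, n_frames)
```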
arXiv Detail & Related papers (2021-12-10T14:08:52Z)
- VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition [60.462770498366524]
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user.
We show that such a model can be quantized as an 8-bit integer model and run in real time.
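Post-training 8-bit quantization of this kind is available off the shelf in PyTorch; the sketch below applies dynamic quantization to a stand-in float model, not to VoiceFilter-Lite itself.

```python
import torch
import torch.nn as nn

# Any trained float model stands in for the separation network here.
model = nn.Sequential(nn.Linear(257, 512), nn.ReLU(), nn.Linear(512, 257))

# Post-training dynamic quantization: weights stored as 8-bit integers,
# activations quantized on the fly. Shrinks linear layers roughly 4x,
# which is what makes on-device real-time inference practical.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
out = quantized(torch.randn(1, 257))
```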
arXiv Detail & Related papers (2020-09-09T14:26:56Z)
- MLNET: An Adaptive Multiple Receptive-field Attention Neural Network for Voice Activity Detection [30.46050153776374]
Voice activity detection (VAD) makes a distinction between speech and non-speech.
Deep neural network (DNN)-based VADs have achieved better performance than conventional signal processing methods.
This paper proposes an adaptive multiple receptive-field attention neural network, called MLNET, to perform the VAD task.
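One common way to realize multiple receptive fields with attention is a bank of parallel dilated convolutions mixed by learned weights. The sketch below is a generic interpretation of that idea, not MLNET itself; the branch count and channel sizes are invented.

```python
import torch
import torch.nn as nn

class MultiReceptiveFieldVAD(nn.Module):
    """Generic sketch: parallel dilated convs mixed by learned attention.

    Not MLNET itself; branch count and channel sizes are invented.
    """

    def __init__(self, n_mels=40, channels=16, dilations=(1, 2, 4, 8)):
        super().__init__()
        # Each branch sees a different temporal receptive field.
        self.branches = nn.ModuleList(
            nn.Conv1d(n_mels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        # One scalar attention weight per branch, normalized with softmax.
        self.attn = nn.Parameter(torch.zeros(len(dilations)))
        self.out = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, feats):
        # feats: (batch, n_mels, frames) log-mel features
        w = torch.softmax(self.attn, dim=0)
        mixed = sum(w[i] * torch.relu(b(feats))
                    for i, b in enumerate(self.branches))
        return torch.sigmoid(self.out(mixed))  # per-frame speech probability

vad = MultiReceptiveFieldVAD()
probs = vad(torch.randn(1, 40, 100))  # (1, 1, 100)
```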
arXiv Detail & Related papers (2020-08-13T02:24:28Z)